Published IRL for URL parsing #4

Closed
sethmlarson opened this issue Jul 9, 2019 · 9 comments
Labels: request for comments (Needs consensus from the team and external stakeholders)

Comments

@sethmlarson
Collaborator

NOTE: This discussion started within a Team discussion before it was pointed out that the public could not see the discussion thread.

Would love some eyes on and thoughts about irl, which is essentially urllib3's URL parser in its own tiny library. I was in the middle of refactoring to remove the rfc3986 dependency from urllib3 due to its slow import time and wanted to make the work available for others (including http3) to use.

@sethmlarson
Collaborator Author

sethmlarson commented Jul 9, 2019

@tomchristie:
Looks like an impressive chunk of work behind that!

If the http3 work was going to use it, there are a couple of different options:

  • Just use it behind the scenes, but wrap it up in http3's preferred API.
  • Make sure the irl library is using the same external API that we'd want to see whenever we expose URL information from http3 in the first place.

Which of those two are you thinking of here?


@tomchristie:
Okay, so I see that this is a completely new PyPI project, rather than something with an existing install base, so I guess we're free to treat the API as a fresh start, which would be ace.

Could we name this project urls perhaps? https://pypi.org/project/urls/

I absolutely have some thoughts on how we might want the API to look throughout, having worked on similar API surface areas in starlette, and then http3 - it'd be lovely if we could end up with one fairly definitive package for URL models that can get reused in all sorts of different higher level libraries. Very keen on pursuing this. 🌟


@njsmith:
I dunno, I think irl is a pretty brilliant name for a library designed to parse real-world URLs :-)

There are a ton of URL packages already though: rfc3986, hyperlink (extracted from twisted), yarl (extracted from aiohttp), and that's just the modern maintained ones I know off the top of my head, there are at least half a dozen older ones. Is there a 2-second summary of what unique niche irl is filling?


@tomchristie:
Personally I would be in favor if it means we can see http3 depend on a working-group-blessed URL lib that nails the API (yarl looks like it gets it pretty much right), and deals comprehensively enough with niggly cases that I just don't need to think about it. As it stands http3 wraps stuff up in its own URL API (over rfc3986), which I'm also okay with, but that doesn't seem optimal if we're trying to take a collective approach.

(An alternative would be to have http3 not expose any parsed URL interface, and only ever expose a raw string URL)


@tomchristie:
Any chance irl could get transferred to a personal repo 'til we've chatted stuff through? 😉


@njsmith:
The biggest downside of yarl for me is that it's written in Cython, so it needs a compiler, system-specific wheels, can't be used by anything in pip's dependency chain, etc.

I've found hyperlink pleasant to work with, though I have no idea how compatible it is with the urllib3 test cases. And I appreciate that its docs are clear about unicode... that's something I've never been clear on with rfc3986 or yarl.

I guess the crucial spec for an HTTP client is https://url.spec.whatwg.org/

Oh heh, and apparently @sethmlarson already implemented it in Python? https://pypi.org/project/whatwg-url/


@tomchristie:

The biggest downside of yarl for me is that it's written in Cython

Okay that's a blocker from my POV, yes.


@sethmlarson:
Alright good morning all, time to answer all your questions:

Any chance irl could get transferred to a personal repo 'til we've chatted stuff through?

Done

Which of those two [use-cases] are you thinking of here?

To be wrapped by http3 and whatever client library uses it, just like we previously did with rfc3986.

Is there a 2-second summary of what unique niche irl is filling?

The URL library that knows exactly which sections are capable of taking anything, lets the user do anything, and still remains secure and RFC 3986 compliant. It also securely handles a bunch of stuff that's totally not RFC 3986 compliant but still happens in the wild.

I'm currently writing a reply for a feature-set comparison but it'll take some time so I'll let you have these replies first. :)


@tomchristie:

To be wrapped by http3 and whatever client library uses it, just like we previously did with rfc3986.

Gotcha! Well that keeps things simple.

There’s a related design question for http3, here... encode/httpx#113


@sethmlarson:
yarl

  • Compiled with Cython
  • Can handle percent-encoding characters in invalid components.
  • Properly handles extra @s in userinfo
  • IPv6 Zone IDs: Doesn't really do anything with them.
  • No validation of host / port
  • Completely fails on schemeless URLs. (www.google.com:80 is parsed as scheme=www.google.com)
  • Doesn't emit RFC 3986 compliant URLs
  • Doesn't try to normalize hosts, left to the user.

hyperlink

  • Pure Python
  • Can handle invalid characters in components but won't percent-encode them (?)
  • Doesn't handle extra @ in userinfo (errors out on parse)
  • Can't handle IPv6 Zone IDs (errors out on parse); seems to be passing the host directly to inet_pton.
  • Will emit invalid targets and targets susceptible to CRLF injection.
  • Doesn't try to normalize hosts, left to the user.
  • Strips off IPv6 brackets in URL.host before you can use them for Host/:authority.

rfc3986

  • Very RFC 3986 compliant. Validates everything
  • Can't handle multiple @ in userinfo
  • Won't accept anything that's not supposed to be in a certain component, so no percent-encoding.
  • Requires a lot of expensive regexes to handle IRIs (RFC 3987) properly, so it takes multiple seconds to import on a Raspberry Pi.
  • Normalizes hosts properly.

whatwg-url

  • Defined by a completely different standard than all other Python URL parsing. Can cause SSRF vulnerabilities between parsers.
  • Gets the multiple @ in userinfo right
  • Doesn't know anything about IPv6 Zone IDs (errors out when parsing)
  • Errors out on schemeless URLs

Plus I will say that because it's designed specifically for HTTP clients, there are a few helpers that are non-trivial to implement yourself. There are another two I'm planning on adding: redirect_location(location), which is a URL join plus some additional checking that it didn't specify an IPv6 Zone ID (need to see what Chrome's and Firefox's behavior on this is), and I'd also like to add a simple .origin() and an Origin() class that gets the null-origin comparison correct.
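
To make that concrete, here is a rough sketch of what such a redirect_location helper could look like, using only the stdlib; the function name follows the comment above, but the body is purely illustrative and not irl's actual implementation:

```python
from urllib.parse import urljoin, urlsplit

def redirect_location(request_url: str, location: str) -> str:
    """Illustrative only: join a Location header against the request URL,
    then refuse targets that smuggle in an IPv6 Zone ID, which has no
    meaning for a remote peer."""
    target = urljoin(request_url, location)
    host = urlsplit(target).hostname or ""
    # RFC 6874 spells the zone delimiter as "%25", so any "%" left in the
    # parsed host signals a zone ID.
    if "%" in host:
        raise ValueError(f"redirect target specifies an IPv6 Zone ID: {host!r}")
    return target
```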


@sethmlarson:
Basically the TLDR is:

  • Schemeless URLs happen in the wild so need to be handled.
  • Proxies require users to use email addresses for usernames within userinfo, so user@example.com:password@... is the construction.
  • Users don't like percent-encoding characters they type into http3.get(), so we have to do that for them (see the sketch after this list).
  • IPv6 Zone IDs are used in real life but shouldn't be a part of anything at the HTTP layer.
  • Hosts can't just be lowercased: IPv6 Zone IDs and http+unix hosts are both case-sensitive.
  • RFC 3986 compliance is recommended to avoid SSRF as the rest of the Python-verse is using RFC 3986.
  • Import performance is an issue on smaller devices if massive or complicated regexes are required.
  • irl has little helpers that a client library should just use to get Host/:authority and target/:path.
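
A minimal sketch of two of those points (the userinfo split and percent-encoding on the user's behalf), shown with the stdlib purely for illustration:

```python
from urllib.parse import quote

# Userinfo and host must be split on the *last* "@", otherwise an
# email-style proxy username ("user@example.com") bleeds into the host.
authority = "user@example.com:secret@proxy.internal:3128"
userinfo, _, host_port = authority.rpartition("@")
assert userinfo == "user@example.com:secret"
assert host_port == "proxy.internal:3128"

# Characters the user types that aren't legal in a URL component get
# percent-encoded for them rather than rejected.
assert quote("/search/naïve query", safe="/") == "/search/na%C3%AFve%20query"
```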

@sethmlarson
Collaborator Author

I'll mention here explicitly that I have no problem at all changing anything about the interface. I basically worked in a frenzy over an hour or so last night so I'm not attached to anything yet. ;)

@njsmith

njsmith commented Jul 11, 2019

I'm not sure I find the SSRF argument convincing... it's definitely true that if you use library A to validate a URL against SSRF, and then library B to actually fetch the URL, and libraries A and B disagree on some parsing details, then that might create holes in your SSRF protection. But:

  • You just gave a big list of places where these different libraries all have slightly different parsing. (And haven't even mentioned the stdlib urlparse function, which has totally wacky semantics but people use it anyway.) If people want to parse things the same way as their HTTP client does, then they need to make sure to use exactly the same parsing library that their HTTP client does.

  • Validating URLs can't provide solid protection against SSRF anyway, because even if I block URLs like http://127.0.0.1 and http://localhost, that doesn't mean an attacker won't trick me into connecting to http://totally-innocent-domain.com, whose DNS entry happens to resolve to 127.0.0.1. I think the right way to protect against SSRF is to provide a way to validate IP addresses just after DNS resolution, which means it needs to be baked into the client itself (see also: Should we have a way to block open_tcp_stream from connecting to certain hosts? python-trio/trio#479)
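
A minimal sketch of that post-resolution check, assuming a simple block-private/loopback/link-local policy and nothing beyond the stdlib:

```python
import ipaddress
import socket

def resolve_and_screen(host: str, port: int):
    """Resolve first, then inspect the addresses: a hostname whose DNS
    record points at 127.0.0.1 can only be caught after resolution."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    for *_, sockaddr in infos:
        addr = ipaddress.ip_address(sockaddr[0])
        if addr.is_private or addr.is_loopback or addr.is_link_local:
            raise ValueError(f"refusing to connect: {host} resolves to {addr}")
    return infos
```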

I'm surprised that WHATWG-URLs can't be schemaless... obviously every browser understands schemaless URLs! Do they expect user-agents to do some extra non-standardized normalization or what?

@sethmlarson
Collaborator Author

Yeah, the parser-difference SSRF argument is one that's just been around, and it was our primary motivator towards making urllib3's URL parser compliant with RFC 3986. This presentation specifically, if anyone hasn't seen it: https://www.blackhat.com/docs/us-17/thursday/us-17-Tsai-A-New-Era-Of-SSRF-Exploiting-URL-Parser-In-Trending-Programming-Languages.pdf

  • I think we're trying to make it so you don't have to do the URL parsing yourself by giving the user an already parsed URL with components instead of a raw string.

  • You're definitely right, this is something we should be aware of on the client level.

  • They must be doing something extra to detect whether a URL starts with a scheme, similar to how we're doing it for urllib3? None of the test cases within urltestdata.json have a "protocol": "" expected value.

sethmlarson added the request for comments label Jul 18, 2019
@njsmith

njsmith commented Jul 27, 2019

I read a bit more about the WHATWG URL spec and I think I get what's going on.

The spec is all about how you handle URLs that appear over the network: in HTML tags like <a> and <img>, or in HTTP headers like Location:, because those are the cases where it's important all the browsers act consistently. In all of these cases, you always have a base URL that the new URL is resolved against. And that means schemaless URLs actually can't happen. <a href="google.com"> is just like <a href="index.html"> – it's a relative reference to a file in the same directory. Any resemblance to a giant internet corporation is purely coincidental.
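
Concretely, plain RFC 3986 reference resolution (what the stdlib's urljoin implements) already treats a schemeless reference as a sibling path whenever a base URL is available:

```python
from urllib.parse import urljoin

base = "https://example.com/docs/index.html"
urljoin(base, "index.html")  # 'https://example.com/docs/index.html'
urljoin(base, "google.com")  # 'https://example.com/docs/google.com'
```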

OTOH, there are cases where you have a URL but there's no base URL to resolve against, like when someone types a URL into a browser location bar, or calls requests.get("..."). In this case, google.com can't be a relative URL; it's just meaningless. So if a particular interface wants to invent some meaning for it, like by inserting a scheme, then it's free to do so.

Anyway, what does this mean for us?

An HTTP client has to follow redirects, and it probably should do this the same way browsers do, which to me seems like a pretty compelling argument for using WHATWG rules there. Resolving links like a browser does is also a good thing to support; it's not quite in scope for an HTTP client itself, but lots of users will want to do this and it'd be nice if they could use the same library.

But we also have to handle URLs where there's no base URL, like in requests.get("..."). So maybe we also want a mode that treats this specially and does some heuristics to handle schemeless URLs?

When you say schemeless URLs appear "in the wild", what do you mean by that? People expect to be able to type them into interactive sessions, or something more than that?

@sethmlarson
Collaborator Author

The big use-case is typing into interactive shells; I agree we can remedy that case via heuristics to allow it at the client level.
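
One possible client-level heuristic for that case, sketched against the stdlib parser (not what irl or http3 actually does):

```python
from urllib.parse import urlsplit

def with_default_scheme(raw: str, default: str = "http") -> str:
    # If what the user typed has no scheme, assume one before parsing;
    # otherwise most RFC 3986 parsers treat "www.google.com" in
    # "www.google.com:80" as the scheme (the yarl failure mode noted earlier).
    return raw if "://" in raw else f"{default}://{raw}"

urlsplit(with_default_scheme("www.google.com:80"))
# SplitResult(scheme='http', netloc='www.google.com:80', path='', query='', fragment='')
```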

I'll have to take a closer look at WHATWG URL because its requirements for IPv6 hosts specifically are different from RFC 3986's: it doesn't support zone IDs. I know that those are used frequently; there have been a few issues on both requests' and urllib3's trackers about them since we modified our URL parser. Zone IDs make a lot less sense to a browser, but to a programmatic HTTP client they're useful in some situations.

@sethmlarson
Collaborator Author

cc @tomchristie would like your thoughts on this?

@njsmith

njsmith commented Jul 30, 2019

doesn't support zone IDs

Sure, supporting zone IDs makes total sense.

My intuition is that for things browsers support, we want to do the same thing that browsers do. But for situations that browsers don't support, we can do whatever makes sense.

Also my sense is that there isn't really any ambiguity about how to support zone IDs, so if browsers do add support later then it'll match what we do anyway, right?

@sethmlarson
Collaborator Author

sethmlarson commented Jul 30, 2019

there isn't really any ambiguity about how to support zone IDs

lol check out RFC 4007 Section 11 and RFC 6874, in which the authors of RFC 4007 forget you can't slap a % character into the host section.

Joking aside, we can handle them properly regardless of representation. :)
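
For reference, the two representations being joked about, for the same link-local address:

```python
# RFC 4007 form: a bare "%" delimiter; fine as an address literal,
# but "%" is not legal unescaped inside a URL host.
rfc4007_literal = "fe80::1%eth0"

# RFC 6874 form: the delimiter itself is percent-encoded as "%25".
rfc6874_url = "http://[fe80::1%25eth0]/"
```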

So should the IRL package become WHATWG-URL + stuff that browsers don't do yet but we need? Or should I add it to whatwg-url even though it breaks the promise a bit?
