"UTF-8 percent encode c using the path percent-e..." #296

Hixie · 2017-04-21T17:00:17Z

https://url.spec.whatwg.org/commit-snapshots/488c459d9e4245a3f6bf087e7dcd2c7e91487ac5/#url-parsing

UTF-8 percent encode c using the path percent-encode set, and append the result to buffer.

It's not at all clear when reading the parser how a path segment consisting of "%62[" should end up. If I'm reading it right, it should end up as "%25%36%32%5B", which doesn't seem right.

TimothyGu · 2017-04-21T17:10:35Z

None of %, 6, 2, [ are part of the path percent-encode set, so they will get appended to buffer verbatim.
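For anyone who wants to check this by hand, here is a minimal sketch of the step as quoted. It assumes the path percent-encode set in that snapshot is the C0 control percent-encode set (C0 controls and everything above U+007E) plus space, ", #, <, >, ?, `, {, and }. That membership list is reconstructed from memory, so verify it against the linked snapshot. The function names are made up for illustration.

```python
# Assumed extra members of the path percent-encode set (verify against the snapshot).
PATH_EXTRAS = set(' "#<>?`{}')


def in_path_percent_encode_set(c):
    """Assumed membership test: C0 controls, code points above U+007E, and the extras."""
    cp = ord(c)
    return cp <= 0x1F or cp > 0x7E or c in PATH_EXTRAS


def utf8_percent_encode(c, in_set):
    """Return c unchanged unless it is in the set; otherwise percent-encode its UTF-8 bytes."""
    if not in_set(c):
        return c
    return "".join("%{:02X}".format(b) for b in c.encode("utf-8"))


def encode_segment(segment):
    """Apply the step to every code point of a path segment (hypothetical helper)."""
    return "".join(utf8_percent_encode(c, in_path_percent_encode_set) for c in segment)


print(encode_segment("%62["))
print(encode_segment("a b{"))
```

Under those assumptions the first call leaves "%62[" untouched, while the space and { in the second input are encoded.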

zcorpan · 2017-04-21T17:51:28Z

Would it be clearer if the step

If codePoint is not in percentEncodeSet, then return codePoint.

was taken out of the "UTF-8 percent encode" algorithm? And the places that call UTF-8 percent encode would first check if c is in the relevant encode set?
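A rough sketch of what that split could look like, purely to illustrate the suggestion. The names `utf8_percent_encode_always` and `append_code_point` are hypothetical and do not come from the spec, and the encode set in the usage lines is a toy one.

```python
def utf8_percent_encode_always(code_point):
    """Unconditional variant: always percent-encode the UTF-8 bytes of code_point."""
    return "".join("%{:02X}".format(b) for b in code_point.encode("utf-8"))


def append_code_point(buffer, c, percent_encode_set):
    """Caller-side check: the caller decides whether c needs encoding at all."""
    if c in percent_encode_set:
        buffer.append(utf8_percent_encode_always(c))
    else:
        buffer.append(c)


# Toy encode set for illustration only, not the real path percent-encode set.
toy_set = frozenset("{} ")
buffer = []
for c in "a{b":
    append_code_point(buffer, c, toy_set)
print("".join(buffer))
```

With this shape the encode algorithm always encodes, and the "does nothing" case becomes visible at each call site.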

Hixie · 2017-04-22T00:05:01Z

Oh wow yeah that's really unclear. The "UTF-8 percent encode" algorithm, in normal operation, generally does not encode? Very confusing. :-)

GPHemsley · 2017-06-11T19:59:50Z

The "UTF-8 percent encode" algorithm also appears to be unclear as to whether it's operating on code points or bytes. What's its return type?

To UTF-8 percent encode a codePoint, using a percentEncodeSet, run these steps:

1. If codePoint is not in percentEncodeSet, then return codePoint.

2. Let bytes be the result of running UTF-8 encode on codePoint.

3. Percent encode each byte in bytes, and then return the results concatenated, in the same order.

TimothyGu · 2017-06-12T00:29:56Z

Well, codePoint should be self-evidently a code point. UTF-8 encode converts the code point into a byte sequence, bytes. Percent encode then converts every byte into a scalar value string (in fact a percent-encoded byte, a special type of string). So the return type of UTF-8 percent encode is a string, while it takes in a code point.

GPHemsley · 2017-06-12T04:49:05Z

Ah, OK, I follow now. Although codePoint is just the name of the variable; there is nothing self-evident about its type other than human intuition. It should say "code point codePoint".

Also, is there a reason these steps are not explicitly using another variable to keep track of the output? It seems unnecessarily convoluted to implicitly keep track of "the results concatenated".
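To make the request concrete, here is a sketch of the same three steps with an explicit output variable and type annotations. It is only an illustration of the suggestion, not proposed spec text, and the Python names are my own.

```python
def percent_encode_byte(byte: int) -> str:
    """byte -> string: '%' followed by two uppercase hex digits."""
    return "%{:02X}".format(byte)


def utf8_percent_encode(code_point: str, percent_encode_set) -> str:
    """(code point, code point set) -> string, tracking the output explicitly."""
    # Step 1: code points outside the set are returned as-is.
    if code_point not in percent_encode_set:
        return code_point
    # Step 2: UTF-8 encode the code point into a byte sequence.
    encoded = code_point.encode("utf-8")
    # Step 3: append the percent-encoding of each byte to output, in order.
    output = ""
    for byte in encoded:
        output += percent_encode_byte(byte)
    return output


print(utf8_percent_encode("\u20ac", {"\u20ac"}))
print(utf8_percent_encode("a", {"\u20ac"}))
```

Both branches return a string, which also answers the return-type question: the early return yields a one-code-point string.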

annevk · 2020-05-07T09:38:26Z

I think it's fair to say that some of the names are a bit confusing, but this stems from the concept being known as percent-encoding. Does anyone have suggestions for how to rename these while keeping the type signatures, if we did something here? (More elaborate suggestions are welcome if you're interested in taking into account all callers in URL and HTML.)

percent-encoded byte (a type of string)
percent encode (byte -> string)
percent decode (byte sequence -> byte sequence)
string percent decode (string -> byte sequence)
UTF-8 percent encode ((code point, code point set) -> string)
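For reference, the five signatures above can be sketched as typed functions. This is a rough illustration with my own Python names, assuming percent decode leaves a "%" that is not followed by two hex digits untouched and that string percent decode UTF-8 encodes its input first.

```python
import re

_PERCENT_ENCODED_BYTE = re.compile(r"%[0-9A-Fa-f]{2}")
_HEX_DIGITS = b"0123456789ABCDEFabcdef"


def is_percent_encoded_byte(s: str) -> bool:
    """percent-encoded byte: a string of '%' followed by two hex digits."""
    return _PERCENT_ENCODED_BYTE.fullmatch(s) is not None


def percent_encode(byte: int) -> str:
    """percent encode: byte -> string."""
    return "%{:02X}".format(byte)


def percent_decode(data: bytes) -> bytes:
    """percent decode: byte sequence -> byte sequence."""
    out = bytearray()
    i = 0
    while i < len(data):
        pair = data[i + 1:i + 3]
        if data[i] == 0x25 and len(pair) == 2 and all(b in _HEX_DIGITS for b in pair):
            out.append(int(pair.decode("ascii"), 16))
            i += 3
        else:
            # Not a well-formed escape: copy the byte through unchanged.
            out.append(data[i])
            i += 1
    return bytes(out)


def string_percent_decode(s: str) -> bytes:
    """string percent decode: string -> byte sequence."""
    return percent_decode(s.encode("utf-8"))


def utf8_percent_encode(code_point: str, percent_encode_set) -> str:
    """UTF-8 percent encode: (code point, code point set) -> string."""
    if code_point not in percent_encode_set:
        return code_point
    return "".join(percent_encode(b) for b in code_point.encode("utf-8"))


print(percent_decode(b"%62["))
```

Written this way, the asymmetry is easy to see: only the last operation takes a set and may return its input unchanged.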

annevk · 2020-05-07T09:42:58Z

Oh, I guess OP is mostly about UTF-8 percent encode not always doing something, not about the type signature.

Would "conditionally-UTF-8-percent encode" work?

Helps with #296.

annevk · 2020-05-12T07:25:47Z

#503 has the direction we're taking this. We'll keep the existing name, but we'll add a table to clarify the operations and use more overloading to reduce the high-level number of operations.

Also start using a hyphen for percent-encode and percent-decode consistently and clarify the various operations and how they relate. This helps #369 and closes #296.

domenic added the non-normative label Feb 7, 2018

annevk added the clarification and good first issue labels and removed the non-normative label Apr 26, 2020

annevk removed the good first issue label May 7, 2020

annevk added a commit that referenced this issue May 7, 2020

Editorial: minor tweaks to percent-encoded bytes

648a3c1

Helps with #296.

annevk mentioned this issue May 7, 2020

Editorial: minor tweaks to percent-encoded bytes #500

Merged

annevk added a commit that referenced this issue May 7, 2020

Editorial: minor tweaks to percent-encoded bytes

3133555

Helps with #296.

annevk added a commit that referenced this issue May 8, 2020

Editorial: minor tweaks to percent-encoded bytes

68aad5d

Helps with #296.

annevk mentioned this issue May 9, 2020

Add string UTF-8 percent encode #503

Merged

annevk closed this as completed in #503 May 12, 2020

annevk added a commit that referenced this issue May 12, 2020

Add UTF-8 percent-encode for strings

8e1c9e3

Also start using a hyphen for percent-encode and percent-decode consistently and clarify the various operations and how they relate. This helps #369 and closes #296.
