Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"UTF-8 percent encode c using the path percent-e..." #296

Closed
Hixie opened this issue Apr 21, 2017 · 9 comments · Fixed by #503
Closed

"UTF-8 percent encode c using the path percent-e..." #296

Hixie opened this issue Apr 21, 2017 · 9 comments · Fixed by #503
Labels
clarification Standard could be clearer

Comments

@Hixie
Copy link
Member

Hixie commented Apr 21, 2017

https://url.spec.whatwg.org/commit-snapshots/488c459d9e4245a3f6bf087e7dcd2c7e91487ac5/#url-parsing

UTF-8 percent encode c using the path percent-encode set, and append the result to buffer.

It's not at all clear when reading the parser how a path segment consisting of "%62[" should end up. If I'm reading it right, it should end up as "%25%36%32%5B", which doesn't seem right.

@TimothyGu
Copy link
Member

None of %, 6, 2, [ are part of the path percent-encode set, so they will get appended to buffer verbatim.

@zcorpan
Copy link
Member

zcorpan commented Apr 21, 2017

Would it be clearer if the step

If codePoint is not in percentEncodeSet, then return codePoint.

was taken out of "UTF-8 percent encode" algorithm? And the places that call UTF-8 percent encode first check if c is in the relevant encode set?

@Hixie
Copy link
Member Author

Hixie commented Apr 22, 2017

Oh wow yeah that's really unclear. The "UTF-8 percent encode" algorithm, in normal operation, generally does not encode? Very confusing. :-)

@GPHemsley
Copy link
Member

The "UTF-8 percent encode" algorithm also appears to be unclear as to whether it's operating on code points or bytes. What's its return type?

To UTF-8 percent encode a codePoint, using a percentEncodeSet, run these steps:

  1. If codePoint is not in percentEncodeSet, then return codePoint.
  2. Let bytes be the result of running UTF-8 encode on codePoint.
  3. Percent encode each byte in bytes, and then return the results concatenated, in the same order.

@TimothyGu
Copy link
Member

Well codePoint should be self-evidently a code point. UTF-8 encode converts the code point into a byte sequence bytes. Percent encode then converts every byte into a scalar value string (in fact a percent-encoded byte, a special type of string). So the return type of UTF-8 percent encode is a string, while it takes in a code point.

@GPHemsley
Copy link
Member

Ah, OK, I follow now. Although codePoint is just the name of the variable; there is nothing self-evident about its type other than human intuition. It should say "code point codePoint".

Also, is there a reason these steps are not explicitly using another variable to keep track of the output? It seems unnecessarily convoluted to implicitly keep track of "the results concatenated".

@annevk annevk added clarification Standard could be clearer good first issue Ideal for someone new to a WHATWG standard or software project and removed non-normative labels Apr 26, 2020
@annevk annevk removed the good first issue Ideal for someone new to a WHATWG standard or software project label May 7, 2020
@annevk
Copy link
Member

annevk commented May 7, 2020

I think it's fair that some of the names are confusing a bit but this stems from the concept being known as percent-encoding. Anyone have suggestions for how to rename these but keep the type signatures if we did something here? (More elaborate suggestions welcome if you're interested in taking into account all callers in URL and HTML.)

  • percent-encoded byte (a type of string)
  • percent encode (byte -> string)
  • percent decode (byte sequence -> byte sequence)
  • string percent decode (string -> byte sequence)
  • UTF-8 percent encode ((code point, code point set) -> string)

@annevk
Copy link
Member

annevk commented May 7, 2020

Oh, I guess OP is mostly about UTF-8 percent encode not always doing something, not about the type signature.

Would "conditionally-UTF-8-percent encode" work?

@annevk
Copy link
Member

annevk commented May 12, 2020

#503 has the direction we're taking this. We'll keep the existing name, but we'll add a table to clarify the operations and use more overloading to reduce the high-level number of operations.

annevk added a commit that referenced this issue May 12, 2020
Also start using a hyphen for percent-encode and percent-decode consistently and clarify the various operations and how they relate.

This helps #369 and closes #296.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clarification Standard could be clearer
Development

Successfully merging a pull request may close this issue.

6 participants