Define JavaScript string and scalar value string #73

Merged
merged 5 commits into from
Mar 27, 2017


Member

@annevk annevk commented Mar 17, 2017

And also surrogate code point, code unit, and cast (for strings). Fixes #1.


Member

@domenic domenic left a comment

Woohoo, this is looking great. So glad you were able to tackle this; I think it will make spec authors everywhere happy.

After addressing my comments let's also tag in aphillips.

infra.bs Outdated
<p>A <dfn export>string</dfn> is a sequence of <a>code points</a>. Strings are denoted by double
quotes and monospace font.
<p>A <dfn export>JavaScript string</dfn> is a sequence of unsigned 16-bit integers, also known as
<dfn export lt="code unit">code units</dfn>. A <a>JavaScript string</a> can also be interpreted as
Member

We should maybe mention in a note that in Unicode code unit is an encoding-dependent concept, but for our purposes we're specializing it to mean UTF-16 since that's what's most useful in this ecosystem.

Member Author

Yes, note that it isn't even like UTF-16 due to lone surrogates.

Contributor

It isn't UTF-16. It's what Unicode (TUS Section 2.7) calls a Unicode 16-bit string. It's worth glancing at that section of Unicode.
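
The distinction drawn here can be seen directly in JavaScript. The following is a minimal sketch, not taken from the pull request; the helper name `codeUnits` is hypothetical. It lists the 16-bit code units of two strings, one holding a surrogate pair and one holding a lead surrogate with nothing to pair with. Both are valid JavaScript strings, but only the first is well-formed UTF-16.

```javascript
// Hypothetical helper: return the 16-bit code units of a JavaScript string.
function codeUnits(s) {
  const units = [];
  for (let i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

const paired = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair
const lone = "a\uD83Db";       // lead surrogate with no trail surrogate after it

// Print each string's code units in hexadecimal.
console.log(codeUnits(paired).map((u) => u.toString(16)).join(" ")); // d83d de00
console.log(codeUnits(lone).map((u) => u.toString(16)).join(" "));   // 61 d83d 62
```

The language accepts both strings without complaint, which is why "Unicode 16-bit string" describes a JavaScript string more accurately than "UTF-16".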

@@ -216,8 +217,11 @@ in parentheses. [[!UNICODE]]

Member

Should we mention that these definitions come from Unicode? It seems like we should give them credit for this edifice in some way.

Member Author

We acknowledge them for code point explicitly and everything else builds on top of that. I'll reference them again for code unit though as you suggested.

infra.bs Outdated
JavaScript specification. [[!ECMA-262]]

<p>A <dfn export>scalar value string</dfn> is a sequence of <a>scalar values</a>. These
<a>scalar values</a> can also be addressed as <a>code points</a>.
Member

I'm not sure this last sentence is useful. Since every scalar value is a code point it seems kind of redundant.

infra.bs Outdated
<a>scalar values</a> can also be addressed as <a>code points</a>.

<p class=note>A <a>scalar value string</a> is useful for any kind of I/O or other kind of operation
where <a>UTF-8 encode</a> comes into play.
Member

I think we need to expand this reasoning a bit more, given how confused people are about when to use USVString in Web IDL.

Member Author

What else could we say? I think the confusion is mostly around folks not realizing that UTF-8 encode only handles scalar values.
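
To illustrate the point that UTF-8 encode only handles scalar values, here is a small sketch assuming a runtime that provides the Encoding Standard's `TextEncoder` (browsers and Node.js do). The helper name `utf8Hex` is hypothetical. `TextEncoder` takes its input as a scalar value string, so an unpaired surrogate is replaced with U+FFFD before any bytes are produced.

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: UTF-8 encode a string and show the bytes as hex.
function utf8Hex(s) {
  return Array.from(encoder.encode(s), (b) => b.toString(16).padStart(2, "0")).join(" ");
}

console.log(utf8Hex("\uD83D\uDE00")); // f0 9f 98 80 (U+1F600)
console.log(utf8Hex("\uD83D"));       // ef bf bd (same bytes as U+FFFD)
console.log(utf8Hex("\uFFFD"));       // ef bf bd
```

The last two lines produce identical bytes, so the original lone surrogate cannot be recovered from the encoded output.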

infra.bs Outdated

<p class=example id=example-string-notation>"<code>Hello, world!</code>" is a string.

<p>To <dfn export for=string>cast</dfn> a <a>JavaScript string</a> into a
Member

Is there a less generic word for this procedure? Maybe one people are already using? I guess not since Web IDL has a very long phrase for it.

Maybe "scalarize"?

Member Author

What is wrong with a generic term when coupled with for=? We could also use "convert" if you don't like "cast". The for should probably be "JavaScript string" though I suppose.

Member

I guess as long as the call sites always say "to a scalar value string" it works. Whereas if it was "scalarize" the call sites could do "scalarize s". Both are fine.

infra.bs Outdated
<!-- Obviates need for https://heycam.github.io/webidl/#dfn-obtain-unicode -->

<p class=note><a for=string>Casting</a> a <a>scalar value string</a> into a <a>JavaScript string</a>
happens implicitly as desired as it is a lossless operation.
Member

You can't use "casting", at least not with an <a>, for this direction. You defined it to only go in the other direction.

I'd phrase this as "can be done implicitly, since it's a lossless operation". The double-as is a bit awkward.

Member Author

annevk commented Mar 18, 2017

@aphillips could you please take a look as well? Thanks!

infra.bs Outdated
<p class=note>This is different from how the Unicode specification defines "code unit". In
particular it refers exclusively to an unsigned 16-bit integer as is similar to how "code unit" is
defined for UTF-16. However, therefore it can also refer to a surrogate that is not part of a
surrogate pair, which is not the case for "code unit" as defined for UTF-16. [[UNICODE]]
Contributor

I think this note may be factually incorrect. It is the case that UTF-16 does not allow isolated surrogates. But the definition of code unit allows values in the surrogate range (how could it not?). The UTF-16 encoding form doesn't permit them to be isolated if the text is to be well-formed. But Unicode 16-bit strings are not required to be well-formed (see my comment above).

Member Author

Thanks, I'll study that section.

infra.bs Outdated

<p class=example id=example-string-notation>"<code>Hello, world!</code>" is a string.

<p>To <dfn export for="JavaScript string">convert</dfn> a <a>JavaScript string</a> into a
<a>scalar value string</a>, replace any <a>surrogate code points</a> with U+FFFD.
Contributor

I would probably add the word "unpaired" outside the anchor tag for clarity. I know that this phrasing is correct, but you never quite say that properly paired surrogate code units make a supplementary code point. I think it might be useful to say that this conversion results in a (well-formed) UTF-16 string.

Member Author

That's defined by JavaScript. I guess I could add a note saying what JavaScript does there though.
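
The conversion under review (replace any surrogate that is not part of a pair with U+FFFD) can be sketched in JavaScript as follows. This is an illustration under stated assumptions rather than the specification's algorithm text; the function name `toScalarValueString` is hypothetical. It relies on the string iterator, which yields a surrogate pair as one code point and an unpaired surrogate on its own.

```javascript
// Hypothetical helper: convert a JavaScript string to a scalar value string
// by replacing every unpaired surrogate with U+FFFD.
function toScalarValueString(s) {
  let result = "";
  // for...of walks the string by code point: a surrogate pair arrives as a
  // single two-unit chunk, an unpaired surrogate as a one-unit chunk.
  for (const ch of s) {
    const cp = ch.codePointAt(0);
    const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
    result += isSurrogate ? "\uFFFD" : ch;
  }
  return result;
}

console.log(toScalarValueString("a\uD83Db") === "a\uFFFDb");             // true
console.log(toScalarValueString("\uD83D\uDE00") === "\uD83D\uDE00");     // true
console.log(toScalarValueString("\uDE00\uD83D") === "\uFFFD\uFFFD");     // true
```

The third case has the two surrogates in the wrong order, so neither is paired and both are replaced. The result is always a well-formed UTF-16 string.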

infra.bs Outdated

<p class=note>A <a>scalar value string</a> can always be used as <a>JavaScript string</a> implicitly
since it is a subset. The reverse is only possible if the <a>JavaScript string</a> is known to not
contain <a>surrogate code points</a>.
Contributor

Again, I think surrogate code points is misleading (but not factually incorrect) here.

I guess the name "scalar value string" is still bothering me. I tend to expect a scalar value string to be a sequence of code points--Unicode scalar values--but this is a UTF-16 string. I'm concerned that the implicit conversion here is just a semantic nicety, not something someone would actually implement.

Member Author

No, once you interpret a JavaScript string as containing code points, it's not at all related to UTF-16 anymore. It's just a sequence of code points with paired surrogates replaced.

Member Author

annevk commented Mar 19, 2017

@aphillips I guess maybe you're asking that we explicitly point out that a scalar value string can be represented in UTF-8 (and JavaScript string cannot always) and that any implicit conversion in standards might actually have to be an explicit conversion in implementations? I guess that's probably reasonable to do, since it's a rather complicated topic. I'll see about adding some more notes.

I already fixed the incorrect bits around code units. I didn't realize / forgot Unicode defined Unicode 16-bit strings. Good to know!
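
The claim that a scalar value string can be represented in UTF-8 while a JavaScript string cannot always be can be checked with a round trip. This is a sketch assuming a runtime with `TextEncoder` and `TextDecoder`; the helper name `utf8RoundTrip` is hypothetical.

```javascript
// Hypothetical helper: UTF-8 encode a string, then decode the bytes again.
function utf8RoundTrip(s) {
  const bytes = new TextEncoder().encode(s);
  return new TextDecoder().decode(bytes);
}

const wellFormed = "smile: \uD83D\uDE00"; // contains only scalar values
const illFormed = "broken: \uD83D";       // ends with an unpaired surrogate

console.log(utf8RoundTrip(wellFormed) === wellFormed);       // true
console.log(utf8RoundTrip(illFormed) === illFormed);         // false
console.log(utf8RoundTrip(illFormed) === "broken: \uFFFD");  // true
```

The round trip is lossless only for the string without unpaired surrogates, which is why the direction from JavaScript string to scalar value string needs an explicit conversion step.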

Member Author

annevk commented Mar 21, 2017

@aphillips could you please give this another pass? Would be much appreciated. I'd really like to get this right.


<p class=note>This conversion process converts surrogate pairs into their corresponding
<a>scalar value</a> and maps isolated surrogates to their corresponding <a>code point</a>, leaving
them effectively as-is.
Member

I would be very interested in seeing an example of this conversion in action, that illustrates both of these points.

Member Author

Wouldn't ECMAScript be a better place for that? I suppose I can add something here though.

Member

It probably would, but I expect people to look to Infra for this sort of thing, and I think the fact that we use slightly more rigorous terminology will help make any example here clearer.
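
One way to see both points of the quoted note at once is to interpret a JavaScript string as code points. The sketch below is not the example that ended up in the specification; the helper name `codePoints` is hypothetical. The input holds a surrogate pair followed by an isolated lead surrogate.

```javascript
// Hypothetical helper: interpret a JavaScript string as a list of code points.
// Array.from iterates the string by code point, so a surrogate pair becomes
// one entry and an isolated surrogate stays as its own entry.
function codePoints(s) {
  return Array.from(s, (ch) => ch.codePointAt(0));
}

const input = "\uD83D\uDE00\uD83D"; // pair for U+1F600, then a lone U+D83D

console.log(input.length); // 3 (code units)
console.log(codePoints(input).map((cp) => "U+" + cp.toString(16).toUpperCase()).join(" "));
// U+1F600 U+D83D
```

Three code units become two code points: the pair maps to the scalar value U+1F600, and the isolated surrogate maps to the code point U+D83D, effectively as-is.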

infra.bs Outdated
@@ -216,8 +217,11 @@ in parentheses. [[!UNICODE]]

<p>In certain contexts <a>code points</a> are prefixed with "0x" instead of "U+".

<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not in the range
U+D800 to U+DFFF, inclusive.
<p>A <dfn export>surrogate code point</dfn> is a <a>code point</a> that is in the range U+D800 to
Member

Maybe the term should be "unpaired surrogate" (or "isolated surrogate"?) instead of "surrogate code point"? Not sure I really understand this part though.

It's a bit strange that the word "code point" is included in this term but not in "scalar value".

Member Author

No, it should just be surrogate. You need more context to know whether it's isolated or paired. I guess I could drop "code point". I added it mainly to make it a bit more unique.

Member

Hmm, so then I don't quite understand why "Per definition these are isolated surrogates." later is true. What is special about the surrogates inside a JavaScript string that makes them always isolated?

Member Author

If you talk about code points in a JavaScript string, the code units that represent paired surrogates will have been replaced by a single scalar value.

Member Author

I guess that argues for adding that example you asked for. It's not hard and I'll do it later today.

Member

Oh! So there is an implicit "interpret as containing code points" step. Yeah, I totally missed that.

infra.bs Outdated
<span class=note>Per definition these are isolated surrogates.</span>
<!-- Obviates need for https://heycam.github.io/webidl/#dfn-obtain-unicode -->

<p class=note>A <a>scalar value string</a> can always be used as <a>JavaScript string</a> implicitly
Member

Maybe lead with "In specifications, " since a large part of this note is contrasting specifications and implementations.

infra.bs Outdated
<a>scalar value strings</a>. It is even fairly typical for implementations to have multiple
implementations of just <a>JavaScript strings</a> for performance reasons and reducing memory
usage.)

<p>An <dfn export>ASCII string</dfn> is a <a>string</a> whose <a>code points</a> are all
Member

I'd add a <hr> here

infra.bs Outdated
interpreting the JavaScript string as containing <a>code points</a> will have converted surrogate
pairs into single non-surrogate code points.)

A <a>scalar value string</a> can always be used as <a>JavaScript string</a> implicitly since it is a
Member Author

Missing <p>. I'm surprised Bikeshed didn't complain to me about "surrogate code point" above.

Member

@domenic domenic left a comment

With your and my latest tweaks this is crystal-clear to my eyes.

Member Author

annevk commented Mar 23, 2017

I emailed @aphillips for a final review. I can probably hold off on landing until Monday to give him some more time, and then start working on the other bits that need doing around strings.

Member Author

annevk commented Mar 23, 2017

We should also add him to the acknowledgments section before landing.

@annevk annevk merged commit f1be763 into master Mar 27, 2017
@annevk annevk deleted the annevk/string branch March 27, 2017 07:06
3 participants