Define JavaScript string and scalar value string #73

Merged
merged 5 commits into from
Mar 27, 2017
Changes from 4 commits
55 changes: 51 additions & 4 deletions infra.bs
@@ -26,6 +26,7 @@ Boilerplate: omit conformance, omit feedback-header, omit idl-index
<pre class="anchors">
urlPrefix: https://tc39.github.io/ecma262/; spec: ECMA-262; type: dfn
text: List; url: sec-list-and-record-specification-type
text: The String Type; url: sec-ecmascript-language-types-string-type
</pre>


@@ -252,8 +253,10 @@ in parentheses. [[!UNICODE]]

Member

Should we mention that these definitions come from Unicode? It seems like we should give them credit for this edifice in some way.

Member Author

We acknowledge them for code point explicitly and everything else builds on top of that. I'll reference them again for code unit though as you suggested.

<p>In certain contexts <a>code points</a> are prefixed with "0x" instead of "U+".

<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not in the range
U+D800 to U+DFFF, inclusive.
<p>A <dfn export>surrogate</dfn> is a <a>code point</a> that is in the range U+D800 to U+DFFF,
inclusive.

<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not a <a>surrogate</a>.

<p>An <dfn export>ASCII code point</dfn> is a <a>code point</a> in the range U+0000 to U+007F,
inclusive.
@@ -294,11 +297,55 @@ inclusive.

<h3 id=strings>Strings</h3>

<p>A <dfn export>string</dfn> is a sequence of <a>code points</a>. Strings are denoted by double
quotes and monospace font.
<p>A <dfn export>JavaScript string</dfn> is a sequence of unsigned 16-bit integers, also known as
<dfn export lt="code unit">code units</dfn>.

<p class=note>This is narrower than how the Unicode Standard defines "code unit". Here the term
refers exclusively to the code units the Unicode Standard defines for Unicode 16-bit strings.
[[UNICODE]]

<p>A <a>JavaScript string</a> can also be interpreted as containing <a>code points</a>, per the
conversion defined in <a>The String Type</a> section of the JavaScript specification. [[!ECMA-262]]

<p class=note>This conversion process converts surrogate pairs into their corresponding
<a>scalar value</a> and maps isolated surrogates to their corresponding <a>code point</a>, leaving
them effectively as-is.
Member

I would be very interested in seeing an example of this conversion in action, that illustrates both of these points.

Member Author

Wouldn't ECMAScript be a better place for that? I suppose I can add something here though.

Member

It probably would, but I expect people to look to Infra for this sort of thing, and I think the fact that we use slightly more rigorous terminology will help make any example here clearer.


<p class=example id=example-javascript-string-in-code-points>A <a>JavaScript string</a> consisting
of the <a>code units</a> 0xD83D, 0xDCA9, and 0xD800, when interpreted as containing
<a>code points</a>, would consist of the <a>code points</a> U+1F4A9 and U+D800.
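The interpretation in this example can be sketched in JavaScript. This is an illustration only, not text from the pull request: `codeUnitsToCodePoints` is a hypothetical helper name, and the pairing rule is the usual UTF-16 one (a lead surrogate in 0xD800 to 0xDBFF followed by a trail surrogate in 0xDC00 to 0xDFFF combines into one code point; any other surrogate is passed through unchanged).

```javascript
// Hypothetical helper (not part of Infra or ECMAScript): interpret an array of
// 16-bit code units as a list of code points.
function codeUnitsToCodePoints(codeUnits) {
  const codePoints = [];
  for (let i = 0; i < codeUnits.length; i++) {
    const unit = codeUnits[i];
    const next = codeUnits[i + 1];
    const isLead = unit >= 0xD800 && unit <= 0xDBFF;
    const isTrail = next !== undefined && next >= 0xDC00 && next <= 0xDFFF;
    if (isLead && isTrail) {
      // Surrogate pair: combine both code units into one scalar value.
      codePoints.push(0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00));
      i++; // The trail surrogate has been consumed.
    } else {
      // Non-surrogate code unit or isolated surrogate: kept as-is.
      codePoints.push(unit);
    }
  }
  return codePoints;
}

// The code units from the example above.
const result = codeUnitsToCodePoints([0xD83D, 0xDCA9, 0xD800]);
console.log(result.map((cp) => "U+" + cp.toString(16).toUpperCase()));
// → [ 'U+1F4A9', 'U+D800' ]
```

The built-in string iterator follows the same pairing rule, so `Array.from(String.fromCharCode(0xD83D, 0xDCA9, 0xD800), (c) => c.codePointAt(0))` gives the same two code points.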

<p>A <dfn export>scalar value string</dfn> is a sequence of <a>scalar values</a>.

<p class=note>A <a>scalar value string</a> is useful for any kind of I/O or other kind of operation
where <a>UTF-8 encode</a> comes into play.
Member

I think we need to expand this reasoning a bit more, given how confused people are about when to use USVString in Web IDL.

Member Author

What else could we say? I think the confusion is mostly around folks not realizing that UTF-8 encode only handles scalar values.
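
The point that UTF-8 encode only handles scalar values can be observed with the `TextEncoder` API from the Encoding Standard. The sketch below is an illustration, not part of this pull request; `utf8Bytes` is a hypothetical helper name, and it assumes an environment where `TextEncoder` is a global (current browsers and recent Node.js versions).

```javascript
// Hypothetical helper: UTF-8 encode a JavaScript string and return the bytes
// as a plain array of numbers.
function utf8Bytes(jsString) {
  return Array.from(new TextEncoder().encode(jsString));
}

// A surrogate pair is a single scalar value (U+1F4A9) and encodes as four bytes.
console.log(utf8Bytes("\uD83D\uDCA9")); // → [ 240, 159, 146, 169 ]

// An isolated surrogate has no UTF-8 encoding, so the encoder emits the bytes
// for U+FFFD instead. The original code unit cannot be recovered from the output.
console.log(utf8Bytes("\uD800")); // → [ 239, 191, 189 ]
console.log(utf8Bytes("\uFFFD")); // → [ 239, 191, 189 ]
```

Because the isolated surrogate and U+FFFD produce identical bytes, any API that ends in UTF-8 encode effectively operates on scalar value strings whether or not it says so.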

<!-- It's also useful if you can imagine the subsystem to be implemented in Rust -->

<p><dfn export lt=string>String</dfn> can be used to refer to either a <a>JavaScript string</a> or
<a>scalar value string</a>, when it is clear from the context which is meant or when the distinction
is immaterial. <a>Strings</a> are denoted by double quotes and monospace font.

<p class=example id=example-string-notation>"<code>Hello, world!</code>" is a string.

<p>To <dfn export for="JavaScript string">convert</dfn> a <a>JavaScript string</a> into a
<a>scalar value string</a>, replace any <a>surrogates</a> with U+FFFD.
<!-- Obviates need for https://heycam.github.io/webidl/#dfn-obtain-unicode -->

<p class=note>The replaced surrogates are always isolated surrogates, since the process of
interpreting the JavaScript string as containing <a>code points</a> will have converted surrogate
pairs into single non-surrogate code points.
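
A minimal JavaScript sketch of this conversion follows. It is an illustration only: `toScalarValueString` and `isScalarValueString` are hypothetical names, not defined by Infra or Web IDL. It relies on the built-in string iterator, which walks a string by code point, so surrogate pairs arrive already combined and only isolated surrogates remain to be replaced.

```javascript
// Hypothetical helper: replace every isolated surrogate with U+FFFD.
function toScalarValueString(jsString) {
  let result = "";
  // for...of iterates by code point: a surrogate pair is yielded as one
  // two-code-unit chunk, an isolated surrogate as a one-code-unit chunk.
  for (const chunk of jsString) {
    const codePoint = chunk.codePointAt(0);
    const isSurrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
    result += isSurrogate ? "\uFFFD" : chunk;
  }
  return result;
}

// Hypothetical helper: true when the string contains no isolated surrogates,
// that is, when it can be used as a scalar value string without conversion.
function isScalarValueString(jsString) {
  return toScalarValueString(jsString) === jsString;
}

const input = "\uD83D\uDCA9\uD800"; // U+1F4A9 followed by an isolated U+D800
console.log(toScalarValueString(input) === "\uD83D\uDCA9\uFFFD"); // → true
console.log(isScalarValueString(input)); // → false
console.log(isScalarValueString("Hello, world!")); // → true
```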

<p>A <a>scalar value string</a> can always be used as a <a>JavaScript string</a> implicitly since it
is a subset. The reverse is only possible if the <a>JavaScript string</a> is known to not contain
<a>surrogates</a>; otherwise a <a for="JavaScript string" lt=convert>conversion</a> must be
performed.

<p class=note>An implementation likely has to perform explicit conversion, depending on how it
actually ends up representing <a lt="JavaScript string">JavaScript</a> and
<a>scalar value strings</a>. It is even fairly typical for implementations to have multiple
implementations of just <a>JavaScript strings</a> for performance and memory reasons.

<hr>

<p>An <dfn export>ASCII string</dfn> is a <a>string</a> whose <a>code points</a> are all
Member

I'd add a <hr> here

<a>ASCII code points</a>.
