Define JavaScript string and scalar value string #73
Changes from 4 commits
7fec00b
98d91d6
9323429
8fc8fa0
7d155b7
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,6 +26,7 @@ Boilerplate: omit conformance, omit feedback-header, omit idl-index | |
<pre class="anchors"> | ||
urlPrefix: https://tc39.github.io/ecma262/; spec: ECMA-262; type: dfn | ||
text: List; url: sec-list-and-record-specification-type | ||
text: The String Type; url: sec-ecmascript-language-types-string-type | ||
</pre> | ||
|
||
|
||
|
@@ -252,8 +253,10 @@ in parentheses. [[!UNICODE]] | |
|
||
<p>In certain contexts <a>code points</a> are prefixed with "0x" instead of "U+". | ||
|
||
<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not in the range | ||
U+D800 to U+DFFF, inclusive. | ||
<p>A <dfn export>surrogate</dfn> is a <a>code point</a> that is in the range U+D800 to U+DFFF, | ||
inclusive. | ||
|
||
<p>A <dfn export>scalar value</dfn> is a <a>code point</a> that is not a <a>surrogate</a>. | ||
|
||
<p>An <dfn export>ASCII code point</dfn> is a <a>code point</a> in the range U+0000 to U+007F, | ||
inclusive. | ||
|
@@ -294,11 +297,55 @@ inclusive. | |
|
||
<h3 id=strings>Strings</h3> | ||
|
||
<p>A <dfn export>string</dfn> is a sequence of <a>code points</a>. Strings are denoted by double | ||
quotes and monospace font. | ||
<p>A <dfn export>JavaScript string</dfn> is a sequence of unsigned 16-bit integers, also known as | ||
<dfn export lt="code unit">code units</dfn>. | ||
|
||
<p class=note>This is narrower than how the Unicode Standard defines "code unit": here it refers | |
exclusively to the code units the Unicode Standard defines for Unicode 16-bit strings. [[UNICODE]] | |
|
||
<p>A <a>JavaScript string</a> can also be interpreted as containing <a>code points</a>, per the | ||
conversion defined in <a>The String Type</a> section of the JavaScript specification. [[!ECMA-262]] | ||
|
||
<p class=note>This conversion process converts surrogate pairs into their corresponding | ||
<a>scalar value</a> and maps isolated surrogates to their corresponding <a>code point</a>, leaving | ||
them effectively as-is. | ||
I would be very interested in seeing an example of this conversion in action, that illustrates both of these points.
Wouldn't ECMAScript be a better place for that? I suppose I can add something here though.
It probably would, but I expect people to look to Infra for this sort of thing, and I think the fact that we use slightly more rigorous terminology will help make any example here clearer.
||
|
||
<p class=example id=example-javascript-string-in-code-points>A <a>JavaScript string</a> consisting | ||
of the <a>code units</a> 0xD83D, 0xDCA9, and 0xD800, when interpreted as containing | ||
<a>code points</a>, would consist of the <a>code points</a> U+1F4A9 and U+D800. | ||
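The example above can be reproduced in JavaScript itself. This is a minimal sketch, assuming an engine whose string iterator follows the ECMAScript specification; the helper name `toCodePoints` is hypothetical and not part of Infra.

```javascript
// Hypothetical helper: interpret a JavaScript string as a list of code points.
// The built-in string iterator combines surrogate pairs and yields isolated
// surrogates unchanged, which matches the interpretation described above.
function toCodePoints(jsString) {
  return Array.from(jsString, (ch) => ch.codePointAt(0));
}

// Code units 0xD83D, 0xDCA9 (a surrogate pair) followed by 0xD800 (isolated).
const input = "\uD83D\uDCA9\uD800";
const codePoints = toCodePoints(input);

console.log(input.length); // 3 (code units)
console.log(
  codePoints.map((cp) => "U+" + cp.toString(16).toUpperCase().padStart(4, "0"))
); // [ 'U+1F4A9', 'U+D800' ]
```

The pair collapses into one scalar value, while the trailing 0xD800 stays a surrogate code point.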
|
||
<p>A <dfn export>scalar value string</dfn> is a sequence of <a>scalar values</a>. | ||
|
||
<p class=note>A <a>scalar value string</a> is useful for any kind of I/O or other kind of operation | ||
where <a>UTF-8 encode</a> comes into play. | ||
I think we need to expand this reasoning a bit more, given how confused people are about when to use USVString in Web IDL.
What else could we say? I think the confusion is mostly around folks not realizing that UTF-8 encode only handles scalar values.
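One way to illustrate the point about UTF-8 encode: an isolated surrogate cannot survive encoding. The following is a minimal sketch using the Encoding Standard's `TextEncoder`, assuming a runtime that exposes it as a global (browsers and recent Node.js releases do); the `hex` helper is hypothetical.

```javascript
const encoder = new TextEncoder();

// A surrogate pair represents one scalar value (U+1F4A9) and encodes to four bytes.
const paired = Array.from(encoder.encode("\uD83D\uDCA9"));

// An isolated surrogate is not a scalar value; TextEncoder replaces it with
// U+FFFD before encoding, so the original code unit is lost.
const isolated = Array.from(encoder.encode("\uD800"));

// Hypothetical formatting helper for display only.
const hex = (bytes) => bytes.map((b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hex(paired));   // f0 9f 92 a9
console.log(hex(isolated)); // ef bf bd
```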
||
<!-- It's also useful if you can imagine the subsystem to be implemented in Rust --> | ||
|
||
<p><dfn export lt=string>String</dfn> can be used to refer to either a <a>JavaScript string</a> or | ||
<a>scalar value string</a>, when it is clear from the context which is meant or when the distinction | ||
is immaterial. <a>Strings</a> are denoted by double quotes and monospace font. | ||
|
||
<p class=example id=example-string-notation>"<code>Hello, world!</code>" is a string. | ||
|
||
<p>To <dfn export for="JavaScript string">convert</dfn> a <a>JavaScript string</a> into a | ||
<a>scalar value string</a>, replace any <a>surrogates</a> with U+FFFD. | ||
<!-- Obviates need for https://heycam.github.io/webidl/#dfn-obtain-unicode --> | ||
|
||
<p class=note>The replaced surrogates are always isolated surrogates, since the process of | ||
interpreting the JavaScript string as containing <a>code points</a> will have converted surrogate | ||
pairs into single non-surrogate code points. | |
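A minimal sketch of this conversion in JavaScript, assuming iteration by code point as described above; `toScalarValueString` is a hypothetical name used only for illustration.

```javascript
// Hypothetical helper mirroring the "convert" definition: replace every
// surrogate code point with U+FFFD.
function toScalarValueString(jsString) {
  let result = "";
  // Iterating by code point means surrogate pairs arrive already combined,
  // so any surrogate seen here is necessarily an isolated one.
  for (const ch of jsString) {
    const cp = ch.codePointAt(0);
    const isSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
    result += isSurrogate ? "\uFFFD" : ch;
  }
  return result;
}

// "a", isolated lead surrogate, a surrogate pair, isolated trail surrogate.
const converted = toScalarValueString("a\uD800\uD83D\uDCA9\uDC00");
console.log(converted === "a\uFFFD\uD83D\uDCA9\uFFFD"); // true
```

Engines that implement ECMAScript 2024 also provide `String.prototype.toWellFormed()`, which performs the same replacement.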
|
||
<p>A <a>scalar value string</a> can always be used as a <a>JavaScript string</a> implicitly since it | |
is a subset. The reverse is only possible if the <a>JavaScript string</a> is known to not contain | ||
<a>surrogates</a>; otherwise a <a for="JavaScript string" lt=convert>conversion</a> must be | ||
performed. | ||
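Whether a given JavaScript string can be used as a scalar value string without conversion can be checked as follows. This is a sketch only; `isScalarValueString` is a hypothetical predicate, not something Infra or ECMAScript defines under that name.

```javascript
// Hypothetical predicate: true if the string, interpreted as code points,
// contains no surrogates and so needs no conversion.
function isScalarValueString(jsString) {
  for (const ch of jsString) {
    const cp = ch.codePointAt(0);
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
  }
  return true;
}

console.log(isScalarValueString("Hello, world!")); // true
console.log(isScalarValueString("\uD83D\uDCA9"));  // true (a surrogate pair is one scalar value)
console.log(isScalarValueString("\uD83D"));        // false (isolated surrogate)
```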
|
||
<p class=note>An implementation likely has to perform explicit conversion, depending on how it | ||
actually ends up representing <a lt="JavaScript string">JavaScript</a> and | ||
<a>scalar value strings</a>. It is even fairly typical for implementations to have multiple | ||
implementations of just <a>JavaScript strings</a> for performance and memory reasons. | ||
|
||
<hr> | ||
|
||
<p>An <dfn export>ASCII string</dfn> is a <a>string</a> whose <a>code points</a> are all | ||
I'd add a
||
<a>ASCII code points</a>. | ||
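The ASCII string definition translates directly into a check over code points. A minimal sketch, with `isASCIIString` as a hypothetical helper name:

```javascript
// Hypothetical predicate: true if every code point is in U+0000 to U+007F, inclusive.
function isASCIIString(string) {
  for (const ch of string) {
    if (ch.codePointAt(0) > 0x7F) return false;
  }
  return true;
}

console.log(isASCIIString("Hello, world!")); // true
console.log(isASCIIString("h\u00E9llo"));    // false (U+00E9 is outside the ASCII range)
```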
|
||
|
Should we mention that these definitions come from Unicode? It seems like we should give them credit for this edifice in some way.
We acknowledge them for code point explicitly and everything else builds on top of that. I'll reference them again for code unit though as you suggested.