Rethink strings #1

Closed
annevk opened this issue Nov 2, 2016 · 10 comments

@annevk
Member

annevk commented Nov 2, 2016

I need to study the various dependencies of strings and figure out what we want to do. It seems there are a couple of kinds of strings that probably need to be distinguished and named somehow (see the sketch after this list):

  • JavaScript strings - each code point is in the range U+0000 to U+FFFF
  • scalar value strings - each code point is a scalar value
  • byte strings - each code point is in the range U+0000 to U+00FF
  • ASCII strings - each code point is an ASCII code point
  • strings - each code point is a code point (I don't think we really have this in the platform even though Encoding defines this kind of string; we have a variant of this where valid surrogate pairs are treated as their own code point)
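For illustration only, here is a minimal TypeScript sketch of how some of these kinds could be checked; the helper names (isScalarValueString, isByteString, isAsciiString) are hypothetical and not part of any proposal:

```ts
// Hypothetical helpers sketching the distinctions above. Iterating a
// JavaScript string with for...of yields code points (valid surrogate pairs
// combined, lone surrogates passed through as-is).

function isScalarValueString(s: string): boolean {
  // Every code point must be a scalar value, i.e. not a lone surrogate.
  for (const c of s) {
    const cp = c.codePointAt(0)!;
    if (cp >= 0xd800 && cp <= 0xdfff) return false;
  }
  return true;
}

function isByteString(s: string): boolean {
  // Every code point is in the range U+0000 to U+00FF.
  return [...s].every(c => c.codePointAt(0)! <= 0xff);
}

function isAsciiString(s: string): boolean {
  // Every code point is an ASCII code point (U+0000 to U+007F).
  return [...s].every(c => c.codePointAt(0)! <= 0x7f);
}
```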
@annevk
Member Author

annevk commented Nov 8, 2016

I need some advice here on how we wish to proceed.

HTML says "code unit" is defined by IDL as restricted to being a 16-bit integer.

IDL uses "code unit" for DOMString (16-bit integer), USVString (21-bit integer), and ByteString (8-bit integer).

We tend to use "code point", "code unit", and "character" in roughly the same way, even though that is not correct per Unicode.

We have a special kind of string where we take a JavaScript string and combine surrogate pairs, but leave lone surrogates alone. This is what the platform displays on screen and various JavaScript operations use.
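A small illustration (not from any spec text) of that combining behaviour, using the string iterator:

```ts
// "\u{1F600}" is a single emoji encoded as the surrogate pair D83D DE00;
// "\uD800" is a lone surrogate. Iteration combines the valid pair into one
// code point but leaves the lone surrogate alone.
const s = "a\u{1F600}\uD800b";

console.log(s.length);      // 5 code units: 0061, D83D, DE00, D800, 0062
console.log([...s].length); // 4 code points: U+0061, U+1F600, U+D800, U+0062
```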


My thinking is that we should have "JavaScript string" (DOMString) and indexing upon that goes through "code unit". A "JavaScript string" can be addressed as "string" as well (some magic casting underneath) at which point you address code points and (valid) surrogate pairs represent a single code point.

We also have a "scalar value string" and indexing upon that can go through "code unit", but that ends up meaning the same thing as "code point" (though lone surrogates cannot be found or added). We leave enforcing validity of a "scalar value string" to the users. It's mostly for implementers and clarity. A "scalar value string" can also be addressed as "string" since it's compatible.

We keep the designation "ASCII string", which specifications can use as an optimization hint. (That's why URL uses it, for instance.) Again, addressable as "string".

We don't need "byte string", I think. The idea with ByteString is that it's an input and return value. The IDL algorithm actually operates on a byte sequence.
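To make that last relationship concrete, a hypothetical sketch (the function name is made up) of how a byte string, where every code point is at most U+00FF, maps one-to-one onto a byte sequence:

```ts
// Hypothetical illustration: converting a byte string to the byte sequence
// that algorithms actually operate on.
function byteStringToByteSequence(s: string): Uint8Array {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit > 0xff) throw new TypeError("not a byte string");
    bytes[i] = unit;
  }
  return bytes;
}

byteStringToByteSequence("OK\xFF"); // Uint8Array [ 0x4f, 0x4b, 0xff ]
```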

@domenic
Member

domenic commented Nov 8, 2016

I don't feel terribly qualified in this area.

I think it would be good if we matched Unicode as much as possible. Maybe we can continue abusing a generic term like "character", but we should use "code point" and "code unit" correctly.

Your plan sounds pretty good for the different types of strings. I think we'll want to carefully spell things out in such a way that, most of the time, people can avoid knowing or talking about the difference. I haven't often seen people index into or iterate over units of a string in specs, but maybe some of the parsing algorithms that operate on post-decoding strings do.

@xfq
Contributor

xfq commented Jan 11, 2017

We tend to use "code point", "code unit", and "character" in roughly the same way, even though that is not correct per Unicode.

I agree. The current definition of string is "a sequence of code points", but a code point is a non-negative integer. It can represent a character, but it is not a character.


Quoting from the Character Model spec:

The 'character string' definition SHOULD be used by most specifications.

However, the definition of string in TUS 9.0 Section 3.9 is:

Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

That is, strings are defined as code unit strings, instead of character strings.

Personally, I like the 'character string' definition, that is, a string is viewed as a sequence of characters, each represented by a code point. Although matching Unicode as much as possible SHOULD be a goal, in this case, I think the 'character string' definition is more suitable than the Unicode definition, since it has the highest layer of abstraction (which ensures interoperability). Of course, we can define both 'code unit string' and 'character string', plus other kinds of strings @annevk mentioned.
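As an illustration of those layers (this example is mine, not from Charmod), here is the same character string seen as code points versus as code units of two encoding forms; the code point view is the encoding-independent abstraction:

```ts
// "é" is the single character U+00E9. Its code point sequence does not depend
// on an encoding form, while its code unit sequences do. (Assumes a
// TextEncoder implementation is available, as in browsers and Node.js.)
const s = "é";

console.log([...s].map(c => c.codePointAt(0)));  // [233]: 1 code point
console.log(s.length);                           // 1: UTF-16 code units
console.log(new TextEncoder().encode(s).length); // 2: UTF-8 code units
```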


We don't need "byte string" I think.

Agreed. The Character Model spec says "Specifications SHOULD NOT define a string as a 'byte string'", and their rationale seems fair enough.


See also: https://w3c.github.io/bp-i18n-specdev/#characters

@aphillips
Contributor

@xfq That's not a correct reading of TUS or Charmod.

The TUS definition you cite includes the phrase "... of a particular Unicode encoding form", which means that an encoding (UTF-8, UTF-16, UTF-32) must be defined. However...

Charmod actually defines the term 'character string' itself and that's the definition that should be inferred from the quoted requirement:

Character string: A string viewed as a sequence of characters, each represented by a code point in Unicode [Unicode].

What's missing from the current definition is that 'code point' means 'Unicode Scalar Value'.

@annevk The terms 'character', 'code point', and 'code unit', in my opinion, should follow Unicode and/or Charmod. In your original statement at the top of this issue, for 'JavaScript', 'byte', and 'ASCII' strings, I would use the term 'code unit' where you said 'code point', since 'code point' is essentially a synonym for 'Unicode Scalar Value'. In the case of JS, the encoding is UTF-16. In the case of 'byte string', it's usually UTF-8 (pace Encoding). ASCII string's encoding is pretty clear :-).

Generally speaking, it's usually best to refer to strings as 'character strings', as Charmod recommends, although a lot of the Web platform relies on the DOM and that necessarily involves UTF-16. With the rise of emoji, there are lots of supplementary (surrogate pair in UTF-16) characters in the world, so care is needed to ensure that 'code point' and 'code unit' in a specific encoding are each used precisely.
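A quick illustration (mine, not normative) of why that care matters once supplementary characters show up:

```ts
// A supplementary character occupies two UTF-16 code units, so code-unit
// indexing can split it in half while code-point iteration keeps it whole.
const face = "\u{1F600}";      // one emoji
console.log(face.length);      // 2 (code units)
console.log([...face].length); // 1 (code point / character)
console.log(face.slice(0, 1)); // a lone high surrogate, not a character
```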

@domenic
Member

domenic commented Jan 17, 2017

Thanks so much for weighing in, @aphillips. I hope we can end up with something you're happy with and spread it across as many web specs as possible :).

One thing to note is that in the web specs space I've always seen "code unit", unqualified, as meaning UTF-16 code unit. I.e., the thing that JavaScript deals with. It sure is nice to be able to say that for brevity, and I'd kind of like to be able to keep that, but maybe it is too confusing and we should say "UTF-16 code unit" everywhere?

@aphillips
Contributor

Happy to help. I think this thread (and others like it) help illustrate the need for some precision. Like you, I generally read 'code unit' in a Web spec to mean "UTF-16 code unit" and the problem is deciding whether importing UTF-16 was intentional vs. using 'code point' (USV). As such, 'code unit' has to remain distinct from 'character'.

I don't have a problem making a definition that allows us all to infer the UTF-16 part; it just needs to be a referenced definition. Otherwise folks have a way of getting sloppy and treating 'character', 'code point', and 'code unit' as the same, and getting into trouble when there's an emoji (or such) in the data.

@annevk
Member Author

annevk commented Jan 18, 2017

I think using UTF-16 is a bit of a distraction since the sequences we are dealing with do not have to match UTF-16. They are simply sequences of 16-bit integers.

That is also why I would not want to interchange code point and scalar value, since we can and do have code points that are not scalar values (yay lone surrogates) within the web platform.

I still think we want JavaScript string and scalar value string.

A JavaScript string is as defined in ECMAScript, including how ECMAScript defines extracting code points from it. You cannot use the term scalar value to identify items in JavaScript strings.

A scalar value string is IDL's USVString or @aphillips's "character string". It can be backed by any Unicode encoding, cannot be indexed by code unit (since we'll use that exclusively for JavaScript strings as we have already been doing, rather than what I suggested earlier), and here code point and scalar value can be used interchangeably.

Casting a JavaScript string into a scalar value string should be easy and will cause lone surrogates to turn into U+FFFD. Basically what IDL already defines for USVString (but we'd define that in terms of this instead going forward).
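A minimal sketch of that cast (the helper name is hypothetical; newer engines expose essentially the same operation as String.prototype.toWellFormed()):

```ts
// Iterate the JavaScript string by code point and replace any lone surrogate
// with U+FFFD, which is what IDL's USVString conversion does today.
function toScalarValueString(s: string): string {
  let out = "";
  for (const c of s) {
    const cp = c.codePointAt(0)!;
    out += cp >= 0xd800 && cp <= 0xdfff ? "\uFFFD" : c;
  }
  return out;
}

toScalarValueString("a\uD800b"); // "a\uFFFDb"
```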

The term string can be used to refer to either once it's already established what the type is.

@xfq
Contributor

xfq commented Jan 18, 2017

What's missing from the current definition is that 'code point' means 'Unicode Scalar Value'.

[...]

since 'code point' is essentially a synonym for 'Unicode Scalar Value'

@aphillips Would you please elaborate why?

What about high- and low- surrogate code points?

@aphillips
Contributor

@xfq You're correct: I glossed over the difference and should not have.

@annevk I almost but don't quite agree. While in some ways "UTF-16" doesn't matter, the problem I have is that the term "code points" is confusing compared to the term "code units" when talking about JavaScript strings.

Consider two strings:

X: D800 DBFF D800 DBFF
Y: D800 DC00 D800 DC00

String X encodes 4 code points using 4 code units.
String Y encodes 2 code points using 4 code units.

Technically, String X encodes no (Unicode) characters, while String Y encodes 2. In certain contexts, one can say that String X encodes 4 isolated surrogates or 4 of U+FFFD. I just think that using the term code point here produces surprise where using the term code unit does not.
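Those counts can be checked directly (illustration only):

```ts
const X = "\uD800\uDBFF\uD800\uDBFF"; // four lone surrogates
const Y = "\uD800\uDC00\uD800\uDC00"; // two valid surrogate pairs

console.log(X.length, [...X].length); // 4 code units, 4 code points
console.log(Y.length, [...Y].length); // 4 code units, 2 code points
```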

I'm good with the idea of JavaScript vs. scalar string. JavaScript strings (and their friends in other programming languages, such as Java) are just 16-bit integer arrays. Invalid values such as FFFF and FFFE can be encoded as well as our friend String X. But it's generally helpful to note the relationship to UTF-16 because of the need for surrogate processing.

@annevk
Copy link
Member Author

annevk commented Jan 18, 2017

@aphillips the term code points for JavaScript is relevant when discussing a string like DC00 D800 DC00, which would have two code points and three code units. And as I said, the JavaScript standard already makes that distinction.
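For illustration, the same check for that string:

```ts
const s = "\uDC00\uD800\uDC00";  // lone low surrogate + one valid pair
console.log(s.length);           // 3 code units
console.log([...s].length);      // 2 code points (U+DC00 and U+10000)
```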

annevk added a commit that referenced this issue Mar 17, 2017
And also surrogate code point, code unit, and cast (for strings). Fixes
#1.
annevk added a commit that referenced this issue Mar 23, 2017
And also surrogate code point, code unit, and cast (for strings). Fixes
#1.
@annevk annevk closed this as completed in f1be763 Mar 27, 2017