Rethink strings #1

Closed
annevk opened this issue Nov 2, 2016 · 10 comments

@annevk
Member

annevk commented Nov 2, 2016

I need to study the various dependencies of strings and figure out what we want to do. It seems there are a couple of kinds of strings that probably need to be distinguished and named somehow (see the sketch after this list):

  • JavaScript strings - each code point is in the range U+0000 to U+FFFF
  • scalar value strings - each code point is a scalar value
  • byte strings - each code point is in the range U+0000 to U+00FF
  • ASCII strings - each code point is an ASCII code point
  • strings - each code point is a code point (I don't think we really have this in the platform even though Encoding defines this kind of string; we have a variant of this where valid surrogate pairs are treated as their own code point)
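For illustration only, here is a minimal TypeScript sketch of how some of these kinds could be checked; the helper names (isScalarValueString, isByteString, isAsciiString) are hypothetical and not part of any proposal:

```ts
// Hypothetical helpers sketching the distinctions above. Iterating a
// JavaScript string with for...of yields code points (valid surrogate pairs
// combined, lone surrogates passed through as-is).

function isScalarValueString(s: string): boolean {
  // Every code point must be a scalar value, i.e. not a lone surrogate.
  for (const c of s) {
    const cp = c.codePointAt(0)!;
    if (cp >= 0xd800 && cp <= 0xdfff) return false;
  }
  return true;
}

function isByteString(s: string): boolean {
  // Every code point is in the range U+0000 to U+00FF.
  return [...s].every(c => c.codePointAt(0)! <= 0xff);
}

function isAsciiString(s: string): boolean {
  // Every code point is an ASCII code point (U+0000 to U+007F).
  return [...s].every(c => c.codePointAt(0)! <= 0x7f);
}
```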
@annevk
Member Author

annevk commented Nov 8, 2016

I need some advice here on how we wish to proceed.

HTML says "code unit" is defined by IDL as restricted to being a 16-bit integer.

IDL uses "code unit" for DOMString (16-bit integer), USVString (21-bit integer), and ByteString (8-bit integer).

We tend to use "code point", "code unit", and "character" in roughly the same way, even though that is not correct per Unicode.

We have a special kind of string where we take a JavaScript string and combine surrogate pairs, but leave lone surrogates alone. This is what the platform displays on screen and various JavaScript operations use.
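A small illustration (not from any spec text) of that combining behaviour, using the string iterator:

```ts
// "\u{1F600}" is a single emoji encoded as the surrogate pair D83D DE00;
// "\uD800" is a lone surrogate. Iteration combines the valid pair into one
// code point but leaves the lone surrogate alone.
const s = "a\u{1F600}\uD800b";

console.log(s.length);      // 5 code units: 0061, D83D, DE00, D800, 0062
console.log([...s].length); // 4 code points: U+0061, U+1F600, U+D800, U+0062
```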


My thinking is that we should have "JavaScript string" (DOMString) and indexing upon that goes through "code unit". A "JavaScript string" can be addressed as "string" as well (some magic casting underneath) at which point you address code points and (valid) surrogate pairs represent a single code point.

We also have a "scalar value string" and indexing upon that can go through "code unit", but that ends up meaning the same thing as "code point" (though lone surrogates cannot be found or added). We leave enforcing validity of a "scalar value string" to the users. It's mostly for implementers and clarity. A "scalar value string" can also be addressed as "string" since it's compatible.

We keep the designation "ASCII string", which specifications can use as an optimization hint. (That's why URL uses it, for instance.) Again, addressable as "string".

We don't need "byte string", I think. The idea with ByteString is that it's an input and return value. The IDL algorithm actually operates on a byte sequence.
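To make that last relationship concrete, a hypothetical sketch (the function name is made up) of how a byte string, where every code point is at most U+00FF, maps one-to-one onto a byte sequence:

```ts
// Hypothetical illustration: converting a byte string to the byte sequence
// that algorithms actually operate on.
function byteStringToByteSequence(s: string): Uint8Array {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    if (unit > 0xff) throw new TypeError("not a byte string");
    bytes[i] = unit;
  }
  return bytes;
}

byteStringToByteSequence("OK\xFF"); // Uint8Array [ 0x4f, 0x4b, 0xff ]
```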

@domenic
Member

domenic commented Nov 8, 2016

I don't feel terribly qualified in this area.

I think it would be good if we matched Unicode as much as possible. Maybe we can continue abusing a generic term like "character", but we should use "code point" and "code unit" correctly.

Your plan sounds pretty good for the different types of strings. I think we'll want to carefully spell things out in such a way that, most of the time, people can avoid knowing or talking about the difference. I haven't often seen people index into or iterate over units of a string in specs, but maybe some of the parsing algorithms that operate on post-decoding strings do.

@xfq
Contributor

xfq commented Jan 11, 2017

We tend to use "code point", "code unit", and "character" in roughly the same way, even though that is not correct per Unicode.

I agree. The current definition of string is "a sequence of code points", but a code point is a non-negative integer. It can represent a character, but it is not a character.


Quoting from the Character Model spec:

The 'character string' definition SHOULD be used by most specifications.

However, the definition of string in TUS 9.0 Section 3.9 is:

Unicode string: A code unit sequence containing code units of a particular Unicode encoding form.

That is, strings are defined as code unit strings, instead of character strings.

Personally, I like the 'character string' definition, that is, a string is viewed as a sequence of characters, each represented by a code point. Although matching Unicode as much as possible SHOULD be a goal, in this case, I think the 'character string' definition is more suitable than the Unicode definition, since it has the highest layer of abstraction (which ensures interoperability). Of course, we can define both 'code unit string' and 'character string', plus other kinds of strings @annevk mentioned.
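As an illustration of those layers (this example is mine, not from Charmod), here is the same character string seen as code points versus as code units of two encoding forms; the code point view is the encoding-independent abstraction:

```ts
// "é" is the single character U+00E9. Its code point sequence does not depend
// on an encoding form, while its code unit sequences do. (Assumes a
// TextEncoder implementation is available, as in browsers and Node.js.)
const s = "é";

console.log([...s].map(c => c.codePointAt(0)));  // [233]: 1 code point
console.log(s.length);                           // 1: UTF-16 code units
console.log(new TextEncoder().encode(s).length); // 2: UTF-8 code units
```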


We don't need "byte string" I think.

Agreed. The Character Model spec says "Specifications SHOULD NOT define a string as a 'byte string'", and their rationale seems fair enough.


See also: https://w3c.github.io/bp-i18n-specdev/#characters

@aphillips
Contributor

@xfq That's not a correct reading of TUS or Charmod.

The TUS definition you cite includes the phrase "... of a particular Unicode encoding form", which means that an encoding (UTF-8, UTF-16, UTF-32) must be defined. However...

Charmod actually defines the term 'character string' itself and that's the definition that should be inferred from the quoted requirement:

Character string: A string viewed as a sequence of characters, each represented by a code point in Unicode [Unicode].

What's missing from the current definition is that 'code point' means 'Unicode Scalar Value'.

@annevk The terms 'character', 'code point', and 'code unit', in my opinion, should follow Unicode and/or Charmod. In your original statement at the top of this issue, for 'JavaScript', 'byte', and 'ASCII' strings, I would use the term 'code unit' where you said 'code point', since 'code point' is essentially a synonym for 'Unicode Scalar Value'. In the case of JS, the encoding is UTF-16. In the case of 'byte string', it's usually UTF-8 (pace Encoding). ASCII string's encoding is pretty clear :-).

Generally speaking, it's usually best to refer to strings as 'character strings', as Charmod recommends, although a lot of the Web platform relies on the DOM and that necessarily involves UTF-16. With the rise of emoji, there are lots of supplementary (surrogate pair in UTF-16) characters in the world, so care is needed to ensure that 'code point' and 'code unit' in a specific encoding are each used precisely.
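A quick illustration (mine, not normative) of why that care matters once supplementary characters show up:

```ts
// A supplementary character occupies two UTF-16 code units, so code-unit
// indexing can split it in half while code-point iteration keeps it whole.
const face = "\u{1F600}";      // one emoji
console.log(face.length);      // 2 (code units)
console.log([...face].length); // 1 (code point / character)
console.log(face.slice(0, 1)); // a lone high surrogate, not a character
```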

@domenic
Member

domenic commented Jan 17, 2017

Thanks so much for weighing in, @aphillips. I hope we can end up with something you're happy with and spread it across as many web specs as possible :).

One thing to note is that in the web specs space I've always seen "code unit", unqualified, as meaning UTF-16 code unit. I.e., the thing that JavaScript deals with. It sure is nice to be able to say that for brevity, and I'd kind of like to be able to keep that, but maybe it is too confusing and we should say "UTF-16 code unit" everywhere?

@aphillips
Contributor

Happy to help. I think this thread (and others like it) help illustrate the need for some precision. Like you, I generally read 'code unit' in a Web spec to mean "UTF-16 code unit" and the problem is deciding whether importing UTF-16 was intentional vs. using 'code point' (USV). As such, 'code unit' has to remain distinct from 'character'.

I don't have a problem making a definition that allows us all to infer the UTF-16 part; it just needs to be a referenced definition. Otherwise folks have a way of getting sloppy and treating 'character', 'code point', and 'code unit' as the same, and getting into trouble when there's an emoji (or such) in the data.

@annevk
Member Author

annevk commented Jan 18, 2017

I think using UTF-16 is a bit of a distraction since the sequences we are dealing with do not have to match UTF-16. They are simply sequences of 16-bit integers.

That is also why I would not want to interchange code point and scalar value, since we can and do have code points that are not scalar values (yay lone surrogates) within the web platform.

I still think we want JavaScript string and scalar value string.

A JavaScript string is as defined in ECMAScript, including how ECMAScript defines extracting code points from it. You cannot use the term scalar value to identify items in JavaScript strings.

A scalar value string is IDL's USVString or @aphillips's "character string". It can be backed by any Unicode encoding, cannot be indexed by code unit (since we'll use that exclusively for JavaScript strings as we have already been doing, rather than what I suggested earlier), and here code point and scalar value can be used interchangeably.

Casting a JavaScript string into a scalar value string should be easy and will cause lone surrogates to turn into U+FFFD. Basically what IDL already defines for USVString (but we'd define that in terms of this instead going forward).
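A minimal sketch of that cast (the helper name is hypothetical; newer engines expose essentially the same operation as String.prototype.toWellFormed()):

```ts
// Iterate the JavaScript string by code point and replace any lone surrogate
// with U+FFFD, which is what IDL's USVString conversion does today.
function toScalarValueString(s: string): string {
  let out = "";
  for (const c of s) {
    const cp = c.codePointAt(0)!;
    out += cp >= 0xd800 && cp <= 0xdfff ? "\uFFFD" : c;
  }
  return out;
}

toScalarValueString("a\uD800b"); // "a\uFFFDb"
```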

The term string can be used to refer to either once it's already established what the type is.

@xfq
Contributor

xfq commented Jan 18, 2017

What's missing from the current definition is that 'code point' means 'Unicode Scalar Value'.

[...]

since 'code point' is essentially a synonym for 'Unicode Scalar Value'

@aphillips Would you please elaborate why?

What about high- and low- surrogate code points?

@aphillips
Contributor

@xfq You're correct: I glossed over the difference and should not have.

@annevk I almost but don't quite agree. While in some ways "UTF-16" doesn't matter, the problem I have is that the term "code points" is confusing compared to the term "code units" when talking about JavaScript strings.

Consider two strings:

X: D800 DBFF D800 DBFF
Y: D800 DC00 D800 DC00

String X encodes 4 code points using 4 code units.
String Y encodes 2 code points using 4 code units.

Technically, String X encodes no (Unicode) characters, while String Y encodes 2. In certain contexts, one can say that String X encodes 4 isolated surrogates or 4 of U+FFFD. I just think that using the term code point here produces surprise where using the term code unit does not.
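Those counts can be checked directly (illustration only):

```ts
const X = "\uD800\uDBFF\uD800\uDBFF"; // four lone surrogates
const Y = "\uD800\uDC00\uD800\uDC00"; // two valid surrogate pairs

console.log(X.length, [...X].length); // 4 code units, 4 code points
console.log(Y.length, [...Y].length); // 4 code units, 2 code points
```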

I'm good with the idea of JavaScript vs. scalar string. JavaScript strings (and their friends in other programming languages, such as Java) are just 16-bit integer arrays. Invalid values such as FFFF and FFFE can be encoded as well as our friend String X. But it's generally helpful to note the relationship to UTF-16 because of the need for surrogate processing.

@annevk
Copy link
Member Author

annevk commented Jan 18, 2017

@aphillips the term code points for JavaScript is relevant when discussing a string like DC00 D800 DC00, which would have two code points and three code units. And as I said, the JavaScript standard already makes that distinction.
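For illustration, the same check for that string:

```ts
const s = "\uDC00\uD800\uDC00";  // lone low surrogate + one valid pair
console.log(s.length);           // 3 code units
console.log([...s].length);      // 2 code points (U+DC00 and U+10000)
```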

annevk added a commit that referenced this issue Mar 17, 2017
And also surrogate code point, code unit, and cast (for strings). Fixes
#1.
annevk added a commit that referenced this issue Mar 23, 2017
And also surrogate code point, code unit, and cast (for strings). Fixes
#1.
@annevk annevk closed this as completed in f1be763 Mar 27, 2017