`the SV of RegExpUnicodeEscapeSequence` is used but not defined #1861

bakkot · 2020-02-01T06:27:25Z

In the Early Errors for RegExp literals there are two references to the SV of RegExpUnicodeEscapeSequence. Presumably this is intended to be a reference to 11.8.4.2 Static Semantics: SV, but that section only gives SVs for parts of string literals, which RegExpUnicodeEscapeSequence is not. I assume this language was copied from the similar errors for identifier literals, which have non-regexp UnicodeEscapeSequences (which are parts of strings literals and so do have SVs defined).

As I understand it, the intent of those errors is to make, for example, /(?<\u{1d49c}>.)/u legal, while giving an early error for /(?<\u{1f602}>.)/u. (Those code points are 𝒜 and 😂 respectively; the relevant fact is that the first of these is ID Start, while the second is not.) So an adequate fix would be to use CharacterValue instead, which is defined for parts of RegExp literals rather than for parts of string literals.

There is a slight complication here, which is that unicode RegExp literals allow you to express non-BMP code points as pairs of \u escapes, which identifier literals do not: so /(?<\ud835\udc9c>.)/u matches the grammar. I am pretty sure the intent is that this should be legal as well (which using CharacterValue in place of of SV would accomplish); that's what major engines (other than XS) which have implemented named capture groups do:

eshost -e '"x".match(/(?<\ud835\udc9c>.)/u).groups["𝒜"]'

gives

#### Chakra

SyntaxError: Unexpected quantifier

#### JavaScriptCore
x

#### SpiderMonkey

SyntaxError: invalid regexp group:

#### V8
x

#### XS

SyntaxError: (?<\ud835\udc9c invalid name

cc @littledan @mathiasbynens to confirm that the intended behavior is for /(?<\ud835\udc9c>.)/u to be legal.

Edit: actually, you also have to worry about non-unicode regexen. Safari and Chrome both treat /(?<\ud835\udc9c>.)/ as a syntax error despite allowing /(?<\ud835\udc9c>.)/u.

The text was updated successfully, but these errors were encountered:

mathiasbynens · 2020-02-01T08:50:14Z

The intention was to preserve the errors for (non-RegExp) Identifiers. Since var \uD835\uDC9C is invalid, I’d expect /(?<\uD835\uDC9C>.)/u to be invalid too.

bakkot · 2020-02-01T08:55:04Z

@mathiasbynens

I’d expect /(?<\uD835\uDC9C>.)/u to be invalid too

That runs contrary to the linked

It seems surprising and a little overly complicated to not follow the RegExp syntax for Unicode escapes when in a RegExp.

(because \u{1d49c} and \ud835\udc9c are generally functionally identical in unicode-mode RegExps)

mathiasbynens · 2020-02-01T09:08:06Z

It boils down to https://tc39.es/ecma262/#sec-identifier-names-static-semantics-early-errors, which operates on a single \uXXXX escape sequence at a time. That is, it never treats a pair of escaped surrogates as a single code point. Thus, outside of regular expressions:

var 𐊧; // U+102A7 CARIAN LETTER A2
// → valid
var \u{102A7}; // U+102A7 CARIAN LETTER A2
// → valid
var \uD800\uDEA7; // U+102A7 represented as a surrogate pair
// → invalid

When Dan said:

If we do this splitting out, we need to make sure to maintain the errors for bad identifiers.

…it seems these errors fall under that.

mathiasbynens · 2020-02-01T09:09:53Z

(FYI: I don’t feel strongly on this at all; just explaining my interpretation. I don’t believe the choice we make here really matters in practice.)

mathiasbynens · 2020-02-07T17:54:07Z

Note that to maintain symmetry between MemberExpressions a la groups.bar (where bar must be a valid IdentifierName) and the regexp pattern source itself, we’d have to ban /(?<\uD835\uDC9C>.)/u. Then, we get the following:

'x'.match(/(?<𝒜>.)/u).groups.𝒜;
// → does not throw
'x'.match(/(?<\uD835\uDC9C>.)/u).groups.𝒜;
// → throws
'x'.match(/(?<𝒜>.)/u).groups.\uD835\uDC9C;
// → throws
var \uD835\uDC9C;
// → throws

…which IMHO makes sense.

bakkot · 2020-02-08T00:05:15Z

I agree that would be more consistent, though personally I'm kind of inclined to just go with what Chrome and Safari already implement.

littledan · 2020-03-25T22:27:54Z

Sorry for my delay on this issue! I'm a bit backed up on GitHub notifications. Please reach out to me by email, IRC or Twitter DM if I'm unresponsive here.

Oof, looking back, maybe we should've just kept things simple, as @mathiasbynens and @schuay argued. It's not like we were able to reuse the Identifier grammar.

Well, now that we're here, I really wasn't thinking through whether we'd allow these split surrogate pair escapes, and I don't have much of an opinion on it. The CharacterValue feels like a "natural" choice (so we don't make multiple notions within RegExps of interpreting characters/escapes), and I'm happy to go with the majority opinion.

mathiasbynens · 2020-03-31T22:59:18Z

In general, it seems bad to expose the unfortunate concept of surrogates in more places.

…up names now (#1869) This commit makes legal all of the following: - `/(?<\ud835\udc9c>.)/` - `/(?<\ud835\udc9c>.)/u` - `/(?<\u{1d49c}>.)/` - `/(?<\u{1d49c}>.)/u` - `/(?<𝒜>)/` - `/(?<𝒜>)/u` Fixes #1861

…up names (#1869) This commit makes legal all of the following: - `/(?<\ud835\udc9c>.)/` - `/(?<\ud835\udc9c>.)/u` - `/(?<\u{1d49c}>.)/` - `/(?<\u{1d49c}>.)/u` - `/(?<𝒜>)/` - `/(?<𝒜>)/u` Fixes #1861

…up names (#1869) This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as \u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions. This commit thus makes legal all of the following: - `/(?<\ud835\udc9c>.)/` - `/(?<\ud835\udc9c>.)/u` - `/(?<\u{1d49c}>.)/` - `/(?<\u{1d49c}>.)/u` - `/(?<𝒜>)/` - `/(?<𝒜>)/u` Fixes #1861

bakkot mentioned this issue Feb 1, 2020

Editorial: make callers of SV consistent #1862

Merged

bakkot added the spec bug label Feb 1, 2020

bakkot mentioned this issue Feb 1, 2020

Normative: plug some holes in the coverage of syntax-directed operations #1301

Merged

bakkot mentioned this issue Feb 10, 2020

Normative: Fully specify legal escape sequences in RegExp capture group names #1869

Merged

bakkot changed the title ~~SV(RegExpUnicodeEscapeSequence) is used but not defined~~ the SV of RegExpUnicodeEscapeSequence is used but not defined Feb 21, 2020

ljharb closed this as completed in 249c466 Apr 2, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`the SV of RegExpUnicodeEscapeSequence` is used but not defined #1861

`the SV of RegExpUnicodeEscapeSequence` is used but not defined #1861

bakkot commented Feb 1, 2020 •

edited

Loading

mathiasbynens commented Feb 1, 2020

bakkot commented Feb 1, 2020

mathiasbynens commented Feb 1, 2020

mathiasbynens commented Feb 1, 2020

mathiasbynens commented Feb 7, 2020

bakkot commented Feb 8, 2020

littledan commented Mar 25, 2020 •

edited

Loading

mathiasbynens commented Mar 31, 2020

the SV of RegExpUnicodeEscapeSequence is used but not defined #1861

the SV of RegExpUnicodeEscapeSequence is used but not defined #1861

Comments

bakkot commented Feb 1, 2020 • edited Loading

mathiasbynens commented Feb 1, 2020

bakkot commented Feb 1, 2020

mathiasbynens commented Feb 1, 2020

mathiasbynens commented Feb 1, 2020

mathiasbynens commented Feb 7, 2020

bakkot commented Feb 8, 2020

littledan commented Mar 25, 2020 • edited Loading

mathiasbynens commented Mar 31, 2020

`the SV of RegExpUnicodeEscapeSequence` is used but not defined #1861

`the SV of RegExpUnicodeEscapeSequence` is used but not defined #1861

bakkot commented Feb 1, 2020 •

edited

Loading

littledan commented Mar 25, 2020 •

edited

Loading