Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

the SV of RegExpUnicodeEscapeSequence is used but not defined #1861

Closed
bakkot opened this issue Feb 1, 2020 · 8 comments · Fixed by #1869
Closed

the SV of RegExpUnicodeEscapeSequence is used but not defined #1861

bakkot opened this issue Feb 1, 2020 · 8 comments · Fixed by #1869
Labels

Comments

@bakkot
Copy link
Contributor

bakkot commented Feb 1, 2020

In the Early Errors for RegExp literals there are two references to the SV of RegExpUnicodeEscapeSequence. Presumably this is intended to be a reference to 11.8.4.2 Static Semantics: SV, but that section only gives SVs for parts of string literals, which RegExpUnicodeEscapeSequence is not. I assume this language was copied from the similar errors for identifier literals, which have non-regexp UnicodeEscapeSequences (which are parts of strings literals and so do have SVs defined).

As I understand it, the intent of those errors is to make, for example, /(?<\u{1d49c}>.)/u legal, while giving an early error for /(?<\u{1f602}>.)/u. (Those code points are 𝒜 and 😂 respectively; the relevant fact is that the first of these is ID Start, while the second is not.) So an adequate fix would be to use CharacterValue instead, which is defined for parts of RegExp literals rather than for parts of string literals.

There is a slight complication here, which is that unicode RegExp literals allow you to express non-BMP code points as pairs of \u escapes, which identifier literals do not: so /(?<\ud835\udc9c>.)/u matches the grammar. I am pretty sure the intent is that this should be legal as well (which using CharacterValue in place of of SV would accomplish); that's what major engines (other than XS) which have implemented named capture groups do:

eshost -e '"x".match(/(?<\ud835\udc9c>.)/u).groups["𝒜"]'

gives

#### Chakra

SyntaxError: Unexpected quantifier

#### JavaScriptCore
x

#### SpiderMonkey

SyntaxError: invalid regexp group:

#### V8
x

#### XS

SyntaxError: (?<\ud835\udc9c invalid name

cc @littledan @mathiasbynens to confirm that the intended behavior is for /(?<\ud835\udc9c>.)/u to be legal.

Edit: actually, you also have to worry about non-unicode regexen. Safari and Chrome both treat /(?<\ud835\udc9c>.)/ as a syntax error despite allowing /(?<\ud835\udc9c>.)/u.

@mathiasbynens
Copy link
Member

The intention was to preserve the errors for (non-RegExp) Identifiers. Since var \uD835\uDC9C is invalid, I’d expect /(?<\uD835\uDC9C>.)/u to be invalid too.

@bakkot
Copy link
Contributor Author

bakkot commented Feb 1, 2020

@mathiasbynens

I’d expect /(?<\uD835\uDC9C>.)/u to be invalid too

That runs contrary to the linked

It seems surprising and a little overly complicated to not follow the RegExp syntax for Unicode escapes when in a RegExp.

(because \u{1d49c} and \ud835\udc9c are generally functionally identical in unicode-mode RegExps)

@mathiasbynens
Copy link
Member

It boils down to https://tc39.es/ecma262/#sec-identifier-names-static-semantics-early-errors, which operates on a single \uXXXX escape sequence at a time. That is, it never treats a pair of escaped surrogates as a single code point. Thus, outside of regular expressions:

var 𐊧; // U+102A7 CARIAN LETTER A2
// → valid
var \u{102A7}; // U+102A7 CARIAN LETTER A2
// → valid
var \uD800\uDEA7; // U+102A7 represented as a surrogate pair
// → invalid

When Dan said:

If we do this splitting out, we need to make sure to maintain the errors for bad identifiers.

…it seems these errors fall under that.

@mathiasbynens
Copy link
Member

(FYI: I don’t feel strongly on this at all; just explaining my interpretation. I don’t believe the choice we make here really matters in practice.)

@mathiasbynens
Copy link
Member

Note that to maintain symmetry between MemberExpressions a la groups.bar (where bar must be a valid IdentifierName) and the regexp pattern source itself, we’d have to ban /(?<\uD835\uDC9C>.)/u. Then, we get the following:

'x'.match(/(?<𝒜>.)/u).groups.𝒜;
// → does not throw
'x'.match(/(?<\uD835\uDC9C>.)/u).groups.𝒜;
// → throws
'x'.match(/(?<𝒜>.)/u).groups.\uD835\uDC9C;
// → throws
var \uD835\uDC9C;
// → throws

…which IMHO makes sense.

@bakkot
Copy link
Contributor Author

bakkot commented Feb 8, 2020

I agree that would be more consistent, though personally I'm kind of inclined to just go with what Chrome and Safari already implement.

@bakkot bakkot changed the title SV(RegExpUnicodeEscapeSequence) is used but not defined the SV of RegExpUnicodeEscapeSequence is used but not defined Feb 21, 2020
@littledan
Copy link
Member

littledan commented Mar 25, 2020

Sorry for my delay on this issue! I'm a bit backed up on GitHub notifications. Please reach out to me by email, IRC or Twitter DM if I'm unresponsive here.

Oof, looking back, maybe we should've just kept things simple, as @mathiasbynens and @schuay argued. It's not like we were able to reuse the Identifier grammar.

Well, now that we're here, I really wasn't thinking through whether we'd allow these split surrogate pair escapes, and I don't have much of an opinion on it. The CharacterValue feels like a "natural" choice (so we don't make multiple notions within RegExps of interpreting characters/escapes), and I'm happy to go with the majority opinion.

@mathiasbynens
Copy link
Member

In general, it seems bad to expose the unfortunate concept of surrogates in more places.

ljharb pushed a commit that referenced this issue Apr 2, 2020
…up names now (#1869)

This commit makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
ljharb pushed a commit that referenced this issue Apr 2, 2020
…up names (#1869)

This commit makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
ljharb pushed a commit that referenced this issue Apr 2, 2020
…up names (#1869)

This commit makes the Early Errors for RegExpIdentifierStart and RegExpIdentifierPart fully specified, with the semantics that Unicode escape sequences of the form `\u LeadSurrogate \u TrailSurrogate` as well as \u { CodePoint }` are legal in named capture group names for both Unicode and non-Unicode regular expressions.

This commit thus makes legal all of the following:
 - `/(?<\ud835\udc9c>.)/`
 - `/(?<\ud835\udc9c>.)/u`
 - `/(?<\u{1d49c}>.)/`
 - `/(?<\u{1d49c}>.)/u`
 - `/(?<𝒜>)/`
 - `/(?<𝒜>)/u`

Fixes #1861
@ljharb ljharb closed this as completed in 249c466 Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants