Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support y sticky flag for regular expressions #904

Closed
getify opened this issue Feb 27, 2015 · 43 comments
Closed

Support y sticky flag for regular expressions #904

getify opened this issue Feb 27, 2015 · 43 comments
Labels
outdated A closed issue/PR that is archived due to age. Recommended to make a new issue

Comments

@getify
Copy link

getify commented Feb 27, 2015

I did a quick glance at open and past-closed issues and didn't find this mentioned (sorry if I missed it)... would be great to know the plans for support of:

var re = /o+/y;

AFAICT, the sticky flag y makes the regular expression remember the last match's end and starts from there on subsequent matches (similar to the g flag with exec(..)).

FYI: I find the MDN page for this to be particularly unhelpful in explaining it, as the example cited there is basically the same as a g + exec(..) example.

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

Hm, not sure how this would be implemented. It'd have to be used in conjunction with a polyfill/stdlib like core-js that would shim in the regex behaviour to the regex instance methods and Babel would then need a way to let it know that the regex is sticky. /cc @zloirock

Also yeah, your interpretation of sticky regexes is correct. MDN is a bit vague sometimes and not very user friendly.

@developit
Copy link
Member

XRegExp has an implementation:
http://xregexp.com/flags/
https://github.com/slevithan/xregexp/blob/master/src/xregexp.js#L60
(though it is not a polyfill)

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

@developit Their sticky regex detection is bad and will never correctly work.

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

Could probably just do something like:

var foo = /foo/y;
var _regex;
var foo = _regex = /foo/, (_regex.sticky = true), _regex; // might have to be `_sticky` or something, not sure how it'll play across environments

and then leave it up to core-js to deal with. What do you think @zloirock, worth it? Might be tricky to get it to work consistently across all the different scenarios.

@neVERberleRfellerER
Copy link
Contributor

If efficiency is sacrificiable, we can do some small (mabe not so small) evil:

var _reExec = (function () {
  var _exec = RegExp.prototype.exec;
  return function exec(str) {
    str = str.substr(this.lastIndex || 0);
    var match = _exec.call(this, str);
    if (!match) return match;
    this.lastIndex += str.indexOf(match[0]) + match[0].length;
    return match;
  };
})();

var regex = (function () {
  var re = /(\S+) line\n?/; // remove y from modifiers
  re.exec = _reExec;
  return re;
})();

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

@neVERberleRfellerER You can pass a regex to "".split, "".match etc so that only covers one use-case.

@getify
Copy link
Author

getify commented Feb 27, 2015

FYI: the spec says that split(..) in particular ignores the sticky flag.

@neVERberleRfellerER
Copy link
Contributor

@sebmck Indeed, there is always a catch. And it did not even crossed my mind. I should get a break. However, I do not get how split for example could work with this.
Update split solved. Thank you, @getify

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

@getify Ah great, haven't had a need to read that part yet. Still need to take into consideration match though which makes things kinda tricky 😟

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

@neVERberleRfellerER Neither do I, was just throwing around methods that can accept regexes.

@getify
Copy link
Author

getify commented Feb 27, 2015

I'm not positive if my interpretations here are entirely correct, but I wrote up some details about how I think sticky works:

https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag

Help in vetting that info would be appreciated, as you explore this possible fix for Babel.

@neVERberleRfellerER
Copy link
Contributor

I have just tested this in io.js (according to MDN V8 supports the 'y' flag when turned on by flag --- mindblown --- without comments) and Firefox (supports 'y', but has bugged '^' with it, does not matter here), both have same results:

var re = /foo/y;
var str = 'foofoofoo';

re.exec(str);
[ 'foo', index: 0, input: 'foofoofoo' ]

re.exec(str);
[ 'foo', index: 3, input: 'foofoofoo' ]

re.exec(str);
[ 'foo', index: 6, input: 'foofoofoo' ]

re.exec(str);
null

Neither Firefox nor io.js work with samples on @getify s page. I do not know why, but I am still testing.

Update:
To make the first search work, I had to set lastIndex to 1 before first use. And even then it does not work with string methods. Only with RegExp methods.

var re = /o+./y;
var str = "foot book more";
re.lastIndex = 1;

re.exec(str);
> [ 'oot', index: 1, input: 'foot book more' ]

To make the next search working, i have to set last index again - to the index of next expected match,

Update 2: According to description on MDN this behavior is correct.

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

I don't trust the MDN page at all. I also don't trust V8's implementation.

Also, I am not particularly interested in whether exec(..) works or not, because we already know it works with the g global flag.

What's interesting is whether it works with all the other regex-using methods. The spec says that all of them should funnel through the internal exec algorithm, so they should all work (as I was depicting -- I think). But I don't trust that engines are actually sharing that code.

@neVERberleRfellerER
Copy link
Contributor

@getify
It works with every RegExp method, not only exec. It does not work with string.match. This probably comes from the fact, that after intercepting RegExp methods, it does not call any of them (the only interesting method is RegExps match - this is the only one that should be really called) and works as if nothing have changed in both FF and V8.

@neVERberleRfellerER
Copy link
Contributor

@getify I believe that implementations are correct with the behavior decribed in my comment with two updates.

15. Repeat, while matchSucceeded is false
    ...
    b. Let r be matcher(S, lastIndex).
    c. If r is failure, then
        i.  If sticky is true, then
             1. Let setStatus be Set(R, "lastIndex", 0, true)
             2. ReturnIfAbrupt(setStatus).
             3. Return null.

@zloirock
Copy link
Member

I thought about it. Consider proposals. Please, open issue in core-js repo.

@developit XRegExp uses native in FF, but not adds implementation.

XRegExp('.', 'y')
SyntaxError: invalid regular expression flag y

@sebmck be better at first select a way. Maybe replace it to constructor call, /foo/y -> RegExp('foo', 'y'), be better.

@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

@zloirock Ah yeah, that's a much better way actually.

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

I apologize if I'm dense and missing something, but my reading of the ES6 spec draft does not match (pun intended) your interpretations, if I understand them. Let me observe some things about how I'm reading it:

  1. 21.1.3.11: String.prototype.match(..) looks for, and finding it, delegates to the @@match on the regex.
  2. 21.2.5.6: RegExp.prototype[@@match] first looks for if the expression is global, and if not (21.2.5.6.7), it delegates to the abstract operation RegExpExec.
  3. 21.2.5.2: RegExpExec calls exec(..) on the regular expression, then delegates to RegExpBuiltinExec to process the results.
  4. 21.2.5.2.2: RegExpBuiltinExec of course is our main exec abstract operation, which checks (21.2.5.2.2.8) sticky and, skipping through your quoted step 15 because matchSucceeded gets set to true, proceeds to use sticky (21.2.5.2.2.18), which updates lastIndex. Rinse, repeat.

What am I missing here? I sure read that to be that String.prototype.match funnels into the sticky-respecting exec(..). No?

[Edit]: Just remembered I had, on a different topic entirely, recently asked this question of @allenwb, and he'd confirmed that String#match(..) does in fact delegate to exec(..).

sebmck added a commit that referenced this issue Feb 27, 2015
@sebmck
Copy link
Contributor

sebmck commented Feb 27, 2015

Added the sticky regex desugaring suggested by @zloirock so it can potentially be delegated to core-js etc. ie.

var foo = /bar/y;

turns into

var foo = new RegExp("bar", "y");

@neVERberleRfellerER
Copy link
Contributor

@getify

What am I missing here?

Step 21.2.5.2.2.14 probably? matchSucceeded is set to false and following steps are as I have said.

String.prototype.match funnels into the sticky-respecting exec(..)

As stated in 21.1.3.11 String.prototype.match uses @ @ match which is defined in 6.1.5.1 as

A regular expression method that matches the regular expression against a string.

So yes, you are right with this. However:

Let matcher be GetMethod(regexp, @ @ match) // from 21.1.3.11

is not correct in FF and V8 (warning - source: own research).

Update Sorry for tagging user match.

Update 2 Everything was updated. Sorry for possible misunderstanding and confusion.

Update 3 For example, this works in both FF and V8

> 'kool'.match({ source: 'o+.' })
[ 'o', index: 1, input: 'kool' ]

@developit
Copy link
Member

Swapping the literal for a constructor looks perfect, wish I'd thought of that.

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

is not correct in FF and V8 (warning - source: own research).

Perhaps I've misunderstood even your premise of discussion. Is your position:

  • We disagree about the spec requiring that String#match(..) _should_ funnel to RegExp#exec(..)? I could disagree/clarify a couple of your assertions (like 21.2.5.2.2.14 and 6.1.5.1), but I'm wondering if I even need to go down that path?
  • Or... that I'm correct that by spec such delegation should ocur, but that FF and v8 currently do not and that they're wrong as a result?

@neVERberleRfellerER
Copy link
Contributor

@getify

Or... that I'm correct that by spec such delegation should ocur, but that FF and v8 currently do not and that they're wrong as a result?

This.

However, the most important part of discussion is the behavior of sticky regexpes (for example when using exec directly). My stance and review of discussed problem (to make sure that we are discussing the same thing):

Implementations are corrent, when they return null and set lastIndex to 0 when match does not coccur exactly at lastIndex. This is in conflict with your samples (https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag).

Your stance:

RegExpBuiltinExec of course is our main exec abstract operation, which checks (21.2.5.2.2.8) sticky and, skipping through your quoted step 15 because matchSucceeded gets set to true, proceeds to use sticky (21.2.5.2.2.18), which updates lastIndex. Rinse, repeat.

But matchSucceeded is set to false in step 14 and step 15 is evaluated. Then, if machSucceeded is set, it will get to step 18 (through step 16 and eventually 17). When match in step 15 results in failure, step 15.c.i is taken which results in null returned and lastIndex set to zero. Which is consistent with behvior of both tested implementations (FF, V8),

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

I'm glad I clarified, because I thought you were saying something entirely different and I was arguing that diff point. :)

This is in conflict with your samples

Wait, how is it in conflict with my examples? Both my g+exec(..) and my y examples show the last failed match returning lastIndex to 0, as I quote myself:

...
re.exec( str );     // null -- no more matches!
re.lastIndex;       // 0 -- starts over now!

and

...
re.test( str );     // false -- no more matches!
re.lastIndex;       // 0 -- starts over now!

Isn't that not indeed in agreement both with you and the spec?

[Edit]: All the stuff in your previous message, where you quote "my stance"... none of that was arguing against lastIndex === 0 at the end of all matches, it was arguing FOR sticky when there's still matches left. :)

@neVERberleRfellerER
Copy link
Contributor

@getify Problem is in this (taken from 'y' samples)

var re = /o+./y,    // <-- look, `y` now!
    str = "foot book more";

str.match( re );    // ["oot"]
re.lastIndex;       // 4

When match does not occur at lastIndex, it should set lastIndex to 0 (it already is in this case) and return null. So, I expect the following output:

var re = /o+./y,    // <-- look, `y` now!
    str = "foot book more";

str.match( re );    // null
re.lastIndex;       // 0

Because 15.c.i is taken in this case.

Update reaction to [edit]ed part:

No, but it was (I think at last) arguing against lastIndex === 0 and null result on first match call.

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

Ahh, further clarification of which example you were concerned about! :)

When match does not occur at lastIndex, it should set lastIndex to 0 (it already is in this case) and return null. So, I expect the following output:

I don't interpret that at all. Let me explain. From 25.2.5.2.2, we get:

...
4. Let lastIndex be ToLength(Get(R,"lastIndex")). <-- `lastIndex` is initially `0` here, right?
...
14. Let matchSucceeded be false.
15. Repeat, while matchSucceeded is false
     a. If lastIndex > length, then               <-- nope
     b. Let r be matcher(S, lastIndex).           <-- should succeed, right? `o+.` : `"oot"`
     c. If r is failure, then                     <-- nope, `r` succeeded, right?
     d. else
          ...
          ii. Set matchSucceeded to true.
16. Let e be r's endIndex value.
...
18. If global is true or sticky is true,
     a. Let setStatus be Set(R, "lastIndex", e, true).

So I guess where we're differing is 15.b... i see it succeeding, since at that point lastIndex is 0, and it can indeed find "oot" (at position 1).

What am I missing?

@neVERberleRfellerER
Copy link
Contributor

@getify

So I guess where we're differing is 15.b

Indeed. This is the only difference. It should not succeed and the reason why non-sticky regexps succeed is, that they start again from next position, since 15.c.ii is taken:

Let lastIndex = lastIndex+1.

Update Some samples (Update 2 warning: untested)

var str = "foot book more";

// this should work
var re = /.*?o+./y;
re.lastIndex = 0;
re.exec(str); // ['foot']

//this should work
var re = /.*?o+./y;
re.lastIndex = 1;
re.exec(str); // ['oot']

// this should not work
var re = /o+./y;
re.lastIndex = 0;
re.exec(str); // null

// this should work
var re = /o+./y;
re.lastIndex = 1;
re.exec(str); // ['oot']

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

It should not succeed

I still don't understand why? matcher is the internal regexp engine [[RegExpMatcher]], and it's called the very first time through, in 15.b, essentially as matcher("foot book more",0).

My understanding is that a regular expression engine does its own internal forward and back-tracking stuff looking for suitable matches, so even though it doesn't match o+. at position 0, it does internally advance to position 1, and indeed find a sufficient match there, which is the "oot", and it returns that match result as r, which would be success.

I don't see that the looping (and incrementing of lastIndex) semantics implied in step (15) are performing the actual searching in the string for the first/next match.

Am I fundamentally misunderstanding how the internal regex matcher engine works?

@neVERberleRfellerER
Copy link
Contributor

According to 21.2.3.2.2 (RegExpInitialize):

  1. Set obj’s [[RegExpMatcher]] internal slot to the internal procedure that evaluates the above parse of P by applying the semantics provided in 21.2.2

According to 21.2.2.2 (Pattern):

NOTE: A Pattern evaluates ("compiles") to an internal procedure value. RegExp.prototype.exec and other methods can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String

According to 21.2.5.2, the exec uses RegExpBuiltinExec only (21.2.5.2 point 7), and thus I believe, that the internal procedure mentioned in 21.2.2.2 is [[RegExpMatcher]].

Since the internal procedure from 21.2.2.2 (Pattern) matches exactly at specified offset, the 21.2.5.2.2 point 15.b should indeed fail.

Warning: this is a minefield.

@getify
Copy link
Author

getify commented Feb 27, 2015

So if I'm getting an understanding of the intent of "sticky" (I guess I haven't yet, to this point), it takes a pattern like /foo/ and implies that for a match to be made, the "f" must be at ONLY the current value of lastIndex (0 by default, or whatever it's set to later), and no further forward (or backward) in the input string.

Is that correct?

That seems to imply that "sticky" is useless for repeatedly stepping through matches (unless they're always entirely contiguous/adjacent), as is typically done with g + exec(..).

Because wouldn't this then fail?

var re = /h./y, str = "hello, how are you?"

re.lastIndex;       // 0
re.exec( str );     // ["he"]

re.lastIndex;       // 2
re.exec( str );     // null

If that's how it should work, I'm completely lost on the use cases of it? Also every example (MDN, etc) I've found implies that it's closely related to the repeated g + exec(..) usage, but it clearly is not the same.

@allenwb
Copy link

allenwb commented Feb 27, 2015

First, NOTEs are non-normative, which means that they aren't supposed to add anything (except clarification) that isn't in the normative text.

The matcher "internal procedure"produced by 21.2.2.2 matches at the offset that is passed to it via its 'index parameter. a different index can be supplied each time it is called at 15.b in 21.2.5.22 but each such call matches only at the index supplied in the call.

@allenwb
Copy link

allenwb commented Feb 27, 2015

@getify

So if I'm getting an understanding of the intent of "sticky" (I guess I haven't yet, to this point), it takes a pattern like /foo/ and implies that for a match to be made, the "f" must be at ONLY the current value of lastIndex (0 by default, or whatever it's set to later), and no further forward (or backward) in the input string.

Is that correct?

yes

If that's how it should work, I'm completely lost on the use cases of it? Also every example (MDN, etc) I've found implies that it's closely related to the repeated g + exec(..) usage, but it clearly is not the same.

It's an adjustable (via lastIndex) left anchor.

@getify
Copy link
Author

getify commented Feb 27, 2015

It's an adjustable (via lastIndex) left anchor.

I guess what I was asking is, if one is parsing a string (of presumably unknown contents -- otherwise why parse?), how does one know what to adjust lastIndex to, if the non-anchoring at the front of the pattern is not allowed to adjust it for you?

In what circumstances do you need sticky when you already know the indexes your matches are at? You can just set the index and use exec(..), and it'll match right there, right? What benefit would y provide in that case?

Sorry for being dense here. It's obvious my own misunderstandings are driving all the confusion in this thread.

@neVERberleRfellerER
Copy link
Contributor

@allenwb Thank you for making everything clear.

First, NOTEs are non-normative

This note was easiest to reference and it was compatible with my understanding that I have got from pattern and notation of continuation and matcher (the last two are not even numbered and as a result difficult to reference). And the clarification provided by it was really useful when I started diging through it, so its purpose was fullfiled, I think :-)

@getify I have found this:
http://wiki.ecmascript.org/doku.php?id=harmony:regexp_y_flag
One use case with explanation is in Rationale. The page itself is pretty old, but the idea from Proposal is same as in ES6 draft referenced in this conversation, so the same use cases should apply here.

@getify
Copy link
Author

getify commented Feb 27, 2015

also, what then is ES6's expected handling of:

var re = /^o./y,     // <---- notice the ^ here
    str = "oops nope";

re.lastIndex;        // 0
str.match( re );     // ["oo"]
re.lastIndex;        // 2

str.match( re );     // null
re.lastIndex;        // 0

re.lastIndex = 6;
str.match( re );     // ????

@neVERberleRfellerER
Copy link
Contributor

@getify I think that the last result should be null, because input is not starting at position 6.

Also take a look at this example, which I believe is correct:

var str = 'oops noaa\noops nobb\noops nocc';
var re = /^oops .o..\n?/ym; // notice the 'm' and different pattern

re.lastIndex = 10; // new line is starting at position 10

re.exec(str); // ['oops nobb\n']
re.lastIndex; // 20

re.exec(str); // ['oops nocc']
re.lastIndex; // 29

re.exec(str); // null
re.lastIndex; // 0

re.exec(str); // ['oops noaa\n']
re.lastIndex; // 10

This does not work with /g and repeated exec.
Update It does. So, the adjustable left anchor is really the only advantage of this?

@getify
Copy link
Author

getify commented Feb 27, 2015

@neVERberleRfellerER

This does not work with /g and repeated exec.

It seems to work for me (chrome) with /gm:

exec + gm

I don't see how y provides a benefit there? Did I miss it?

@neVERberleRfellerER
Copy link
Contributor

@getify My bad. I ahve missed /m when I have been testing this. I do not know. Maybe the efficiency in some cases is the only advantage.

@allenwb
Copy link

allenwb commented Feb 27, 2015

see NOTE at end of 21.2.2.6

@neVERberleRfellerER

Maybe the efficiency in some cases is the only advantage.

Probably. I wasn't the feature champion, I just had to figure out how to integrate it into the spec.

@getify
Copy link
Author

getify commented Feb 27, 2015

Maybe the efficiency in some cases is the only advantage.

Probably.

Yikes, I hope not.

Let me go back to some of my earliest assertions about the benefits of y... do these hold water?

  1. y can (partially) be emulated with g + exec(..), but the use of the g flag changes other regex-aware utilities (like String#match(..)) in perhaps-unwanted ways (match(/../g) grabs all matches all at once, instead of one-at-a-time), whereas match(/../y) will give you the sticky like benefits (lastIndex) without changing the one-at-a-time behavior of match(..).
  2. y can be seen as to restrain a behavior that's not possible to restrain with g + exec(..), which is that g + exec(..) will automatically "look forward" in the input string from wherever lastIndex currently is, perhaps finding it past lastIndex. y says "no, no, you can only find this at exactly lastIndex, or not at all".

I think those are faithful explanations of the difference/benefit of y. Sanity check?

@getify
Copy link
Author

getify commented Feb 27, 2015

@allenwb

see NOTE at end of 21.2.2.6

Yep, that's exactly the kind of clarity I needed on the ^ issue. That's put to rest now. Thanks!

@getify
Copy link
Author

getify commented Feb 28, 2015

FWIW, I've completely rewritten (and expanded) my book section on "sticky mode" after all these explorations. If it helps at all, here how I went about it:

https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag

@sebmck
Copy link
Contributor

sebmck commented Mar 3, 2015

Closing this now since I've basically solved this on the Babel side and it's now up to a polyfill to patch the RegExp constructor. Thanks everyone for the insightful discussion!

@sebmck sebmck closed this as completed Mar 3, 2015
@lock lock bot added the outdated A closed issue/PR that is archived due to age. Recommended to make a new issue label Jul 27, 2018
@lock lock bot locked as resolved and limited conversation to collaborators Jul 27, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
outdated A closed issue/PR that is archived due to age. Recommended to make a new issue
Projects
None yet
Development

No branches or pull requests

6 participants