-
Notifications
You must be signed in to change notification settings - Fork 550
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make (*THEN) work - first draft #22215
base: blead
Are you sure you want to change the base?
Conversation
Currently, a global flag (RExC_seen |= REG_CUTGROUP_SEEN;) is set in regcomp.c when (*THEN) is encountered. Then in S_regmatch() and when encountering a BRANCH node, a choice point is puhsed with PUSH_YES_STATE_GOTO if the cutgroup flag is set, otherwise a normal choice point is pushed with PUSH_STATE_GOTO. Since this flag is global to the regex, every BRANCH node of the regex will push will a "yes state" if the regex contain a single (*THEN), instead of BRANCH nodes pushing regular choice points. Given that "sayNO" (e.g. simple backtracking) performs a cut if a "yes state" (choice point and cut point) remains and if no_final is true (which it is because backtracking after having encountered (*THEN), to the choice point pushed by (*THEN) causes the regex engine to execute the op CUTGOUP_next_fail which assigns no_final to 1) and that every alternation is also a new cut point, the cut operation has the effect to perform a simple backtracking operation as far as alternations are concerned, because every alternation pushes a choice point/cut point (e.g "yes state"). Normally, only the BRANCH node of the alternation directly containing (*THEN) should push a "yes state". One solution is to mark the appropriate BRANCH as being the outtermost alternation of a (*THEN) group and check this mark/flag inside case BRANCH (which it already does currently) and push a new choice point or a choice point+cut point accordingly with PUSH_YES_STATE_GOTO or respectively PUSH_STATE_GOTO. An alternative solution is to create a distinct BRANCH_cutgroup regnode type which would always push a choice point+cut point while the BRANCH regnode would always push a regular choice point. In order to mark BRANCH nodes as containing a (*THEN), a field (has_cutgroup) in the regex parser and compiler context (RExC_state_t) is set to 1 in regbranch() when encountering (*THEN). This field is looked up by S_reg() (regbranch()'s caller) when creating the BRANCH regnode and set the regnode_head data.u_8.flags field of this new regnode accordingly. This field (has_cutgroup) is first saved before calling regbranch(), then set to 0, then call regbranch() is made. After regbranch() returned and after the has_cutgroup field has been checked when creating the regnode, it is restored to its previous value in order to not forget having seen a previous (*THEN) group in a situation like: / (?: a (*THEN) (?: b (*THEN) | c) | d) /. !!! This commit doesn't attempt to fix the logic of TRIE regnodes concerning (*THEN) !!!
This pull request is related to the issues @demerphq and any other perl developper experienced with the regex engine should approve this pull request before merging it, and possibly modify it or follow with the TRIE modification. |
I'm not really familiar with the issue, but since this changes the behaviour of regexes, it definitely needs tests that demonstrate that. |
Yes, it changes the behavior of regexes but that new behavior is how the regexes are supposed to behave according to the documentation in So far, this commit seems to solve these issues: |
I'm sorry, there *are* more advanced tests for |
@florian-pe I'm hoping you will continue to work on this |
initialize RExC_has_cutgroup remove erroneous instructions inserted in a previous comit use the BRANCH_HAS_CUTGROUP macro
@khwilliamson Yes, I started working on it again. I thought I had the final solution but there is another piece of logic that is missing. Some of the added instructions in S_reg() to keep track of the branches containing I've fixed If the regex matching continues, visit a The Now, this is the hiccup I've found:
Both current perl and my perl branch match "ab". This is because currently, when backtracking into Apart from that, When I'm done I'll make a new clean pull requests with better commit messages. I need to setup git to be able to makes pushes on my own fork first... |
FWIW, this is on my todo list to review. Feel free to poke me to remind me to do so. |
@demerphq The refactor is about making the backtracking's logic cleaner, it's not about modifying the code of the regops themselves in the measure possible, except for the The important pieces that would be replaced are: the macros The core idea of the refactor is to make any regex pattern assertion failure simply do: The first implication of this is to modify the various The header of the loop isn't pretty but there isn't a reason to change it for now. The code of After the switch statement and inside the while loop, there would be the label and code of After the while loop, there would be additionally the labels In order to make the more complex patterns work, the backtrack stack would now contain 3 kind of frames: choice point frames, function-like frames and cut frames. The The choice point frame would be either a When backtracking, the choice_point local variable would be reset to Cut frames are never backtracked into. Their purpose is to remove zero or more choice point frames in one step. Cut frames contain a pointer to the previous cut frame and a pointer to the "previous" choice point. When performing a cut, every choice point above the cut frame will be discarded and the choice_point variable will be reset to the prev_choice_point_state field inside the cut frame as well as the previous cut state. The "previous" choice point can be last highest choice point but can also be a deeper choice point in the case of Critically, when backtracking or performing a cut, the backtrack stack "top" is reset to whatever is the highest frame in use among the newly resetted choice point, the newly resetted cut point and the current func frame that have been newly resetted in the case of a backtracking or that remained untouched in the case of a cut. There may be dead frames between the highest frame in use and the 2nd highest frame in use but won't be ever made use of again because they are no longer in the choice point frame linked list, the cut frame linked list or the func frame linked list. Func frames are directly analogous to function call frame of the function call stack of a classical imperative language. They don't contain any backtracking information. They can come from pattern calls or recursive calls but they also can be the first frame of a quantifier or be an eval frame. In the case of a quantifier, they contain the current repeat count. All func frames contain a prev_func_frame field that is equivalent to a return address in the function call stack. When doing a return, the current func_frame variable is reset but the func_frame is only "popped" if the last choice point frame is below it. If a choice point frame in use is above it, that means that a backtracking event could directly reenter this func frame, or indirectly through a caller or a callee func frame. If the last choice point is below the returning func frame but a cut point frame in use is above it (if it is even possible), the func frame is "popped" and the cut is discarded and resetted (I think). Since the backtrack stack is segmented, in order to perform a stack frame height comparison, each frame should also contain a depth field. The Looking forward, the regexp_paren_pair struct could be pushed onto the backtrack stack instead of on the save stack, (not exactly how I proposed in anoter issue BTW, what I proposed was besides the point). Basically, when encountering an OPEN or CLOSE, the specific numbered capture regexp_paren_pair struct of the logical paren array would be saved onto the backtrack stack (a 4th kind of frame), then the new start or end field would be written to in the logcial paren pair array. Each saved regexp_paren_pair struct would be prepended into a linked list whose head is the last choice point. When backtracking, each paren struct of the choice point's paren_pair linked list would be used to reset the logical paren pair array. When it comes to using these stack frames:
I don't know all the details yet concerning conditional patterns and quantifiers, but most quantifiers would push one func frame and a various amount of choice points. EDIT: |
Now that I think about it, it's possible for COMMIT, PRUNE and SKIP to perform a cut when they are visited, but if they are inside any kind of independent subpattern (real independent subpattern, lookaround or eval), they can only cut to the cut point pushed at the start of the independent subpattern, or if they are inside a branch containing a cutgroup, they can only cut to the start of the branch. Otherwise, it should be possible for them to directly cut to the bottom of backtrack stack, thereby removing all previous choice points, when they are visited. In all cases they should still push a choice point to their respective _fail state after having performed the cut, and in the case of independent subpattern and branch-with-cutgroup, that choice point pushed by COMMIT, PRUNE or SKIP should be placed above the cut point pushed by the outter pattern and the cut performed by COMMIT, PRUNE or SKIP shouldn't discard the outter pattern's cut frame in the first place. The semantics of EDIT: |
Add myself to the AUTHORS
Currently, a global flag (
RExC_seen |= REG_CUTGROUP_SEEN;
) is set in regcomp.c when(*THEN)
is encountered. Then inS_regmatch()
and when encountering aBRANCH
node, a choice point is puhsed withPUSH_YES_STATE_GOTO
if the cutgroup flag is set, otherwise a normal choice point is pushed withPUSH_STATE_GOTO
. Since this flag is global to the regex, everyBRANCH
node of the regex will push will a "yes state" if the regex contain a single(*THEN)
, instead ofBRANCH
nodes pushing regular choice points.Given that "
sayNO
" (e.g. simple backtracking) performs a cut if a "yes state" (choice point and cut point) remains and if no_final is true (which it is because backtracking after having encountered(*THEN)
, to the choice point pushed by(*THEN)
causes the regex engine to execute the opCUTGOUP_next_fail
which assigns no_final to 1) and that every alternation is also a new cut point, the cut operation has the effect to perform a simple backtracking operation as far as alternations are concerned, because every alternation pushes a choice point/cut point (e.g "yes state"). Normally, only the BRANCH node of the alternation directly containing(*THEN)
should push a "yes state".One solution is to mark the appropriate
BRANCH
as being the outtermost alternation of a(*THEN)
group and check this mark/flag inside caseBRANCH
(which it already does currently) and push a new choice point or a choice point+cut point accordingly withPUSH_YES_STATE_GOTO
or respectivelyPUSH_STATE_GOTO.
An alternative solution is to create a distinctBRANCH_cutgroup
regnode type which would always push a choice point+cut point while theBRANCH
regnode would always push a regular choice point.In order to mark
BRANCH
nodes as containing a(*THEN)
, a field (has_cutgroup
) in the regex parser and compiler context (RExC_state_t
) is set to 1 inregbranch()
when encountering(*THEN)
. This field is looked up byS_reg()
(regbranch()
's caller) when creating theBRANCH
regnode and set theregnode_head data.u_8.flags field
of this new regnode accordingly. This field (has_cutgroup
) is first saved before callingregbranch()
, then set to 0, then callregbranch()
is made. Afterregbranch()
returned and after the has_cutgroup field has been checked when creating the regnode, it is restored to its previous value in order to not forget having seen a previous(*THEN)
group in a situation like:/ (?: a (*THEN) (?: b (*THEN) | c) | d) /
.What needs to be verified is first the
S_reg()
function, make sure each regmatch() call is preceded byRExC_has_cutgroup = 0
, that theRExC_has_cutgroup
is reset tohad_cutgroup
afterregmatch()
has returned and beforeS_reg()
itself returns, and finally that all new regnodeBRANCH
take into accounthas_cutgroup
and that the setting flags to 1 doesn't create conflicts elsewhere.Secondly,
TRIE
s need to somehow conserve the information that a particularBRANCH
node was marked as containing a cutgroup, and appropriately perform the cut inTRIE_next_fail
.