Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simple pattern blows out JIT stack #156

Open
kohler opened this issue Nov 2, 2022 · 13 comments
Open

Simple pattern blows out JIT stack #156

kohler opened this issue Nov 2, 2022 · 13 comments

Comments

@kohler
Copy link

kohler commented Nov 2, 2022

Hi! Thanks so much for this project.

The following relatively simple pattern has unexpected bad behavior on JIT:

/\A(?:[^\\')]|'(?:[^'\\]|\\[\S\s])*')*/s

On long strings, non-JIT PCRE performs much faster than JIT, and on an ~8192-character string, PCRE overflows the default JIT stack size.

Failing test available here: kohler@c142ad0

@kohler
Copy link
Author

kohler commented Nov 2, 2022

This semantically equivalent regex seems to behave OK:

/\A(?:[^\\')]*(?:'(?:[^'\\]|\\[\S\s])*'|))*/s

@PhilipHazel
Copy link
Collaborator

Which release of PCRE? And what is the content of the strings you are matching? (Do they match or not match?) Using the current 10.40 release I do not see a slowdown with a string consisting of "abcde" repeated many times, but I do see the JIT stack overflow when the repeat is more than 272. The JIT maintainer may wish to comment.

@kohler
Copy link
Author

kohler commented Nov 3, 2022

Many releases of PCRE, including HEAD. The content matches. Please see this following commit, to PCRE2’s testdata/testinput1, for an example string in pcre2test format.

kohler@c142ad0

@zherczeg
Copy link
Collaborator

zherczeg commented Nov 4, 2022

This looks like a * * test, where every character can be matched. The [\S\s] looks like a dotall. This requires a lot of stack data. Atomic blocks should help. Is there any optimization which optimizes this case in the interpreter? I can check this later.

@PhilipHazel
Copy link
Collaborator

I don't think there's anything special about this in the interpreter. (But my memory isn't what it used to be.)

@kohler
Copy link
Author

kohler commented Nov 4, 2022

I was surprised by this behavior mostly because this grammar doesn’t theoretically require any lookbehind or backtracking at all.

image

@zherczeg
Copy link
Collaborator

zherczeg commented Nov 4, 2022

Regex engines have different engine types:
https://zherczeg.github.io/sljit/regex_compare.html

PCRE2 is more like a classic programming language, where you define the operations. For example, if you implement a bubble sort in C, and quicksort in JS, you cannot say C is slow just because JS is faster in this case.

@kohler
Copy link
Author

kohler commented Nov 5, 2022

@zherczeg, is there no special case in the backtracking engine for a pattern like (a<PAT1>|[^a]<PAT2>), where the two branches of the choice are disjoint, and the correct branch can be detected with one character of lookahead?

If there is no such special case currently, do you think adding one would be interesting, and do you have any advice about how to do so?

@kohler
Copy link
Author

kohler commented Nov 5, 2022

In particular, call the case I care about (A|B)*C. Here:

  • The match is greedy.
  • The first character in A cannot start B or C. (NB this includes the case when C is empty at the end of the pattern.)
  • Either A is fixed-length or the last character in A cannot start B or C.
  • The first character in B cannot start A or C.
  • Either B is fixed-length or the last character in B cannot start A or C.

Given this, as soon as the ith repetition of (A|B)* matches, it is safe to throw away the backtracking information associated with the 1st-ith repetitions of (A|B)*. The final match will either fail or start with ≥i repetitions of (A|B).

@zherczeg
Copy link
Collaborator

zherczeg commented Nov 5, 2022

These are static compiler optimizations, and there is a project for that:
https://github.com/zherczeg/repan

If you implement your suggested optimization there, it can rewrite:
/\A(?:[^\\')]|'(?:[^'\\]|\\[\S\s])*')*/s -> /\A(?:[^\\')]|'(?:[^'\\]|\\[\S\s])*+')*+/s

The key thing: it does not matter which engine you use, it runs faster, because, as you said, the engine can throw away the backtracking info. The generic use case: you have your patter set, you use repan at compile time to generate their optimized version, and use them at runtime. In this case doing complex regex optimizations is basically free.

@JetXujing
Copy link
Contributor

I have simplified the problem reproduction method. I can also reproduce the problem by using the following method.

[root@localhost ~]# cat re
/([^A]|B)*/
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

[root@localhost ~]# pcre2test -jit re
PCRE2 version 10.39 2021-10-29
/([^A]|B)*/
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Failed: error -46: JIT stack limit reached

In addition, I found that when the string length is 1023, no error is reported, but when the string length is 1024, the error is reported. It looks like 1024 is the threshold.

@zherczeg
Copy link
Collaborator

The default stack is 32K, you should use the jit stack to allocate more. But the pattern is still inefficient. If you make a bubble sort in C, and it is slow, don't blame the compiler.

@JetXujing
Copy link
Contributor

JetXujing commented Apr 28, 2023

Yes, I tried to use the jitstack parameter and it didn't report an error.

[root@localhost ~]# pcre2test -jit re
PCRE2 version 10.39 2021-10-29
/([^A]|B)*/jitstack=1024
    XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXPXX
 0: XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXPXX
 1: X

I tried the kohler/pcre2@c142ad0 case and added jitstack to the regular expression. No error was reported.
/\A(?:[^\\')]|'(?:[^'\\]|\\[\S\s])*')*/s,subject_literal,jitstack=8192

Can we consider adding a reminder log: the stack space used by default is not enough, consider using heap space through jitstack.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants