"regular expression too large" for seemingly simple regex #119

Closed
ningvin opened this issue May 13, 2022 · 9 comments

Comments

@ningvin

ningvin commented May 13, 2022

The following regular expression raises an error (Failed: error 120 at offset 27: regular expression is too large):

([A-Z]|[0-9]|[xyz]){1,1025}

I know the character classes can be combined into one, e.g. like so:

([A-Z0-9xyz]){1,1025}

This regular expression compiles just fine, however, and so do a lot of "simplified" versions of the first pattern:

([A-Z]|[0]){1,1025}
([A-Z]|[0-9]|[xyz]){1,500}

As far as I can tell the documentation only states that the numbers inside a curly-braced repetition must be less than 65536, which 1025 does not even come close to.

I am guessing there is some complexity introduced by the combination of alternatives using | and character classes including character ranges, but I wanted to check anyway: is this the intended behavior?

I can reproduce this behavior both on my local machine using pcre2test (a build of the master branch with default settings on Linux) and on https://regex101.com/ with the PCRE2 engine.

@PhilipHazel
Collaborator

This is, I'm afraid, expected behaviour. When there is a group that has a fixed iteration limit, it gets replicated in the generated code, so the larger the upper limit, the longer the generated code. Use the pcre2test "memory" option to see memory usage, for example: /abcd/memory as a pattern, or pcre2test -s memory. Using a set of alternatives makes the basic group longer, and hence more memory is needed.

In the 8-bit library, 16-bit values are used for lengths within the compiled pattern, thus limiting its size to around 64K. However, you can configure PCRE2 with --with-link-size=3 (or 4) to use larger internal links, in which case the compiled pattern can be much bigger. There are no plans to change any of this, so I'm going to close this issue.
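
To make the replication argument concrete, here is a rough back-of-envelope sketch in Python. It assumes (this is not taken from PCRE2's source) that the compiled size is capped at exactly 2**(8 * link_size) code units, that every replicated copy of the group costs the same, and that fixed overhead is negligible. The helper names are hypothetical.

```python
# Back-of-envelope model of a bounded repetition such as (...){1,N}.
# Assumptions (illustrative only, not PCRE2's actual accounting):
#   * the compiled pattern may span at most 2**(8 * link_size) code units
#   * the group is replicated once per allowed iteration
#   * fixed overhead outside the copies is ignored


def max_addressable(link_size: int) -> int:
    """Largest size that an offset stored in `link_size` bytes can span."""
    return 2 ** (8 * link_size)


def per_copy_budget(max_repeat: int, link_size: int = 2) -> int:
    """Code units available to each copy if the group is replicated max_repeat times."""
    return max_addressable(link_size) // max_repeat


if __name__ == "__main__":
    # Upper limits taken from the patterns in this thread.
    for max_repeat in (500, 1025):
        print(max_repeat, per_copy_budget(max_repeat))
```

Under these assumptions, the default link size of 2 leaves roughly 63 units per copy at 1025 repetitions and roughly 131 at 500. That is consistent with the three-way alternation fitting at {1,500} but not at {1,1025}, while shorter groups still fit at 1025.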

@ningvin
Author

ningvin commented May 13, 2022

Hi Philip, thanks for the quick response. My curiosity has been satisfied :-)

@JohnHerry

JohnHerry commented Mar 2, 2023

Hi Philip, we got the same error message, but our regex string really is very large:

PCRE2 compilation failed at offset 376104: code 120 msg regular expression is too large

How should we adjust the compile options to support such a very long regular expression? Thanks.

Edit: our expression contains a large group of keywords, like:
(Hello|World|....... a large amount of keywords)

@zherczeg
Collaborator

zherczeg commented Mar 2, 2023

Please read the comments above. --with-link-size is the compile-time option.

@JohnHerry

Yes, we have configured this option with the maximum value, 4. It is not the solution in our case.

More info: in "PCRE2 compilation failed at offset 376104: code 120 msg regular expression is too large", 376104 is exactly the length of the regex string in our case.

Also, this exception happens only occasionally, not always.

@PhilipHazel
Collaborator

It must be a truly large regex if link size 4 cannot handle it. I'm afraid that is the absolute limit. However, 376104 is nowhere near the limit for a 32-bit number, so I'm wondering if something else is going on here. You would get that error with the default link size of 2. Are you sure you are linking with a version of PCRE2 that is compiled with --with-link-size=4? Or could it be accidentally linking with a system PCRE2 that has the default?
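
The comparison above can be checked with a few lines of Python. This is only a sanity check under stated assumptions: it treats the reported offset (the length of the pattern string) as if it were the compiled size, which it is not, and it takes 2**(8 * link_size) as the ceiling for each link size, which may be higher than the library's real cap. The function name is hypothetical.

```python
# Compare the offset reported in the error message with the largest value
# that each link size could address in principle.
# Assumption: the ceiling is 2**(8 * link_size); PCRE2's real cap may be lower.


def fits_link_size(size: int, link_size: int) -> bool:
    """True if `size` is below what an offset of `link_size` bytes can span."""
    return size < 2 ** (8 * link_size)


REPORTED_OFFSET = 376104  # from the error message quoted in this thread

if __name__ == "__main__":
    for link_size in (2, 3, 4):
        ceiling = 2 ** (8 * link_size)
        print(link_size, ceiling, fits_link_size(REPORTED_OFFSET, link_size))
```

The reported offset exceeds the ceiling for link size 2 but is far below the ceilings for 3 and 4, which is why the error looks like what a default build would produce.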

@JohnHerry

JohnHerry commented Mar 3, 2023

> It must be a truly large regex if link size 4 cannot handle it. I'm afraid that is the absolute limit. However, 376104 is nowhere near the limit for a 32-bit number, so I'm wondering if something else is going on here. You would get that error with the default link size of 2. Are you sure you are linking with a version of PCRE2 that is compiled with --with-link-size=4? Or could it be accidentally linking with a system PCRE2 that has the default?

In pcre2.cmake, we added some settings like this:

set(PCRE2_LINK_SIZE 4)
set(PCRE2_BUILD_PCRE2GREP OFF)
set(PCRE2_BUILD_TESTS OFF)

So it is confirmed that the link size is 4. Without this setting, our large regex string fails to compile every time.

@zherczeg
Collaborator

zherczeg commented Mar 3, 2023

The question wasn't how you compiled pcre2, but whether you are using the pcre2 you compiled or, accidentally, the system one. By the way, the compiled pcre2 comes with a pcre2test tool; you can try your regex there.
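
One way to see which link size a given PCRE2 shared library was built with is to ask the library itself through pcre2_config() with PCRE2_CONFIG_LINKSIZE. The sketch below does this from Python via ctypes for whatever 8-bit library the dynamic loader finds. The numeric value 3 for PCRE2_CONFIG_LINKSIZE is copied from pcre2.h as I remember it and should be verified against your header, and the function name `loaded_link_size` is hypothetical. In a real application the same query would be made from inside the process that reports the error.

```python
import ctypes
import ctypes.util
from typing import Optional

# Assumed value of the PCRE2_CONFIG_LINKSIZE macro; check it in your pcre2.h.
PCRE2_CONFIG_LINKSIZE = 3


def loaded_link_size(lib_name: str = "pcre2-8") -> Optional[int]:
    """Return the link size of the PCRE2 library the loader resolves,
    or None if the library or the symbol cannot be found."""
    path = ctypes.util.find_library(lib_name)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        # Functions in the 8-bit library carry the _8 suffix.
        config = lib.pcre2_config_8
    except (OSError, AttributeError):
        return None
    config.argtypes = [ctypes.c_uint32, ctypes.POINTER(ctypes.c_uint32)]
    config.restype = ctypes.c_int
    value = ctypes.c_uint32(0)
    if config(PCRE2_CONFIG_LINKSIZE, ctypes.byref(value)) < 0:
        return None
    return value.value


if __name__ == "__main__":
    print("link size of the library found by the loader:", loaded_link_size())
```

If this reports 2 on a machine where the custom build was configured with link size 4, the loader is resolving a different (probably system) library.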

@JohnHerry

> The question wasn't how you compiled pcre2, but whether you are using the pcre2 you compiled or, accidentally, the system one. By the way, the compiled pcre2 comes with a pcre2test tool; you can try your regex there.

OK, we will give it a try. Thanks, everybody, for the help.


4 participants