Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCRE2 JIT crash in 10.38 and 10.42 #180

Closed
genivia-inc opened this issue Dec 28, 2022 · 12 comments
Closed

PCRE2 JIT crash in 10.38 and 10.42 #180

genivia-inc opened this issue Dec 28, 2022 · 12 comments

Comments

@genivia-inc
Copy link

The regex pattern [\w-]+@([\w-]+.)+[\w-]+ causes JIT to crash when matching the pattern against some "binary" data as input.

This was first reported here: Genivia/ugrep#241

I isolated the relevant code in a small C++ file to assist, see attached poc.cpp in the poc.zip. The input file with "binary" data archive2.tgz is also attached in the poc.zip.

The crash is observed with PCRE2 versions 10.38 and 10.42 using UTF-8 matching with code unit width 8. It appears that PCRE2_JIT_PARTIAL_HARD is a potential cause. I've tested in MacOS M1, Android ARM, MacOS Intel. All crash in the JIT code. Matching without JIT works fine.

Hopefully this can be fixed.

poc.zip

@carenas
Copy link
Contributor

carenas commented Dec 28, 2022

when using pcre2 >= 10.34, it is recommended to add PCRE2_MATCH_INVALID_UTF to the options passed to pcre2_compile for this use case, and indeed doing so in your POC prevents the crash.

@genivia-inc
Copy link
Author

Thanks. A useful suggestion indeed. But I don't recall reading the PCRE2 documentation that PCRE2_MATCH_INVALID_UTF is recommended or necessary, perhaps it didn't jump out to me as important.

@carenas
Copy link
Contributor

carenas commented Dec 29, 2022

Agree it is not obvious, but it is at least implied by the documentation when it alludes to the fact that the code expects the subject to be valid UTF and even warns of the possibility of crashes, and also points to that flag as the only way to search within binary files safely (see comment at the end of the page).

@zherczeg
Copy link
Collaborator

Can we close this issue?

@zherczeg
Copy link
Collaborator

Note: PCRE2_MATCH_INVALID_UTF has a small perf overhead, so use it only when the subject string may contain invalid utf8/16 characters.

@genivia-inc
Copy link
Author

@carenas Agree it is not obvious, but it is at least implied by the documentation when it alludes to the fact that the code expects the subject to be valid UTF and even warns of the possibility of crashes, and also points to that flag as the only way to search within binary files safely (see comment at the end of the page).

I would believe that others will run into this problem too eventually, especially if this was changed since 10.34 (I believe, based on comments). This is a situation where the default settings should be safe, rather than the other way round. Better be safe than sorry, as they say. If people want to search UTF-encoded data they fully understand and control/own, then it makes sense NOT to use PCRE2_MATCH_INVALID_UTF. Otherwise, this flag should be the default whenever PCRE2_UTF is used. Perhaps a flag like PCRE2_ONLY_VALID_UTF would be helpful if the default is PCRE2_MATCH_INVALID_UTF, i.e. assuming only valid UTF input to speed up matching? Just my 2 cents.

@zherczeg PCRE2_MATCH_INVALID_UTF has a small perf overhead, so use it only when the subject string may contain invalid utf8/16 characters.

For the ugrep search tool with option -P I will make the PCRE2_MATCH_INVALID_UTF the default when UTF patterns are matched with PCRE2_UTF and PCRE2_UCP (matching ASCII by turning Unicode matching off with -U will not use these flags obviously). There is no way we can guarantee that the input to search is valid UTF (at least not without doing a full pass over it, which is expensive). The ugrep tool does match "binary" files like GNU grep does, and also checks if a file is "binary" when the first 16K is not valid UTF or contains a NUL character.

PS. if adding a flag is something to contemplate, then the name of the flag is critical to avoid confusion. So something like PCRE2_MATCH_VALID_UTF does not reveal that only valid UTF is assumed. Likewise PCRE2_VALID_UTF is not descriptive. Perhaps PCRE2_ONLY_VALID_UTF or PCRE2_ASSUME_VALID_UTF?

@zherczeg
Copy link
Collaborator

Historically PCRE2 only supported valid utf strings. Supporting invalid utf is a relatively new feature. Hence it is hidden behind a flag for compatibility.

@ltrzesniewski
Copy link
Contributor

ltrzesniewski commented Dec 29, 2022

Also, note that you're using pcre2_jit_match, which skips sanity checks, hence the crash.

When using pcre2_match with a JIT-compiled pattern, the library performs a UTF validation pass by default, unless PCRE2_NO_UTF_CHECK is specified.

@genivia-inc
Copy link
Author

@ltrzesniewski Also, note that you're using pcre2_jit_match, which skips sanity checks, hence the crash.

Understood. But I have to nitpick a bit about this though: note that pcre2_jit_match is not mentioned in the Unicode documentation page so it is easily assumed to behave like pcre2_match (nit number 1). And sure, "sanity checks" are mentioned at the pcre2_jit_match documentation page, but this does not cross reference the Unicode page on specifically what that means (nit number 2), i.e. that page does not define what "sanity checks" are, so it is left to the reader to figure out that this can actually crash.

@ltrzesniewski When using pcre2_match with a JIT-compiled pattern, the library performs a UTF validation pass by default, unless PCRE2_NO_UTF_CHECK is specified.

I understand. But if this pass finds invalid UTF then this returns an error code, which is not what we want. Grepping is more of a "brute force thing" perhaps.

@ltrzesniewski
Copy link
Contributor

Indeed, the pcre2_jit_match page only says:

The subject string is not checked for UTF validity.

...but the consequence of that isn't mentioned.

The pcre2jit page gives more details:

Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined.

The pcre2api page is the most explicit about it:

It may cause your program to crash or loop.

I suppose this scary warning should be added to the other pages as well.

@PhilipHazel
Copy link
Collaborator

I have made a note to update the documentation in due course.

@PhilipHazel
Copy link
Collaborator

I have increased the scary level in several pages.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants