PCRE2 JIT crash in 10.38 and 10.42 #180

genivia-inc · 2022-12-28T20:53:13Z

The regex pattern [\w-]+@([\w-]+.)+[\w-]+ causes JIT to crash when matching the pattern against some "binary" data as input.

This was first reported here: Genivia/ugrep#241

I isolated the relevant code in a small C++ file to assist, see attached poc.cpp in the poc.zip. The input file with "binary" data archive2.tgz is also attached in the poc.zip.

The crash is observed with PCRE2 versions 10.38 and 10.42 using UTF-8 matching with code unit width 8. It appears that PCRE2_JIT_PARTIAL_HARD is a potential cause. I've tested in MacOS M1, Android ARM, MacOS Intel. All crash in the JIT code. Matching without JIT works fine.

Hopefully this can be fixed.

poc.zip

The text was updated successfully, but these errors were encountered:

carenas · 2022-12-28T23:14:23Z

when using pcre2 >= 10.34, it is recommended to add PCRE2_MATCH_INVALID_UTF to the options passed to pcre2_compile for this use case, and indeed doing so in your POC prevents the crash.

genivia-inc · 2022-12-29T02:31:23Z

Thanks. A useful suggestion indeed. But I don't recall reading the PCRE2 documentation that PCRE2_MATCH_INVALID_UTF is recommended or necessary, perhaps it didn't jump out to me as important.

carenas · 2022-12-29T03:17:37Z

Agree it is not obvious, but it is at least implied by the documentation when it alludes to the fact that the code expects the subject to be valid UTF and even warns of the possibility of crashes, and also points to that flag as the only way to search within binary files safely (see comment at the end of the page).

zherczeg · 2022-12-29T08:02:12Z

Can we close this issue?

zherczeg · 2022-12-29T11:54:47Z

Note: PCRE2_MATCH_INVALID_UTF has a small perf overhead, so use it only when the subject string may contain invalid utf8/16 characters.

genivia-inc · 2022-12-29T14:35:20Z

@carenas Agree it is not obvious, but it is at least implied by the documentation when it alludes to the fact that the code expects the subject to be valid UTF and even warns of the possibility of crashes, and also points to that flag as the only way to search within binary files safely (see comment at the end of the page).

I would believe that others will run into this problem too eventually, especially if this was changed since 10.34 (I believe, based on comments). This is a situation where the default settings should be safe, rather than the other way round. Better be safe than sorry, as they say. If people want to search UTF-encoded data they fully understand and control/own, then it makes sense NOT to use PCRE2_MATCH_INVALID_UTF. Otherwise, this flag should be the default whenever PCRE2_UTF is used. Perhaps a flag like PCRE2_ONLY_VALID_UTF would be helpful if the default is PCRE2_MATCH_INVALID_UTF, i.e. assuming only valid UTF input to speed up matching? Just my 2 cents.

@zherczeg PCRE2_MATCH_INVALID_UTF has a small perf overhead, so use it only when the subject string may contain invalid utf8/16 characters.

For the ugrep search tool with option -P I will make the PCRE2_MATCH_INVALID_UTF the default when UTF patterns are matched with PCRE2_UTF and PCRE2_UCP (matching ASCII by turning Unicode matching off with -U will not use these flags obviously). There is no way we can guarantee that the input to search is valid UTF (at least not without doing a full pass over it, which is expensive). The ugrep tool does match "binary" files like GNU grep does, and also checks if a file is "binary" when the first 16K is not valid UTF or contains a NUL character.

PS. if adding a flag is something to contemplate, then the name of the flag is critical to avoid confusion. So something like PCRE2_MATCH_VALID_UTF does not reveal that only valid UTF is assumed. Likewise PCRE2_VALID_UTF is not descriptive. Perhaps PCRE2_ONLY_VALID_UTF or PCRE2_ASSUME_VALID_UTF?

zherczeg · 2022-12-29T15:02:21Z

Historically PCRE2 only supported valid utf strings. Supporting invalid utf is a relatively new feature. Hence it is hidden behind a flag for compatibility.

ltrzesniewski · 2022-12-29T15:11:06Z

Also, note that you're using pcre2_jit_match, which skips sanity checks, hence the crash.

When using pcre2_match with a JIT-compiled pattern, the library performs a UTF validation pass by default, unless PCRE2_NO_UTF_CHECK is specified.

genivia-inc · 2022-12-29T16:35:45Z

@ltrzesniewski Also, note that you're using pcre2_jit_match, which skips sanity checks, hence the crash.

Understood. But I have to nitpick a bit about this though: note that pcre2_jit_match is not mentioned in the Unicode documentation page so it is easily assumed to behave like pcre2_match (nit number 1). And sure, "sanity checks" are mentioned at the pcre2_jit_match documentation page, but this does not cross reference the Unicode page on specifically what that means (nit number 2), i.e. that page does not define what "sanity checks" are, so it is left to the reader to figure out that this can actually crash.

@ltrzesniewski When using pcre2_match with a JIT-compiled pattern, the library performs a UTF validation pass by default, unless PCRE2_NO_UTF_CHECK is specified.

I understand. But if this pass finds invalid UTF then this returns an error code, which is not what we want. Grepping is more of a "brute force thing" perhaps.

ltrzesniewski · 2022-12-29T16:57:52Z

Indeed, the pcre2_jit_match page only says:

The subject string is not checked for UTF validity.

...but the consequence of that isn't mentioned.

The pcre2jit page gives more details:

Also, unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for validity. In the interests of speed, these checks do not happen on the JIT fast path, and if invalid data is passed, the result is undefined.

The pcre2api page is the most explicit about it:

It may cause your program to crash or loop.

I suppose this scary warning should be added to the other pages as well.

PhilipHazel · 2022-12-29T17:03:26Z

I have made a note to update the documentation in due course.

PhilipHazel · 2023-01-20T17:09:35Z

I have increased the scary level in several pages.

genivia-inc mentioned this issue Dec 28, 2022

segmentation fault occurs with -P (PCRE2 JIT matching) "\w" regex pattern Genivia/ugrep#241

Closed

genivia-inc closed this as completed Dec 29, 2022

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PCRE2 JIT crash in 10.38 and 10.42 #180

PCRE2 JIT crash in 10.38 and 10.42 #180

genivia-inc commented Dec 28, 2022

carenas commented Dec 28, 2022 •

edited

Loading

genivia-inc commented Dec 29, 2022

carenas commented Dec 29, 2022

zherczeg commented Dec 29, 2022

zherczeg commented Dec 29, 2022

genivia-inc commented Dec 29, 2022

zherczeg commented Dec 29, 2022

ltrzesniewski commented Dec 29, 2022 •

edited

Loading

genivia-inc commented Dec 29, 2022

ltrzesniewski commented Dec 29, 2022

PhilipHazel commented Dec 29, 2022

PhilipHazel commented Jan 20, 2023

PCRE2 JIT crash in 10.38 and 10.42 #180

PCRE2 JIT crash in 10.38 and 10.42 #180

Comments

genivia-inc commented Dec 28, 2022

carenas commented Dec 28, 2022 • edited Loading

genivia-inc commented Dec 29, 2022

carenas commented Dec 29, 2022

zherczeg commented Dec 29, 2022

zherczeg commented Dec 29, 2022

genivia-inc commented Dec 29, 2022

zherczeg commented Dec 29, 2022

ltrzesniewski commented Dec 29, 2022 • edited Loading

genivia-inc commented Dec 29, 2022

ltrzesniewski commented Dec 29, 2022

PhilipHazel commented Dec 29, 2022

PhilipHazel commented Jan 20, 2023

carenas commented Dec 28, 2022 •

edited

Loading

ltrzesniewski commented Dec 29, 2022 •

edited

Loading