bugfixes for UTF support (WIP) #226

carenas · 2023-04-14T13:30:45Z

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following:

PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}=no_utf_check
No match

Several interrelated bug fixes:

An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16 will not allow surrogates if encoded in the pattern directly.
In ALT_BSUX mode, any escaped character should represent itself if not "special", but that was not the case for not ASCII characters.
When match_invalid_utf is enabled a surrogate character in the subject was being skipped, even if no_utf_check was used together with allow_surrogates

Issues that would be addressed with a new version

the documentation around \o and \x, specially when bigger than 255 needs reviewing for clarification.
pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead.
JIT is broken when match_invalid_utf and allow_surrogate.

Issues that are still open and had been punted from this version:

ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.
PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might mean (\u00F1, assume binary matching like it seems to do with surrogates and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.
PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.
PCRE2_EXTRA_ALT_BSUX should restrict the use of , so for example non ASCII or \U throw an error. It probably need to break with PCRE2_ALT_SUX.
Behaviour is different between UTF-16 and UTF-32 but tests are still unified so tests are missing.

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following: PCRE2 version 10.34 2019-11-21 re> /\p{Cs}/match_invalid_utf data> \x{dfff}\=no_utf_check No match Several interrelated bug fixes: * An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16 will not allow surrogates if encoded in the pattern directly. * In ALT_BSUX mode, any escaped character should represent itself if not "special", but that was not the case for not ASCII characters. * When match_invalid_utf is enabled a surrogate character in the subject was being skipped, even if no_utf_check was used together with allow_surrogates Issues that would be addressed with a new version * the documentation around \o and \x, specially when bigger than 255 needs reviewing for clarification. * pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead. * JIT is broken when match_invalid_utf and allow_surrogate. Issues that are still open and had been punted from this version: * ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF. * PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might mean (\u00F1, assume binary matching like it seems to do with surrogates and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?. * PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates. * PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so for example non ASCII or \U throw an error. It probably need to break with PCRE2_ALT_SUX. * Behaviour is different between UTF-16 and UTF-32 but tests are still unified so tests are missing.

carenas · 2023-04-15T06:55:17Z

testdata/testoutput5

+    \x{09f}
+No match
+
+/^\p{Cs}/match_invalid_utf,allow_surrogates


this replicates the behaviour where utf is used instead of match_invalid_utf, by using the "allow_surrogates" to redefine what is valid, but maybe it shouldn't match instead since it is invalid UTF-16?

carenas · 2023-04-15T06:57:50Z

src/pcre2.h.in

@@ -146,7 +146,7 @@ D   is inspected during pcre2_dfa_match() execution

 /* An additional compile options word is available in the compile context. */

-#define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  0x00000001u  /* C */
+#define PCRE2_EXTRA_ALLOW_SURROGATES         0x00000001u  /* C */


It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?

Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.

... and I note that the GitHub tests have failed...

correct, I posted this as a work in progress, to help gather feedback on the approach and caveats, and a roundabout way to ask Zoltan for help implementing the new behaviour (use a flag to redefine what is invalid utf), so that match_invalid_utf and utf behave similarly and resolve the ambiguity when matching \p{Cs} in the JIT, if it was sound and correct for the interpreter.

Will update the change and make the RFC more explicit.

src/pcre2_intmodedep.h

src/pcre2_match.c

carenas commented Apr 15, 2023

View reviewed changes

src/pcre2_intmodedep.h Show resolved Hide resolved

carenas commented Apr 15, 2023

View reviewed changes

src/pcre2_match.c Show resolved Hide resolved

carenas changed the title ~~bugfixes for UTF support~~ bugfixes for UTF support (WIP) Apr 15, 2023

SolitaryGrass mentioned this pull request May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bugfixes for UTF support (WIP) #226

bugfixes for UTF support (WIP) #226

carenas commented Apr 14, 2023

carenas Apr 15, 2023

carenas Apr 15, 2023

PhilipHazel Apr 15, 2023

PhilipHazel Apr 15, 2023

carenas Apr 15, 2023

bugfixes for UTF support (WIP) #226

Are you sure you want to change the base?

bugfixes for UTF support (WIP) #226

Conversation

carenas commented Apr 14, 2023

carenas Apr 15, 2023

Choose a reason for hiding this comment

carenas Apr 15, 2023

Choose a reason for hiding this comment

PhilipHazel Apr 15, 2023

Choose a reason for hiding this comment

PhilipHazel Apr 15, 2023

Choose a reason for hiding this comment

carenas Apr 15, 2023

Choose a reason for hiding this comment