bugfixes for UTF support (WIP) #226

Draft
wants to merge 1 commit into
base: master

carenas (Contributor) commented Apr 14, 2023

Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so that it also covers characters with those code points when they are not escaped. This also resolves the ambiguity that allowed the following:

  PCRE2 version 10.34 2019-11-21
    re> /\p{Cs}/match_invalid_utf
  data> \x{dfff}\=no_utf_check
  No match
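
For context, \p{Cs} names the Unicode general category of surrogate code points, U+D800 to U+DFFF, so the subject \x{dfff} in the session above is a low surrogate. A minimal sketch of that classification follows; the type and function names are made up for illustration and are not part of PCRE2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names only; none of these exist in PCRE2. */
typedef enum { NOT_SURROGATE, HIGH_SURROGATE, LOW_SURROGATE } surrogate_kind;

/* Classifies a code point against the two surrogate ranges. */
surrogate_kind classify_surrogate(uint32_t cp)
{
  assert(cp <= 0x10FFFFu);  /* must lie within the Unicode code space */
  if (cp >= 0xD800u && cp <= 0xDBFFu) return HIGH_SURROGATE;
  if (cp >= 0xDC00u && cp <= 0xDFFFu) return LOW_SURROGATE;
  return NOT_SURROGATE;
}

/* True for any code point in general category Cs (U+D800..U+DFFF). */
bool is_category_cs(uint32_t cp)
{
  return classify_surrogate(cp) != NOT_SURROGATE;
}
```

Under this definition is_category_cs(0xDFFF) is true, which is why a "No match" result for the session above looks ambiguous.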

Several interrelated bug fixes:

  • A UTF-8 or UTF-32 encoded character with a surrogate code point should be allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change, the non-UTF-16 modes did not allow surrogates that were encoded directly in the pattern.
  • In ALT_BSUX mode, any escaped character that is not "special" should represent itself, but that was not the case for non-ASCII characters.
  • When match_invalid_utf was enabled, a surrogate character in the subject was skipped, even if no_utf_check was used together with allow_surrogates.

Issues that would be addressed with a new version:

  • The documentation around \o and \x, especially for values greater than 255, needs reviewing and clarification.
  • pcre2test support for testing escaped UTF characters is suboptimal, so most testing was done with custom code that should be integrated instead.
  • JIT is broken when match_invalid_utf and allow_surrogates are used together.

Issues that are still open and have been punted from this version:

  • ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.
  • PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 should mean: \u00F1, binary matching against an extended 0xf1 PCRE2_UCHAR (as seems to happen with surrogates), or an error. If the latter, is something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?
  • PCRE2_ALT_BSUX matching shouldn't really do UTF matching when using surrogates.
  • PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so that, for example, non-ASCII characters or \U throw an error. It probably needs to break with PCRE2_ALT_BSUX.
  • Behaviour differs between UTF-16 and UTF-32, but the tests are still unified, so some tests are missing.
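
As background for the UTF-16 versus UTF-32 difference: in UTF-16 a high and a low surrogate normally appear as a pair that together encodes one supplementary code point, whereas in UTF-32 a surrogate value can only ever appear as a single, lone code unit. The sketch below shows the standard pair arithmetic; the function name is chosen for this example and is not taken from PCRE2.

```c
#include <assert.h>
#include <stdint.h>

/* Combines a UTF-16 surrogate pair into the supplementary code point it
   encodes (U+10000..U+10FFFF). Illustrative helper, not a PCRE2 function. */
uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
  assert(high >= 0xD800u && high <= 0xDBFFu);  /* high (lead) surrogate */
  assert(low  >= 0xDC00u && low  <= 0xDFFFu);  /* low (trail) surrogate */
  return 0x10000u
       + (((uint32_t)high - 0xD800u) << 10)
       +  ((uint32_t)low  - 0xDC00u);
}
```

For example, the pair D83D DE00 combines to U+1F600. Only an unpaired surrogate in UTF-16 is comparable to the lone values seen in the other code unit widths, so tests shared across widths cannot exercise both situations.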

\x{09f}
No match

/^\p{Cs}/match_invalid_utf,allow_surrogates
Contributor Author

This replicates the behaviour seen when utf is used instead of match_invalid_utf, by using allow_surrogates to redefine what is valid. But maybe it shouldn't match, since it is invalid UTF-16?

@@ -146,7 +146,7 @@ D is inspected during pcre2_dfa_match() execution

/* An additional compile options word is available in the compile context. */

-#define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u /* C */
+#define PCRE2_EXTRA_ALLOW_SURROGATES 0x00000001u /* C */
Contributor Author

It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?

Collaborator

Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.
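
The compatible alternative discussed here (keep the old option and add a new one that implies it) could look roughly like the sketch below. The DEMO_ names and the value of the new bit are hypothetical; only 0x00000001u for the existing option comes from the diff, and the real bit assignment would be up to the maintainers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing option, same value as in the diff. */
#define DEMO_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u
/* Hypothetical new bit, chosen arbitrarily for this sketch. */
#define DEMO_EXTRA_ALLOW_SURROGATES_BIT    0x40000000u
/* The new option sets its own bit plus the old one, so it implies it. */
#define DEMO_EXTRA_ALLOW_SURROGATES \
  (DEMO_EXTRA_ALLOW_SURROGATES_BIT | DEMO_EXTRA_ALLOW_SURROGATE_ESCAPES)

/* Compile-time check that the new option really implies the old one. */
static_assert((DEMO_EXTRA_ALLOW_SURROGATES &
               DEMO_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0,
              "new option must imply the old one");

/* Escaped surrogates such as \x{d800} are permitted. */
bool escapes_allowed(uint32_t extra)
{
  return (extra & DEMO_EXTRA_ALLOW_SURROGATE_ESCAPES) != 0;
}

/* Unescaped (literal) surrogates are permitted as well. */
bool literals_allowed(uint32_t extra)
{
  return (extra & DEMO_EXTRA_ALLOW_SURROGATES_BIT) != 0;
}
```

Existing callers that pass only the old option keep their current behaviour, while callers that pass the new one get both.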

Collaborator

... and I note that the GitHub tests have failed...

Contributor Author

Correct. I posted this as a work in progress to help gather feedback on the approach and its caveats, and as a roundabout way to ask Zoltan for help implementing the new behaviour (using a flag to redefine what is invalid UTF), so that match_invalid_utf and utf behave similarly and the ambiguity when matching \p{Cs} is resolved in the JIT, provided the approach is sound and correct for the interpreter.

Will update the change and make the RFC more explicit.

@carenas changed the title from "bugfixes for UTF support" to "bugfixes for UTF support (WIP)" on Apr 15, 2023