parsing ranges problem #123

Closed
xlazom00 opened this issue May 30, 2022 · 9 comments

@xlazom00

xlazom00 commented May 30, 2022

Does anybody understand these two ifs?
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L3559-L3565

    if (c == parsed_pattern[-2])       /* Optimize one-char range */
      parsed_pattern--;
    else if (parsed_pattern[-2] > c)   /* Check range is in order */
      {
      errorcode = ERR8;
      goto FAILED_BACK;
      }

I have two regexes:
(?:[\\xDFFB-\\xDFFE]) compiles
(?:[\\x270A-\\x270D]) does not compile

The second one fails because the check parsed_pattern[-2] > c evaluates as
0x41 > 0x27
where 0x41 is 'A', the last character of the range start \x270A.
So any idea what these two ifs are meant to handle?
I can't find anything useful in the git history.
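
For readers following along, the two branches can be modelled in isolation. This is only a simplified sketch, not PCRE2's code: `classify_range` and the `RANGE_*` names are hypothetical, and `start` and `end` stand in for `parsed_pattern[-2]` and `c`.

```c
/* Hypothetical result codes for this sketch. The real code does not return
   a value: it steps parsed_pattern back or sets errorcode = ERR8. */
enum range_kind { RANGE_SINGLE, RANGE_OK, RANGE_OUT_OF_ORDER };

/* start plays the role of parsed_pattern[-2], end plays the role of c. */
static enum range_kind classify_range(unsigned start, unsigned end)
{
  if (end == start) return RANGE_SINGLE;        /* [a-a] is just "a" */
  if (start > end)  return RANGE_OUT_OF_ORDER;  /* e.g. [z-a] */
  return RANGE_OK;                              /* e.g. [a-z] */
}
```

With the values from the report, `classify_range(0x41, 0x27)` gives `RANGE_OUT_OF_ORDER`, which corresponds to the ERR8 branch.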

@xlazom00
Author

I think that the 1st if should only handle ranges like [\xD-\xD], but I am not sure it is that easy to do in the compiler, since the code needs to be more general and not just handle the one-char case. Are there any other layers where pcre2 could handle this more elegantly?

@xlazom00
Author

and the 2nd if is the same: it can only check ranges like [\xD-\xE]

@xlazom00
Author

So I think this is a bug.

@ltrzesniewski
Contributor

ltrzesniewski commented May 31, 2022

\x reads at most two hexadecimal digits, so \x270A is parsed as \x27 followed by the literal characters 0 and A. The range [\x270A-\x270D] is therefore interpreted as [\x27 or 0 or A-\x27 or 0 or D], and A-\x27 is not a valid range because A (0x41) is greater than \x27.

Try [\x{270A}-\x{270D}], and of course the regex that compiles will need braces as well.
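
To see why one of the two reported patterns compiles and the other does not, here is a rough sketch that walks the body of a character class the way described above. It is not PCRE2's parser: `hex_digit`, `read_item` and `first_bad_range` are hypothetical helpers, only `\xhh`, `\x{h..}` and literal characters are handled, and a `-` is always treated as a range operator when something follows it.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller guarantees ch is a hex digit. */
static unsigned hex_digit(int ch)
{
  return isdigit(ch) ? (unsigned)(ch - '0') : (unsigned)(tolower(ch) - 'a' + 10);
}

/* Read one class item at p, store its code point in *value and return the
   number of chars consumed. Without braces, at most two hex digits are read. */
static size_t read_item(const char *p, unsigned *value)
{
  if (p[0] == '\\' && p[1] == 'x')
    {
    size_t i = 2;
    unsigned v = 0;
    if (p[i] == '{')
      {
      for (i++; isxdigit((unsigned char)p[i]); i++)
        v = v * 16 + hex_digit((unsigned char)p[i]);
      if (p[i] == '}') i++;
      }
    else
      {
      int n;
      for (n = 0; n < 2 && isxdigit((unsigned char)p[i]); n++, i++)
        v = v * 16 + hex_digit((unsigned char)p[i]);
      }
    *value = v;
    return i;
    }
  *value = (unsigned char)p[0];
  return 1;
}

/* Walk the text between '[' and ']'. Returns 1 and fills *start and *end
   for the first range whose start is greater than its end, otherwise 0. */
static int first_bad_range(const char *body, unsigned *start, unsigned *end)
{
  const char *p = body;
  while (*p != '\0')
    {
    unsigned a, b;
    p += read_item(p, &a);
    if (p[0] == '-' && p[1] != '\0')
      {
      p++;
      p += read_item(p, &b);
      if (a > b) { *start = a; *end = b; return 1; }
      }
    }
  return 0;
}
```

In this sketch `\x270A-\x270D` produces the range A-\x27 (0x41 to 0x27), which is out of order. `\xDFFB-\xDFFE` produces B-\xDF (0x42 to 0xDF), which is in order, so it compiles, although it does not describe the range U+DFFB to U+DFFE. The braced form `\x{270A}-\x{270D}` gives a single range from 0x270A to 0x270D.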

@xlazom00
Copy link
Author

xlazom00 commented May 31, 2022

@ltrzesniewski
[\x{270A}-\x{270D}] is working fine
but [\x270A-\x270D] is valid

@xlazom00
Author

xlazom00 commented May 31, 2022

Or maybe https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L1973-L1985
should handle more than two hex digits.
Any idea why it is limited to two hex digits?

@ltrzesniewski
Contributor

ltrzesniewski commented May 31, 2022

See the doc:

  \xhh       character with hex code hh
  \x{hh..}   character with hex code hh..

[...]

When \x is not followed by {, from zero to two hexadecimal digits are read, but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it matches a literal "u".

This means that [\x270A-\x270D] is not valid in PCRE2.
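
The quoted rule for \x without braces can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in pcre2_compile.c: `read_x_digits` and `hex_digit` are hypothetical names, the `alt_bsux` parameter stands in for the PCRE2_ALT_BSUX option, and `p` is assumed to point just after the `\x`.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit; the caller guarantees ch is a hex digit. */
static unsigned hex_digit(int ch)
{
  return isdigit(ch) ? (unsigned)(ch - '0') : (unsigned)(tolower(ch) - 'a' + 10);
}

/* Returns how many hex digits were consumed and stores the resulting
   character in *value. */
static size_t read_x_digits(const char *p, int alt_bsux, unsigned *value)
{
  size_t n = 0;
  unsigned v = 0;

  /* Read zero, one or two hex digits, never more. */
  while (n < 2 && isxdigit((unsigned char)p[n]))
    {
    v = v * 16 + hex_digit((unsigned char)p[n]);
    n++;
    }

  /* In ALT_BSUX mode anything other than exactly two digits means this is
     not a hex escape: the result is a literal "x" and no digits are used. */
  if (alt_bsux && n != 2)
    {
    *value = 'x';
    return 0;
    }

  *value = v;   /* with zero digits the value stays 0 in this sketch */
  return n;
}
```

Either way, the text `270A` after `\x` yields 0x27 and leaves `0A` behind as ordinary characters, which is why the braces are needed for code points above 0xFF.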

It's also not valid in Perl:

Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.

I suppose this is for backwards-compatibility reasons. Unicode didn't exist back when Perl was created, and all codepoints were a single byte.

@xlazom00
Author

@ltrzesniewski
Thanks, so I am closing this as invalid.

@PhilipHazel
Collaborator

Back in the 1960s, when IBM invented bytes, it seemed like a great idea to have "one unit of storage" = "one character", and who needs more than 255 characters? Before IBM's 360 range, computers had all sorts of different lengths of storage unit. The IBM 7090 range had 36-bit words, the Ferranti Atlas used 48 bits (with addressable "half-words" of 24 bits), and the PDP-8, which post-dates the 360 series, had 12-bit words. Memory was small and expensive, so character strings had to be bit-stuffed into words. This was no longer needed when bytes came along, but if they had only chosen 16 rather than 8 bits we might have managed for longer before needing UTF. :-) Back in the early days octal was used to represent numbers when thinking about bits. It has just occurred to me that the move to mostly hexadecimal might have been caused by the hardware unit changing to a multiple of 4 rather than 3 bits. Anyway, it is certainly history and backwards compatibility that led to the current situation.
