parsing ranges problem #123

xlazom00 · 2022-05-30T12:55:50Z

Does anybody understand these two ifs?
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L3559-L3565

if (c == parsed_pattern[-2])       /* Optimize one-char range */
  parsed_pattern--;
else if (parsed_pattern[-2] > c)   /* Check range is in order */
  {
  errorcode = ERR8;
  goto FAILED_BACK;
  }

I have two regexps:
(?:[\\xDFFB-\\xDFFE]) compiles
(?:[\\x270A-\\x270D]) does not compile

The second one fails to compile because the second check becomes
parsed_pattern[-2] > 0x27
0x41 > 0x27
where 0x41 comes from the last character of the range start, 'A' => 0x41.
So any idea what these two ifs are supposed to handle?
I can't find anything useful in the git history.

xlazom00 · 2022-05-31T07:33:06Z

I think that the first if can only check ranges like [\xD-\xD], but I am not sure it is that easy to do in the compiler, since the code needs to be more general and not limited to one-char ranges. Are there any other layers where pcre2 can handle this more elegantly?

xlazom00 · 2022-05-31T07:34:17Z

And the second if is the same: it can only check ranges like [\xD-\xE].

xlazom00 · 2022-05-31T07:34:29Z

So I think this is a bug.

ltrzesniewski · 2022-05-31T08:04:39Z

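\x without braces reads at most two hexadecimal digits. So \x270A is parsed as \x27 followed by the literal characters 0 and A. Therefore the range [\x270A-\x270D] is interpreted as [\x27 or 0 or A-\x27 or D], and A-\x27 is not valid because its start (0x41) is greater than its end (0x27).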

Try [\x{270A}-\x{270D}], and of course the regex that compiles will need braces as well.
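To make the two-digit rule concrete, here is a minimal sketch of how a \x escape could be read. It is not PCRE2's actual code: the names hex_value and read_x_escape are hypothetical, the input is assumed to point just past the "\x", and overflow of very long brace sequences is not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not from PCRE2): numeric value of one hex digit.
   The caller must already have checked the character with isxdigit(). */
static unsigned hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return (unsigned)(ch - '0');
    if (ch >= 'a' && ch <= 'f') return (unsigned)(ch - 'a' + 10);
    return (unsigned)(ch - 'A' + 10);
}

/* Reads the part of an escape that follows "\x".
   - "{h..}" : any number of hex digits up to the closing brace
   - "hh"    : at most two hex digits, anything after that is left alone
   Stores the code point in *out and advances *p past what was consumed.
   Returns 0 on success, -1 if a braced escape is empty or unterminated. */
static int read_x_escape(const char **p, unsigned long *out)
{
    const char *s;
    unsigned long value = 0;

    assert(p != NULL && *p != NULL && out != NULL);
    s = *p;

    if (*s == '{') {
        s++;
        if (!isxdigit((unsigned char)*s)) return -1;
        while (isxdigit((unsigned char)*s))
            value = value * 16 + hex_value(*s++);
        if (*s != '}') return -1;
        s++;
    } else {
        int digits = 0;
        while (digits < 2 && isxdigit((unsigned char)*s)) {
            value = value * 16 + hex_value(*s++);
            digits++;
        }
    }

    *out = value;
    *p = s;
    return 0;
}
```

Called on the text "270A", this sketch consumes only "27" and leaves "0A" behind as ordinary characters, whereas "{270A}" is consumed whole and yields 0x270A.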

xlazom00 · 2022-05-31T08:21:16Z

@ltrzesniewski
[\x{270A}-\x{270D}] works fine
but [\x270A-\x270D] is valid

xlazom00 · 2022-05-31T08:48:05Z

Or maybe https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile.c#L1973-L1985
should handle more than two hex digits.
Any idea why it reads only up to two hex digits?

ltrzesniewski · 2022-05-31T09:50:49Z

See the doc:

  \xhh       character with hex code hh
  \x{hh..}   character with hex code hh..
[...]

When \x is not followed by {, from zero to two hexadecimal digits are read, but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be recognized as a hexadecimal escape; otherwise it matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it matches a literal "u".

This means that [\x270A-\x270D] is not valid in PCRE2.

It's also not valid in Perl:

Similarly, \xnn, where nn are hexadecimal digits, matches the character whose native ordinal is nn. Again, not using exactly two digits is a recipe for disaster, but you can use \x{...} to specify any number of hex digits.

I suppose this is for backwards-compatibility reasons. Unicode didn't exist back when Perl was created, and all codepoints were a single byte.
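The effect on the two patterns from the original report can be reproduced with a small sketch. It is not PCRE2's compiler: next_item and check_class are hypothetical names, only literals and \x escapes are supported, and the class body is passed without the surrounding brackets. A return value of 1 stands for the out-of-order range that ERR8 reports in the quoted code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reads one class item at *p: a literal character, \xhh (at most two hex
   digits) or \x{h..}. Assumption: no other escapes exist in this sketch.
   Returns 0 on success, -1 on an empty or unterminated braced escape. */
static int next_item(const char **p, unsigned long *out)
{
    const char *s = *p;
    unsigned long v = 0;
    int braces, digits = 0;

    if (s[0] != '\\' || s[1] != 'x') {      /* plain literal */
        *out = (unsigned char)*s;
        *p = s + 1;
        return 0;
    }
    s += 2;
    braces = (*s == '{');
    if (braces) s++;
    while (isxdigit((unsigned char)*s) && (braces || digits < 2)) {
        int ch = tolower((unsigned char)*s++);
        v = v * 16 + (unsigned long)(isdigit(ch) ? ch - '0' : ch - 'a' + 10);
        digits++;
    }
    if (braces) {
        if (digits == 0 || *s != '}') return -1;
        s++;
    }
    *out = v;
    *p = s;
    return 0;
}

/* Returns 0 if every "lo-hi" range in the class body is in order,
   1 if some range has lo > hi, and -1 on a malformed escape. */
static int check_class(const char *body)
{
    const char *p = body;

    assert(body != NULL);
    while (*p != '\0') {
        unsigned long lo, hi;
        if (next_item(&p, &lo) != 0) return -1;
        if (p[0] == '-' && p[1] != '\0') {  /* a range follows */
            p++;
            if (next_item(&p, &hi) != 0) return -1;
            if (lo > hi) return 1;          /* range out of order */
            /* lo == hi would be a one-char range, which a real compiler
               could store as a single literal. */
        }
    }
    return 0;
}
```

With this sketch, the body \xDFFB-\xDFFE is accepted because the only range it contains is B-\xDF (0x42 to 0xDF), while \x270A-\x270D is rejected because its range is A-\x27 (0x41 to 0x27). The braced form \x{270A}-\x{270D} is accepted as the intended range.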

xlazom00 · 2022-05-31T09:56:14Z

@ltrzesniewski
Thanks, so I am closing this as invalid.

PhilipHazel · 2022-05-31T15:40:33Z

Back in the 1960's, when IBM invented bytes, it seemed like a great idea to have "one unit of storage" = "one character", and who needs more than 255 characters? Before IBM's 360 range, computers had all sorts of different lengths of storage unit. The IBM 7090 range had 36-bit words, the Ferranti Atlas used 48 bits (with addressable "half-words" of 24-bits), and the PDP-8, which post-dates the 360 series, had 12-bit words. Memory was small and expensive, so character strings had to be bit-stuffed into words. This was no longer needed when bytes came along, but if they had only chosen 16 rather than 8 bits we might have managed for longer before needing UTF. :-) Back in the early days octal was used to represent numbers when thinking about bits. It has just occurred to me that the move to mostly hexadecimal might have been caused by the hardware unit changing to a multiple of 4 rather than 3 bits. Anyway, it is certainly history and backwards compatibility that led to the current situation.

xlazom00 closed this as completed May 31, 2022

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed
