Allow restricting [:xdigit:] to ASCII for POSIX compatibility
carenas committed Oct 4, 2023
1 parent 2fef163 commit e29a775
Showing 5 changed files with 78 additions and 21 deletions.
7 changes: 4 additions & 3 deletions ChangeLog
@@ -119,7 +119,7 @@ does the same, except for \u{...}, which is recognized only when
PCRE2_EXTRA_ALT_BSUX is set. This an ECMAScript, non-Perl compatible,
extension, so PCRE2 follows ECMAScript rather than Perl.

-31. Applied pull request #300 bu Carlo, which fixes #261. The bug was that
+31. Applied pull request #300 by Carlo, which fixes #261. The bug was that
pcre2_match() was not fully resetting all captures that had been set within a
(possibly recursive) subroutine call such as (?3).

@@ -128,8 +128,9 @@ now matches characters whose general categories are L or N or whose particular
categories are Mn (non-spacing mark) or Pc (combining puntuation). The latter
includes underscore.

-33. Changed the meaning of [:digit:] in UCP mode to match Perl. It now also
-matches the "fullwidth" versions of the hex digits.
+33. Changed the meaning of [:xdigit:] in UCP mode to match Perl. It now also
+matches the "fullwidth" versions of the hex digits. Just like it is done for
+[:digit:], PCRE2_EXTRA_ASCII_DIGIT can be used to keep this class ASCII only.


Version 10.42 11-December-2022
8 changes: 5 additions & 3 deletions doc/pcre2pattern.3
@@ -1569,9 +1569,11 @@ plus those characters with code points less than 256 that have the S (Symbol)
property.
.TP 10
[:xdigit:]
-In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
-versions of those characters, whose Unicode code points start at U+FF10. This
-is a change that was made in PCRE release 10.43 for Perl compatibility.
+In addition to the ASCII hexadecimal digits, this also matches the "fullwidth"
+versions of those characters, whose Unicode code points start at U+FF10. Just
+like in the case of [:digit:], the effect of PCRE2_UCP can be negated by
+setting the PCRE2_EXTRA_ASCII_DIGIT option. This is a change that was made in
+PCRE release 10.43 for Perl compatibility.
.P
The other POSIX classes are unchanged by PCRE2_UCP, and match only characters
with code points less than 256. The effect of PCRE2_UCP on POSIX classes can be
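To make the documented behaviour concrete, here is a minimal sketch of which code points [:xdigit:] would accept under each option combination. It is not PCRE2's implementation: `sketch_is_xdigit` and its two boolean parameters are hypothetical names standing in for PCRE2_UCP and PCRE2_EXTRA_ASCII_DIGIT, and PCRE2_EXTRA_ASCII_POSIX is deliberately ignored. The fullwidth ranges are the Unicode fullwidth forms of 0-9, A-F and a-f.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: models what [:xdigit:] accepts.
   ucp         stands in for PCRE2_UCP being set
   ascii_digit stands in for PCRE2_EXTRA_ASCII_DIGIT being set */
bool sketch_is_xdigit(uint32_t cp, bool ucp, bool ascii_digit)
{
  /* ASCII hexadecimal digits match in every mode. */
  if ((cp >= '0' && cp <= '9') ||
      (cp >= 'A' && cp <= 'F') ||
      (cp >= 'a' && cp <= 'f'))
    return true;

  /* Without UCP, or with the ASCII-only override, nothing else matches. */
  if (!ucp || ascii_digit)
    return false;

  /* Fullwidth forms: U+FF10..U+FF19 (digits), U+FF21..U+FF26 (A-F),
     U+FF41..U+FF46 (a-f). */
  return (cp >= 0xFF10 && cp <= 0xFF19) ||
         (cp >= 0xFF21 && cp <= 0xFF26) ||
         (cp >= 0xFF41 && cp <= 0xFF46);
}
```

With both flags set the function falls back to ASCII only, which mirrors the added test in which `\x{ff26}8` no longer matches under `ucp,ascii_digit`.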
32 changes: 17 additions & 15 deletions src/pcre2_compile.c
@@ -706,6 +706,7 @@ static const char posix_names[] =
static const uint8_t posix_name_lengths[] = {
5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 4, 6, 0 };

+#define PC_DIGIT 7
#define PC_GRAPH 8
#define PC_PRINT 9
#define PC_PUNCT 10
@@ -722,20 +723,20 @@ absolute value of the third field has these meanings: 0 => no tweaking, 1 =>
remove vertical space characters, 2 => remove underscore. */

static const int posix_class_maps[] = {
-cbit_word, cbit_digit, -2, /* alpha */
-cbit_lower, -1, 0, /* lower */
-cbit_upper, -1, 0, /* upper */
-cbit_word, -1, 2, /* alnum - word without underscore */
-cbit_print, cbit_cntrl, 0, /* ascii */
-cbit_space, -1, 1, /* blank - a GNU extension */
-cbit_cntrl, -1, 0, /* cntrl */
-cbit_digit, -1, 0, /* digit */
-cbit_graph, -1, 0, /* graph */
-cbit_print, -1, 0, /* print */
-cbit_punct, -1, 0, /* punct */
-cbit_space, -1, 0, /* space */
-cbit_word, -1, 0, /* word - a Perl extension */
-cbit_xdigit,-1, 0 /* xdigit */
+cbit_word, cbit_digit, -2, /* alpha */
+cbit_lower, -1, 0, /* lower */
+cbit_upper, -1, 0, /* upper */
+cbit_word, -1, 2, /* alnum - word without underscore */
+cbit_print, cbit_cntrl, 0, /* ascii */
+cbit_space, -1, 1, /* blank - a GNU extension */
+cbit_cntrl, -1, 0, /* cntrl */
+cbit_digit, -1, 0, /* digit */
+cbit_graph, -1, 0, /* graph */
+cbit_print, -1, 0, /* print */
+cbit_punct, -1, 0, /* punct */
+cbit_space, -1, 0, /* space */
+cbit_word, -1, 0, /* word - a Perl extension */
+cbit_xdigit, -1, 0 /* xdigit */
};

#ifdef SUPPORT_UNICODE
@@ -3676,7 +3677,8 @@ while (ptr < ptrend)
#ifdef SUPPORT_UNICODE
if ((options & PCRE2_UCP) != 0 &&
(xoptions & PCRE2_EXTRA_ASCII_POSIX) == 0 &&
-!(posix_class == 7 && (xoptions & PCRE2_EXTRA_ASCII_DIGIT) != 0))
+!((xoptions & PCRE2_EXTRA_ASCII_DIGIT) != 0 &&
+(posix_class == PC_DIGIT || posix_class == PC_XDIGIT)))
{
int ptype = posix_substitutes[2*posix_class];
int pvalue = posix_substitutes[2*posix_class + 1];
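The rewritten condition in the hunk above can be read in isolation as a small predicate. The following is a sketch under stated assumptions, not code from pcre2_compile.c: the `SKETCH_*` bit values are stand-ins chosen for this example (real code takes the flags from pcre2.h), `sketch_uses_ucp_substitute` is a hypothetical name, and the value 13 for the xdigit index is inferred from the order of `posix_class_maps[]` because the `PC_XDIGIT` definition lies outside the visible hunk.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit values for this sketch only; they are NOT the pcre2.h values. */
#define SKETCH_UCP               0x1u  /* models PCRE2_UCP (options word) */
#define SKETCH_EXTRA_ASCII_POSIX 0x1u  /* models PCRE2_EXTRA_ASCII_POSIX (extra options word) */
#define SKETCH_EXTRA_ASCII_DIGIT 0x2u  /* models PCRE2_EXTRA_ASCII_DIGIT (extra options word) */

/* Class indices follow the row order of posix_class_maps[];
   13 for xdigit is an inference, not taken from the visible diff. */
#define SKETCH_PC_DIGIT   7
#define SKETCH_PC_XDIGIT 13

/* Hypothetical predicate: true when a POSIX class would be replaced by its
   Unicode-property substitute, following the condition added in this commit. */
bool sketch_uses_ucp_substitute(uint32_t options, uint32_t xoptions,
                                int posix_class)
{
  /* ASCII_DIGIT now covers both [:digit:] and [:xdigit:]. */
  bool ascii_digit_applies =
    (xoptions & SKETCH_EXTRA_ASCII_DIGIT) != 0 &&
    (posix_class == SKETCH_PC_DIGIT || posix_class == SKETCH_PC_XDIGIT);

  return (options & SKETCH_UCP) != 0 &&
         (xoptions & SKETCH_EXTRA_ASCII_POSIX) == 0 &&
         !ascii_digit_applies;
}
```

Before this commit only class index 7 was exempted, so `[:xdigit:]` kept its Unicode substitute even when the ASCII-digit option was set. Other classes, such as graph (index 8), are unaffected by that option in both versions.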
18 changes: 18 additions & 0 deletions testdata/testinput5
@@ -1234,6 +1234,8 @@

/[[:xdigit:]]/B,ucp

+/[[:xdigit:]]/B,ucp,ascii_digit
+
# Unicode properties for \b and \B

/\b...\B/utf,ucp
@@ -2445,6 +2447,22 @@
/[[:digit:]]+/utf,ucp,ascii_posix
123\x{660}456

+/^[[:xdigit:]]+$/utf,ucp
+f0
+1A
+d\x{ff10}\=no_jit
+\x{ff26}8
+\= Expect no match
+8g\=no_jit
+
+/^[[:xdigit:]]+$/utf,ucp,ascii_digit
+f0
+1A
+\= Expect no match
+d\x{xfff0}
+\x{ff26}8
+8g
+
/>[[:space:]]+</utf,ucp
>\x{a0} \x{a0}<
>\x{a0}\x{a0}\x{a0}<
34 changes: 34 additions & 0 deletions testdata/testoutput5
@@ -2583,6 +2583,14 @@ No match
End
------------------------------------------------------------------

+/[[:xdigit:]]/B,ucp,ascii_digit
+------------------------------------------------------------------
+Bra
+[0-9A-Fa-f]
+Ket
+End
+------------------------------------------------------------------
+
# Unicode properties for \b and \B

/\b...\B/utf,ucp
@@ -5384,6 +5392,32 @@ No match
123\x{660}456
0: 123

+/^[[:xdigit:]]+$/utf,ucp
+f0
+0: f0
+1A
+0: 1A
+d\x{ff10}\=no_jit
+0: d\x{ff10}
+\x{ff26}8
+0: \x{ff26}8
+\= Expect no match
+8g\=no_jit
+No match
+
+/^[[:xdigit:]]+$/utf,ucp,ascii_digit
+f0
+0: f0
+1A
+0: 1A
+\= Expect no match
+d\x{xfff0}
+No match
+\x{ff26}8
+No match
+8g
+No match
+
/>[[:space:]]+</utf,ucp
>\x{a0} \x{a0}<
0: >\x{a0} \x{a0}<