perldot: perl compatible matching of "." when using NUL or CR terminated subjects #45

carenas · 2021-11-13T04:39:23Z

The first patch could be taken AS-IS, but the rest are only a POC (meant more for design discussion), since they are probably incomplete, are missing tests and documentation, and will need serious refactoring.

The names are also probably questionable, and pieces might be missing, as I am not familiar with this part of the codebase and simply made the changes that were needed to compile. It seems to work fine in my tests, though, so I think it could do with some peer review.

Fixes: #43

3d80fa4 (Implement PCRE2_NEWLINE_NUL., 2017-05-26) added support for it but missed some spots in the documentation. Add the ones found while reading the code and documentation for the issue that will be addressed in the next commit.
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

When '.' is used in a regexp, it matches all characters except those defined as line delimiters, which is a configurable set in PCRE; 2 of those sets do not include '\n'.

perl allows for a configurable line delimiter string (not a set), and therefore treats '\n' specially, preventing it from matching regardless of what the delimiter contains. Therefore, when PCRE uses one of those sets without '\n', the matches will differ:

$ printf 'a\nb' | perl -n0le '/a.b/ or exit 1'; echo $?
1
$ printf 'a\nb' | pcre2grep -q -NNUL 'a.b'; echo $?
0

Since the current behaviour for '.' is historical, a new compile option has been invented to allow PCRE to match perl's behaviour as an alternative.
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
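
The divergence described in this commit message can be sketched outside PCRE2. The snippet below is only an illustration and uses Python's `re` module as a stand-in (an assumption, not something this PR touches): its `.` rejects only `\n`, which resembles perl here, while the class `[^\x00]` imitates a dot whose only line delimiter is NUL.

```python
import re

subject = "a\nb"

# Python's "." never matches "\n" unless re.DOTALL is set,
# which resembles the perl behaviour described above.
perl_like = re.search(r"a.b", subject)

# Assumption: "[^\x00]" stands in for "." when NUL is the only line delimiter.
nul_delimited = re.search(r"a[^\x00]b", subject)

print("perl-like dot matched:", perl_like is not None)
print("NUL-delimited dot matched:", nul_delimited is not None)
```

The two results differ for the same subject, which is the mismatch the proposed option is meant to remove.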

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

Invent a new pattern modifier to allow setting the needed option at compile time. Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

zherczeg · 2021-11-14T13:34:33Z

This looks very special. Is there a reason for this (other than perl)? I mean what is the typical use case?

carenas · 2021-11-14T17:38:57Z

The "use case" came from an old GNU grep regression report[1] and the corresponding test, which I came across while migrating that codebase to PCRE2.

I don't find it very useful either, and indeed prefer PCRE's behaviour, but the fact that changing the underlying library and starting to use PCRE2_NEWLINE_NUL (which should help with several other issues) prevents anyone from getting the perl behaviour seems like a compatibility bug worth solving IMHO, and of course, as you said, it is perl behaviour.

[1] https://lists.gnu.org/archive/html/bug-grep/2015-07/msg00002.html

zherczeg · 2021-11-14T18:41:54Z

I am sorry, but I don't really understand the bug. The pattern is /(?<=\n\n\n).*/ and grep (2.21) does not match "\n\n\n\nThis line has three blank lines above it." If I try it in PCRE2, it matches without any problem.

zherczeg · 2021-11-14T18:47:02Z

Just a thought: .* matches an empty string, and this might confuse grep internals.

carenas · 2021-11-14T20:22:18Z

> I am sorry, but I don't really understand the bug. The pattern is /(?<=\n\n\n).*/ and grep (2.21) does not match "\n\n\n\nThis line has three blank lines above it." If I try it in PCRE2, it matches without any problem.

It should match "anything that is preceded by 3 '\n'", and indeed it matches if you happen to have 4; the difference in output between perl and pcre comes from the definition of what matches '.'.

perl (and grep, because it is using PCRE*_NEWLINE_LF in a weird way, even when the delimiter was NUL), will ignore that extra '\n', which is what we can't do in PCRE2 unless something like the proposed patch is added.

Agreed, the behaviour in the bug report wasn't correct either, and indeed my suggestion[1] after "fixing" it was to consider the PCRE behaviour (with PCRE2_NEWLINE_NUL) as the correct one, but it would technically be a regression and probably a surprise to people who expect the "P" in PCRE to mean full compatibility with perl.

[1] https://git.savannah.gnu.org/cgit/grep.git/commit/?id=015d028d0598f31d5aa25e5c47dfe8872afb4e6e

zherczeg · 2021-11-15T06:25:51Z

I know little about perl, but here is my source code:

$str = "\n\n\n\nThis";
if ($str =~ /(?<=\n\n\n)(.*)/) {
   print "'$1'\n";
}

It prints ''. If $str is "\n\n\nThis" it prints 'This'. This is exactly how pcre2 works.

PCRE2 has a "not empty" flag for matching, would it be possible to use that flag?

carenas · 2021-11-16T04:48:13Z

> It prints ''. If $str is "\n\n\nThis" it prints 'This'. This is exactly how pcre2 works.

Correct; the "empty match" is because the expression doesn't have the "s modifier", so '.' can't match '\n'. The '.*' therefore can't repeat and returns an empty match, which I have to admit I didn't expect.

But that is not the issue discussed here, and indeed when the line delimiter is '\n' perl and PCRE are in perfect alignment; the divergence only occurs when using a line delimiter that doesn't include '\n', like '\r' or '\0'.

> PCRE2 has a "not empty" flag for matching, would it be possible to use that flag?

It might work for this specific case, but it will still fail when the '\n' is embedded in the middle of the subject, as shown in the test cases. Any subject that has '\n' won't match, regardless of what the line delimiter is (-0 means NUL for perl).

$ printf "a\nb" | perl -0ne 'print length($1)+1,"\n" if /(a.b)/'
$ printf "a\nb" | pcre2grep -o -NNUL -a '(a.b)' | wc -c
       4
$ printf "a\0b" | perl -0ne 'print length($1)+1,"\n" if /(a.b)/'
$ printf "a\0b" | pcre2grep -o -NNUL -a '(a.b)' | wc -c
       0

zherczeg · 2021-11-16T06:34:49Z

In other words, \n should not match dot, even if the newline type(s) do not include \n? And this behaviour is hidden behind an option flag. Well, it should have no effect on jit, since it won't compile anything when not needed, but it looks like a bigger impact on the interpreter when the dot is repeated. So from my side this is ok, but let's wait for @PhilipHazel's opinion.

PhilipHazel · 2021-11-16T17:16:08Z

I've not been paying much attention to this discussion (working on another project), but I've now had some thoughts. The Perl man page says this: "To simplify multi-line substitutions, the "." character never matches a newline unless you use the "/s" modifier". The PCRE2 documentation says "a dot in the pattern matches any one character in the subject string except (by default) a character that signifies the end of a line". So, what does Perl mean by "newline"? I found this documentation:

"Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF."

My reading of that is that, if running in an environment where, say, CR means "newline", a dot metacharacter would match the character 0x0A and the escape sequence \n would mean 0x0D (which dot would not match). I believe that this matching is the way PCRE2 currently behaves, except that, of course, there are modes where there are several alternative line endings.

In a PCRE2 pattern \n always means 0x0A of course. If Perl is behaving as I think the documentation implies, then \n should mean \0 when -0ne is set, which is why the first example above does not match. I guess I could test...hmm. In my Linux environment, it seems that \n always gives 0x0A, even when -0 or -0x0D is specified.

I am now confused. However, given that PCRE2 has a different approach to newlines than Perl (recognizing more than one type at once, for example), I think the current behaviour makes sense. The documentation could always be improved, of course, especially if we can figure out exactly how Perl behaves, in order to describe the differences.

carenas · 2021-11-17T00:20:34Z

> "Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF."

This paragraph explains, IMHO clearly, why in perl '.' never matches '\n': its definition is fixed at compile time to match what would be expected of the text files fed to it.

It also explains why -0 doesn't have the same effect on it that the equivalent option in PCRE has, and as you pointed out leads to confusion.

> I am now confused. However, given that PCRE2 has a different approach to newlines than Perl (recognizing more than one type at once, for example), I think the current behaviour makes sense. The documentation could always be improved, of course, especially if we can figure out exactly how Perl behaves, in order to describe the differences.

Agreed, the current approach makes sense and matches the documentation; my concern again is about perl compatibility and the fact that with the current implementation there is just no way to keep the matches consistent between PCRE and perl, because of that special place that perl has for '\n'.

I was hoping with this POC to prove that the amount of code changes required to support it (at least in *NIX) wasn't that big of a maintenance burden IMHO. Doing so would also probably give GNU grep, which has moved to PCRE2, the opportunity to finally remove that "-Pz is highly experimental and might not do what you expect" note from its documentation, thanks to the library support for PCRE2_NEWLINE_NUL.

zherczeg · 2021-11-17T07:05:27Z

There is another way to implement this, and that does not need any matching changes: convert dot to [^\0\n] internally during parsing. These conversions can be done by repan ( https://github.com/zherczeg/repan ) easily.
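
A minimal sketch of that idea, under stated assumptions: the helper name `dot_class` and its parameters are hypothetical, Python's `re` is used only as a stand-in engine, and this is neither repan's API nor PCRE2 code. It builds the explicit negated class that a perl-compatible dot would be rewritten to for a given set of line delimiters.

```python
import re

def dot_class(newline_chars, perl_compat=True):
    """Build a negated class rejecting every line delimiter.

    Hypothetical helper: with perl_compat set, LF is rejected as well,
    regardless of which delimiters were configured.
    """
    excluded = set(newline_chars)
    if perl_compat:
        excluded.add("\n")
    # Emit hex escapes so NUL and control characters stay readable.
    body = "".join("\\x%02x" % ord(c) for c in sorted(excluded))
    return "[^" + body + "]"

# NUL-delimited subjects: the rewritten dot rejects both NUL and LF.
nul_dot = dot_class("\x00")
pattern = re.compile("a" + nul_dot + "b")

print(nul_dot)
print(pattern.search("a\nb") is None)
print(pattern.search("a\x00b") is None)
print(pattern.search("axb") is not None)
```

A real conversion would also have to leave alone dots that are escaped or that sit inside character classes, which this sketch does not attempt.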

PhilipHazel · 2021-12-01T16:25:51Z

I have just committed a couple of documentation updates (pcre2compat and pcre2syntax) that are intended to clearly explain PCRE2's behaviour. I don't think we should make any change.

carenas changed the title ~~perldot~~ perldot: perl compatible matching of "." when using NUL or CR terminated subjects Nov 13, 2021

carenas marked this pull request as draft November 13, 2021 21:15

carenas force-pushed the perldot branch from 269f214 to 76b1170 Compare November 14, 2021 07:04

carenas added 3 commits November 14, 2021 00:55

jit: add compile support for OP_ANY_NOTNL

369a4e8

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

pcre2test: add support for perl compatible . testing

d35cd28

Invent a new pattern modifier to allow setting the needed option at compile time. Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>

carenas force-pushed the perldot branch from 76b1170 to d35cd28 Compare November 14, 2021 10:35

carenas closed this Dec 2, 2021

SolitaryGrass mentioned this pull request May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed
