perldot: perl compatible matching of "." when using NUL or CR terminated subjects #45

Closed
wants to merge 4 commits into from


carenas
Contributor

@carenas carenas commented Nov 13, 2021

The first patch could be taken as-is, but the rest are only a POC (meant more for design discussion): they are probably incomplete, are missing tests and documentation, and will need serious refactoring.

It also probably has questionable names and might be missing pieces, as I am not familiar with this part of the codebase and simply made the changes that were needed to get it to compile. It seems to work fine in my tests, though, so I think it could do with some peer review.

Fixes: #43

3d80fa4 (Implement PCRE2_NEWLINE_NUL., 2017-05-26) added support for it
but missed some spots in the documentation.

Add the ones found while reading code and documentation to fix the issue
that will be addressed in the next commit.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
@carenas carenas changed the title perldot perldot: perl compatible matching of "." when using NUL or CR terminated subjects Nov 13, 2021
@carenas carenas marked this pull request as draft November 13, 2021 21:15
When '.' is used in a regexp, it matches all characters except the
ones defined as line delimiters, which is a configurable set in
PCRE; 2 of those sets do not include '\n'.

perl allows for a configurable line delimiter string (not a set),
and treats '\n' specially, preventing it from matching '.'
regardless of what the delimiter contains. Therefore, when PCRE
uses one of those sets without '\n', the matches will differ:

 $ printf 'a\nb' | perl -n0le '/a.b/ or exit 1'; echo $?
 1
 $ printf 'a\nb' | pcre2grep -q -NNUL 'a.b'; echo $?
 0

Since the current behaviour for '.' is historical, a new compile
option has been invented to allow PCRE to match perl's behaviour
as an alternative.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Invent a new pattern modifier to allow setting the needed option
at compile time.

Signed-off-by: Carlo Marcelo Arenas Belón <carenas@gmail.com>
@zherczeg
Collaborator

This looks very special. Is there a reason for this (other than perl)? I mean what is the typical use case?

@carenas
Contributor Author

carenas commented Nov 14, 2021

the "use case" came from an old GNU grep regression report[1] and the corresponding test, which I came across while migrating that codebase to PCRE2.

I don't find it very useful either, and indeed prefer PCRE's behaviour, but the fact that changing the underlying library and starting to use PCRE2_NEWLINE_NUL (which should help with several other issues) prevents anyone from getting the perl behaviour seems like a compatibility bug worth solving IMHO, and of course, as you said, it is perl behaviour.

[1] https://lists.gnu.org/archive/html/bug-grep/2015-07/msg00002.html

@zherczeg
Collaborator

zherczeg commented Nov 14, 2021

I am sorry, but I don't really understand the bug. The pattern is /(?<=\n\n\n).*/ and grep (2.21) does not match it against "\n\n\n\nThis line has three blank lines above it". If I try it in PCRE2, it matches without any problem.

@zherczeg
Collaborator

Just a thought: .* can match an empty string, and this might confuse grep internals.

@carenas
Contributor Author

carenas commented Nov 14, 2021

> I am sorry, but I don't really understand the bug. The pattern is /(?<=\n\n\n).*/ and grep (2.21) does not match it against "\n\n\n\nThis line has three blank lines above it". If I try it in PCRE2, it matches without any problem.

It should match "anything that is preceded by 3 '\n'", and indeed it matches if you happen to have 4; the difference in output between perl and pcre comes from the definition of what matches '.'.

perl (and grep, because it is using PCRE*_NEWLINE_LF in a weird way, even when the delimiter was NUL) will ignore that extra '\n', which is what we can't do in PCRE2 unless something like the proposed patch is added.

I agree the behaviour in the bug report wasn't correct either, and indeed my suggestion[1] after "fixing" it was to consider the PCRE behaviour (with PCRE2_NEWLINE_NUL) as the correct one, but it would technically be a regression and probably a surprise to people who expect the "P" in PCRE to mean full compatibility with perl.

[1] https://git.savannah.gnu.org/cgit/grep.git/commit/?id=015d028d0598f31d5aa25e5c47dfe8872afb4e6e

@zherczeg
Collaborator

I know little about perl, but here is my source code:

$str = "\n\n\n\nThis";
if ($str =~ /(?<=\n\n\n)(.*)/) {
   print "'$1'\n";
}

It prints ''. If $str is "\n\n\nThis" it prints 'This'. This is exactly how pcre2 works.

PCRE2 has a "not empty" flag for matching, would it be possible to use that flag?
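
For readers without perl at hand, the same experiment can be sketched in Python. This is only an analogy: Python's `re` module is neither Perl nor PCRE2, but its default '.' also refuses to match '\n'. The helper name `first_capture` is made up for this illustration, and using `.+` as a stand-in for a "not empty" match is an assumption that happens to hold for this subject, not a description of how the PCRE2 flag is implemented.

```python
import re

# Same pattern as the Perl snippet above: three newlines behind, then '.*'.
PATTERN = re.compile(r"(?<=\n\n\n)(.*)")

# Rough stand-in for "reject empty matches": require at least one character.
NON_EMPTY = re.compile(r"(?<=\n\n\n)(.+)")


def first_capture(regex, subject):
    """Return group 1 of the first match, or None when nothing matches."""
    m = regex.search(subject)
    return m.group(1) if m else None


# Four newlines: the lookbehind first succeeds at offset 3, where the next
# character is '\n', so '.*' repeats zero times and the capture is empty.
print(repr(first_capture(PATTERN, "\n\n\n\nThis")))    # ''

# Three newlines: the lookbehind first succeeds right before 'This'.
print(repr(first_capture(PATTERN, "\n\n\nThis")))      # 'This'

# Forcing a non-empty match moves the match start one position further.
print(repr(first_capture(NON_EMPTY, "\n\n\n\nThis")))  # 'This'
```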

@carenas
Contributor Author

carenas commented Nov 16, 2021

> It prints ''. If $str is "\n\n\nThis" it prints 'This'. This is exactly how pcre2 works.

correct; the "empty match" is because the expression doesn't have the "s" modifier, so '.' can't match '\n'. At the first position where the lookbehind succeeds, the next character is '\n', so ".*" repeats zero times and returns an empty match, which I have to admit I didn't expect.

but that is not the issue discussed here, and indeed when the line delimiter is '\n', perl and PCRE are in perfect alignment; the divergence only occurs when using a line delimiter that doesn't include '\n', like '\r' or '\0'.

> PCRE2 has a "not empty" flag for matching, would it be possible to use that flag?

it might work for this specific case, but it will still fail when the '\n' is embedded in the middle of the subject, as shown in the test cases. In perl, a subject with '\n' in that position won't match, regardless of what the line delimiter is (-0 means NUL for perl):

$ printf "a\nb" | perl -0ne 'print length($1)+1,"\n" if /(a.b)/'
$ printf "a\nb" | pcre2grep -o -NNUL -a '(a.b)' | wc -c
       4
$ printf "a\0b" | perl -0ne 'print length($1)+1,"\n" if /(a.b)/'
$ printf "a\0b" | pcre2grep -o -NNUL -a '(a.b)' | wc -c
       0
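
Why perl prints nothing in both cases can be sketched with a small Python model. It assumes that `-0` only changes how the input is split into records (each record keeps its trailing NUL) and that '.' excludes '\n' and nothing else, which is also the default in Python's `re`. The function names are invented for this illustration; this is a model of the behaviour described in the thread, not perl's implementation.

```python
import re


def records(data, separator):
    """Split data into records, keeping each trailing separator."""
    if not separator:
        raise ValueError("separator must be non-empty")
    out = []
    start = 0
    while start < len(data):
        end = data.find(separator, start)
        if end == -1:
            out.append(data[start:])  # last record has no terminator
            break
        out.append(data[start:end + len(separator)])
        start = end + len(separator)
    return out


def any_record_matches(data, separator="\0"):
    # Python's default '.' matches every character except '\n'.
    return any(re.search(r"a.b", rec) for rec in records(data, separator))


# "a\nb" stays one record, but '.' refuses '\n', so there is no match.
print(any_record_matches("a\nb"))   # False

# "a\0b" is split into "a\0" and "b", so no single record contains a?b.
print(any_record_matches("a\0b"))   # False

# An ordinary character between 'a' and 'b' matches as expected.
print(any_record_matches("axb\0"))  # True
```

In this model the NUL never reaches the matcher as a candidate for '.', because it has already been consumed as the record terminator.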

@zherczeg
Collaborator

In other words, \n should not match dot, even if the newline type(s) do not include \n? And this behaviour is hidden behind an option flag. Well, it should have no effect on jit, since it won't compile anything when not needed, but it looks like it has a bigger impact on the interpreter when the dot is repeated. So from my side this is ok, but let's wait for @PhilipHazel 's opinion.
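
The two rules being compared can be modelled in a few lines of Python. This is a sketch under stated assumptions, not PCRE2 code: `NEWLINE_SETS`, `dot_class` and `matches` are hypothetical names, only three single-character conventions are modelled, and the `perl_dot` argument stands in for the compile option proposed in this PR.

```python
import re

# Line terminators for three single-character newline conventions.
# Illustrative table only; PCRE2 also supports multi-character and
# multi-alternative conventions that are not modelled here.
NEWLINE_SETS = {
    "LF": "\n",
    "CR": "\r",
    "NUL": "\0",
}


def dot_class(convention, perl_dot=False):
    """Return a negated character class that stands in for '.'.

    With perl_dot=True, '\n' is always excluded, whatever the newline
    convention says (the behaviour the proposed option would enable).
    """
    excluded = set(NEWLINE_SETS[convention])
    if perl_dot:
        excluded.add("\n")
    return "[^" + "".join("\\x%02x" % ord(c) for c in sorted(excluded)) + "]"


def matches(subject, convention, perl_dot=False):
    # Hard-codes the thread's pattern a.b, with '.' replaced by the class.
    pattern = "a" + dot_class(convention, perl_dot) + "b"
    return re.search(pattern, subject) is not None


print(matches("a\nb", "NUL"))                 # True: current PCRE2-style rule
print(matches("a\nb", "NUL", perl_dot=True))  # False: '\n' always excluded
print(matches("a\nb", "LF"))                  # False under either rule
```

With the LF convention both rules agree, which is why the divergence only shows up with the CR and NUL conventions.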

@PhilipHazel
Collaborator

I've not been paying much attention to this discussion (working on another project), but I've now had some thoughts. The Perl man page says this: "To simplify multi-line substitutions, the "." character never matches a newline unless you use the "/s" modifier". The PCRE2 documentation says "a dot in the pattern matches any one character in the subject string except (by default) a character that signifies the end of a line". So, what does Perl mean by "newline"? I found this documentation:

"Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF."

My reading of that is that, if running in an environment where, say, CR means "newline", a dot metacharacter would match the character 0x0A and the escape sequence \n would mean 0x0D (which dot would not match). I believe that this matching is the way PCRE2 currently behaves, except that, of course, there are modes where there are several alternative line endings.

In a PCRE2 pattern \n always means 0x0A of course. If Perl is behaving as I think the documentation implies, then \n should mean \0 when -0ne is set, which is why the first example above does not match. I guess I could test...hmm. In my Linux environment, it seems that \n always gives 0x0A, even when -0 or -0x0D is specified.

I am now confused. However, given that PCRE2 has a different approach to newlines than Perl (recognizing more than one type at once, for example), I think the current behaviour makes sense. The documentation could always be improved, of course, especially if we can figure out exactly how Perl behaves, in order to describe the differences.

@carenas
Contributor Author

carenas commented Nov 17, 2021

> "Perl uses \n to represent the "logical" newline, where what is logical may depend on the platform in use. In MacPerl, \n always means \015. On EBCDIC platforms, \n could be \025 or \045. In DOSish perls, \n usually means \012, but when accessing a file in "text" mode, perl uses the :crlf layer that translates it to (or from) \015\012, depending on whether you're reading or writing. Unix does the same thing on ttys in canonical mode. \015\012 is commonly referred to as CRLF."

This paragraph explains, IMHO clearly, why in perl '.' never matches '\n': '\n' is defined at compile time to match what would be expected of the text files fed to it.

It also explains why -0 doesn't have the same effect on it that the equivalent option in PCRE has, and as you pointed out leads to confusion.

> I am now confused. However, given that PCRE2 has a different approach to newlines than Perl (recognizing more than one type at once, for example), I think the current behaviour makes sense. The documentation could always be improved, of course, especially if we can figure out exactly how Perl behaves, in order to describe the differences.

Agree the current approach makes sense and matches the documentation; my concern again is about perl compatibility and the fact that with the current implementation there is just no way to keep the matches consistent between PCRE and perl because of that special place that perl has for '\n'.

I was hoping with this POC to show that the amount of code changes required to support it (at least on *NIX) isn't that big of a maintenance burden IMHO. Doing so would also probably give GNU grep, which has moved to PCRE2, the opportunity to finally remove the "-Pz is highly experimental and might not do what you expect" note from its documentation, thanks to the library support for PCRE2_NEWLINE_NUL.

@zherczeg
Collaborator

There is another way to implement this, and that does not need any matching changes: convert dot to [^\0\n] internally during parsing. These conversions can be done by repan ( https://github.com/zherczeg/repan ) easily.
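
The rewriting idea can be sketched as a small pattern preprocessor. The code below is a simplified illustration in Python, not how repan or PCRE2 work: `rewrite_dots` is a hypothetical helper, and it deliberately ignores literal-quoting sequences, inline option settings such as (?s), extended-mode comments and POSIX classes, all of which a real implementation would have to handle.

```python
import re

# The class suggested above, spelled so Python's re can parse it.
PERL_DOT = r"[^\x00\n]"


def rewrite_dots(pattern, replacement=PERL_DOT):
    """Replace each unescaped '.' that sits outside a character class."""
    out = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])  # copy an escape pair untouched
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
            out.append(ch)
            i += 1
            # A leading '^' and a ']' right after the opener belong to
            # the class and must not close it.
            if i < len(pattern) and pattern[i] == "^":
                out.append("^")
                i += 1
            if i < len(pattern) and pattern[i] == "]":
                out.append("]")
                i += 1
            continue
        elif ch == ".":
            out.append(replacement)
            i += 1
            continue
        out.append(ch)
        i += 1
    return "".join(out)


print(rewrite_dots("a.b"))    # a[^\x00\n]b
print(rewrite_dots(r"a\.b"))  # a\.b (escaped dot is left alone)
print(rewrite_dots("[.]x."))  # [.]x[^\x00\n]
```

Because the rewrite happens before compilation, the matcher itself needs no new code path, which is the appeal of this approach compared with a new match-time option.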

@PhilipHazel
Collaborator

I have just committed a couple of documentation updates (pcre2compat and pcre2syntax) that are intended to clearly explain PCRE2's behaviour. I don't think we should make any change.

Development

Successfully merging this pull request may close these issues.

perl compatible matching of '\n' with '.' when (*NUL)
3 participants