Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regex with a " in character class #28261

Open
mortenpi opened this issue Jul 24, 2018 · 12 comments
Open

Regex with a " in character class #28261

mortenpi opened this issue Jul 24, 2018 · 12 comments
Milestone

Comments

@mortenpi
Copy link
Contributor

match(r"[\\\"]", "\\") does not match on 0.7 while match(r"[\"\\]", "\\") is fine. Unless I am messing up with escape sequences here somehow, either should match as far as I can tell.

Tried on 0.7.0-beta2.83 and 0.7.0-beta2.0. It works fine on 0.6, so looks like a regression.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 24, 2018

not a bug, but the intended consequence of #22926

In regex, these two are equivalent (and non-matching), since regexes [\"] and ["] are identical:

julia> match(r"[\"]", "\\")

julia> match(r"[\\\"]", "\\")

On v0.7, what you want is the regex [\\"] (or, equivalently, [\\\"]):

julia> match(r"[\\\\\"]", "\\")
RegexMatch("\\")

But it looks like we're not printing the regex string correctly (it doesn't round-trip properly), which probably makes this even more confusing. I've got some of a fix for that locally and I'll continue working on cleaning that up and getting it into a PR.

@mortenpi
Copy link
Contributor Author

mortenpi commented Jul 24, 2018

On v0.7, what you want is the regex [\\"]

That's what I assumed the first one to be, since I got the impression that backslashes don't need escaping in @r_str (my initial regex was more like r"\\([\\\"])", and the initial \\ got me confused). The @r_str macro docstring is also outdated then if I understand correctly?

Construct a regex, such as r"^[a-z]*$", without interpolation and unescaping (except for quotation mark " which still has to be escaped).

It implies to me that the only escape sequence is \".


Edit: Having read through #22926 once more now -- OK, it is complicated, but it makes sense and there is no better way. And I see that @raw_str has it documented. But it might make sense to have the same comment in the @r_str docstring too (or a reference), since I wouldn't have thought about checking the @raw_str one for this. And maybe on the strings page too?

@JeffBezanson JeffBezanson added the domain:docs This change adds or pertains to documentation label Jul 25, 2018
@StefanKarpinski
Copy link
Sponsor Member

I think we need to rework a lot of the escaping behavior in custom string literals after that change. This means r"..." and s"..." in particular. This is fairly annoying and tedious to get right, unfortunately.

@wdebeaum
Copy link

#28175 seems related.

@StefanKarpinski
Copy link
Sponsor Member

I think we need to mark this as unstable in 1.0 because it needs to be fixed; that can be justified that it's a bugfix but it will likely break some code when we fix it.

@StefanKarpinski StefanKarpinski added the kind:bug Indicates an unexpected problem or unintended behavior label Jul 27, 2018
@StefanKarpinski StefanKarpinski added this to the 1.1 milestone Jul 27, 2018
@JeffBezanson
Copy link
Sponsor Member

I'm not sure how to fix this --- the escaping rules happen in the parser before we get to the custom string literal code, since otherwise we don't know when the string ends.

@StefanKarpinski
Copy link
Sponsor Member

Yeah, I'll have to look at it for a while and think; it may not be fixable.

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 27, 2018

I think it's just a printing issue, since it doesn't round-trip correctly. We can't to anything about the escaping rules, unless we want there to be a subset of regexes which cannot be expressed using the string macro (as was true in v0.6 and prior).

@StefanKarpinski
Copy link
Sponsor Member

StefanKarpinski commented Jul 27, 2018

If it's just a printing issue, that's even better since fixing it won't break people's code.

@mortenpi
Copy link
Contributor Author

As far as I can tell the issue here is that the escaping logic is a bit counter-intuitive? But it is working as designed (except for the printing?). And, as was discussed at length in #22926, there might not really be any better solution.

unless we want there to be a subset of regexes which cannot be expressed using the string macro (as was true in v0.6 and prior).

FWIW, one of the proposals in #22926 only excluded trailing backslashes, which would not be a problem for regular expressions (and latex), since they are not allowed anyway.

@wdebeaum
Copy link

I'm sorry, what's an example of a regex that couldn't be expressed using the string macro in 0.6? I understand some strings couldn't be expressed with the raw string macro, but strings and regexes aren't 1:1. PCRE treats \" the same as ", so e.g. Regex("a\\\"b") matches the same as Regex("a\"b").

@vtjnash
Copy link
Sponsor Member

vtjnash commented Jul 27, 2018

The sequence \" (Regex("\\\"")) could not previously be given. r"\"" would give " (Regex("\"")), while r"\\\"" would give \\" (Regex("\\\\\"")). You're right that there's probably enough redundancy in regex that this may be immaterial (since 2n and 2n+1 \ before " should give the same matches). If we want, we could precisely restore the old behavior by doubling the number of \ that immediately precede any " character or the end of the string.

@JeffBezanson JeffBezanson modified the milestones: 1.1, 1.2 Dec 6, 2018
@KristofferC KristofferC modified the milestones: 1.2, 1.x Apr 17, 2019
@vtjnash vtjnash removed kind:bug Indicates an unexpected problem or unintended behavior domain:docs This change adds or pertains to documentation labels Feb 4, 2021
@vtjnash vtjnash modified the milestones: 1.x, 2.0 Feb 4, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants