Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SubstitutionString can't contain escape sequences \n, \t, etc. during replace #27125

Closed
digital-carver opened this issue May 16, 2018 · 5 comments · Fixed by #30513
Closed

SubstitutionString can't contain escape sequences \n, \t, etc. during replace #27125

digital-carver opened this issue May 16, 2018 · 5 comments · Fixed by #30513
Labels
domain:strings "Strings!" Hacktoberfest Good for Hacktoberfest participants

Comments

@digital-carver
Copy link
Contributor

While trying to create documentation for SubstitutionString (#26497), I found that replace using SubstitutionStrings doesn't work if the substition string has a newline or other escape sequence in it.

julia> msg = "#Hello# from Julia";

julia> re = r"#(.+)# from (?<name>\w+)";

julia> subst = s"FROM: \g<name>\n MESSAGE: \1"
s"FROM: \\g<name>\\n MESSAGE: \\1"

julia> replace(msg, re => subst)
ERROR: Bad replacement string: FROM: \g<name>\n MESSAGE: \1
Stacktrace:
 [1] error at .\error.jl:33 [inlined]
 [2] replace_err at .\regex.jl:241 [inlined]
 [3] _replace(::Base.GenericIOBuffer{Array{UInt8,1}}, ::Base.SubstitutionString{String}, ::String, ::UnitRange{Int64}, ::Regex) at .\regex.jl:302
 [4] #replace#350(::Int64, ::Function, ::String, ::Pair{Regex,Base.SubstitutionString{String}}) at .\strings\util.jl:427
 [5] replace(::String, ::Pair{Regex,Base.SubstitutionString{String}}) at .\strings\util.jl:415
 [6] top-level scope

julia> subst = s"FROM: \g<name>; MESSAGE: \1"
s"FROM: \\g<name>; MESSAGE: \\1"

julia> replace(msg, re => subst) #works fine when \n is removed
"FROM: Julia; MESSAGE: Hello"

Looking at the code, when the _replace function (in regex.jl) sees a backslash, it expects another backslash, a digit or a 'g', and throws an error with replace_err if it doesn't see one of those. Another consequence of this is that the replacement string can't contain things like Unicode codepoint sequences (\u0B85).

@nalimilan
Copy link
Member

The problem is that \n isn't interpreted as in other strings, it's just stored as is:

julia> s"\n".string == "\\n"
true

The same thing applies to regexes, but it's less visible since PCRE interprets escapes like \n. Not sure what we should do.

@nalimilan nalimilan added the domain:strings "Strings!" label May 19, 2018
@digital-carver
Copy link
Contributor Author

How about calling unescape_string on the string first, with a special additional argument to keep capture-related escape sequences as it is (i.e. the \N and \g<> sequences that _replace handles itself). Something like:

--- a/base/strings/io.jl
+++ b/base/strings/io.jl
@@ -316,19 +316,21 @@ end
 General unescaping of traditional C and Unicode escape sequences. Reverse of
 [`escape_string`](@ref).
 """
-unescape_string(s::AbstractString) = sprint(unescape_string, s, sizehint=lastindex(s))
+unescape_string(s::AbstractString, keep_esc::AbstractArray{<:AbstractChar}=Char[]) = sprint(unescape_string, s, keep_esc; sizehint=lastindex(s))
 
 """
     unescape_string(io, str::AbstractString) -> Nothing
 
 Unescapes sequences and prints result to `io`. See also [`escape_string`](@ref).
 """
-function unescape_string(io, s::AbstractString)
+function unescape_string(io, s::AbstractString, keep_esc::AbstractArray{<:AbstractChar}=Char[])
     a = Iterators.Stateful(s)
     for c in a
         if !isempty(a) && c == '\\'
             c = popfirst!(a)
-            if c == 'x' || c == 'u' || c == 'U'
+            if c in keep_esc
+                print(io, '\\', c)
+            elseif c == 'x' || c == 'u' || c == 'U'
                 n = k = 0
                 m = c == 'x' ? 2 :
                     c == 'u' ? 4 : 8
--- a/base/regex.jl
+++ b/base/regex.jl
@@ -254,7 +254,8 @@ function _replace(io, repl_s::SubstitutionString, str, r, re)
     GROUP_CHAR = 'g'
     LBRACKET = '<'
     RBRACKET = '>'
-    repl = repl_s.string
+    keep_esc = [SUB_CHAR, GROUP_CHAR, collect('0':'9')...]
+    repl = unescape_string(repl_s.string, keep_esc)
     i = firstindex(repl)
     e = lastindex(repl)
     while i <= e

There's a change in behaviour due to this though: invalid single letter sequences like \m, \j, etc. are silently replaced by their corresponding character (m, j, ...) by unescape_string, whereas currently _replace emits the "Bad replacement string" error for those. Perhaps unescape_string needs another optional argument to enable a strict mode, or... is there a reason unescape_string is currently not strict about invalid C sequences, while it errors out on invalid Unicode sequences?

@StefanKarpinski
Copy link
Sponsor Member

Spot on: a custom unescaping pass is what is needed to make this work correctly.

@digital-carver
Copy link
Contributor Author

Note for future: even after other escape sequences are allowed in a SubstitutionString, octal sequences still can't be used since \73 refers to the 73rd capture group and not to ';'. Should be documented.

Related, if the user wants to use the second capture group and then have "45" in the output string, there doesn't currently seem to be a way to do that: \245 will try to get the 245th capture group. Perl solves this by allowing \g{2}45 in the substitution string - allowing the \g syntax to be used with capture group numbers also, in addition to with names. Similarly, Julia should probably allow a \g<2>45 syntax for referring to unnamed capture groups by ordinal number.

To avoid ambiguity then, named capture groups should disallow numerical names, like r"(?g<2>\W+)" (which Perl also does).

@vtjnash vtjnash added this to the 1.x milestone Oct 10, 2018
@vtjnash vtjnash added the Hacktoberfest Good for Hacktoberfest participants label Oct 10, 2018
@pocc
Copy link

pocc commented Mar 8, 2019

Noting that a way to get around this for \n is to manually entering a return like so:

julia> print(replace("beautiful_string", r"_"  => s"
       "))
beautiful
string

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:strings "Strings!" Hacktoberfest Good for Hacktoberfest participants
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants