Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

match_all function to match all occurrences of a regex pattern in a string #3525

Closed
wants to merge 2 commits into from

Conversation

nutsiepully
Copy link
Contributor

These commits add the ability to match all occurrences of a regex pattern in a given string. (Similar to the perl /g option or the ruby scan)

So, now one can search for all matches as follows.

julia> str = "The dark side of the moon"
"The dark side of the moon"

julia> match_all(r"(\w+)(\s*)", str)
6-element RegexMatch Array:
 RegexMatch("The ", 1="The", 2=" ")
 RegexMatch("dark ", 1="dark", 2=" ")
 ⋮
 RegexMatch("the ", 1="the", 2=" ")
 RegexMatch("moon", 1="moon", 2="")

#Test with Unicode

julia> s = "x \u2200 x \u2203 y"
"x ∀ x ∃ y"

julia> match_all(r".\s", s)
4-element RegexMatch Array:
 RegexMatch("x ")
 RegexMatch("∀ ")
 RegexMatch("x ")
 RegexMatch("∃ ")
#Test with empty strings matches

r = r"a?b?"
match_all(r, "asbds")

julia> match_all(r, "asbd")
5-element RegexMatch Array:
 RegexMatch("a")
 RegexMatch("")
 RegexMatch("b")
 RegexMatch("")
 RegexMatch("")

This is an initial pull request and there are still a couple of things pending -

  1. Support for CRLF
  2. Test code to be put in

@JeffBezanson
Copy link
Sponsor Member

See also #1491

@StefanKarpinski
Copy link
Sponsor Member

Thanks for pushing this forward. The naming convention in Base is to call this matchall. A couple questions. Why isn't this just a matter of doing:

function matchall(re::Regex, str::ByteString)
  matches = RegexMatch[]
  for m in eachmatch(re, str)
    push!(matches, m)
  end
  return matches
end

If there are cases where this doesn't do the same thing, then we need to fix RegexMatchIterator so that it does work and we should add tests for such cases.

@JeffBezanson
Copy link
Sponsor Member

Or [eachmatch(re,str)...].

StefanKarpinski added a commit that referenced this pull request Jun 24, 2013
This also defined eltype for RegexMatchIterators, making it possible
to write collect(eachmatch(re,str)) for matchall(re,str). See #3525.
@StefanKarpinski
Copy link
Sponsor Member

collect(eachmatch(re,str)) now works: 0050ed1.

@nutsiepully
Copy link
Contributor Author

@StefanKarpinski @JeffBezanson - Thanks for the feedback

  1. I wasn't aware of the convention. Will change it to matchall. Out of curiosity, any reason this convention was chosen over match_all or matchAll
  2. yes, there are 3 cases that fail with simple iteration
    • Infinite loop with empty string matches (shown in 3rd test case above)
    • Not recognising UTF strings. simple matchall doesn't skip over the entire UTF character (2nd test case)
    • Not recognising a CRLF sequence. The code should skip over the entire CRLF character. My patch doesn't properly support this yet. The next commit will.

I have most of the test cases, will commit them as well.

Yes we should fix RegexMatchIterator. Idiot that I am, I didn't look at the the iterator properly and went on write the entire function from scratch. I'll try and incorporate the iterator in the commit as well.

Tasks for now

  • Put in CRLF fix
  • Put in test code
  • Restructure the code so that RegexMatchIterator and matchall can both use the same logic.

Does this look ok? Please let me know if any corrections are neded.

@pao
Copy link
Member

pao commented Jun 25, 2013

Out of curiosity, any reason this convention was chosen over match_all or matchAll

See #1539

@StefanKarpinski
Copy link
Sponsor Member

Great. Fixing this behavior for RegexMatchIterator is definitely the way to go here. Then matchall(re,str) can uniformly just be defined as [eachmatch(re,str)...]. You can test that all the hard cases are fixed by using matchall.

@nutsiepully
Copy link
Contributor Author

Sounds good to me. Will fix it and update the PR.
On Jun 25, 2013 7:55 PM, "Stefan Karpinski" notifications@github.com
wrote:

Great. Fixing this behavior for RegexMatchIterator is definitely the way
to go here. Then matchall(re,str) can uniformly just be defined as
[eachmatch(re,str)...]. You can test that all the hard cases are fixed by
using matchall.


Reply to this email directly or view it on GitHubhttps://github.com//pull/3525#issuecomment-19979586
.

@ViralBShah
Copy link
Member

Is the RegexMatchIterator going to be updated in a new PR, or is this no longer needed?

@nutsiepully nutsiepully mentioned this pull request Jun 30, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants