Fix infixed commas in tokenizer, re Issue #326. Need to benchmark on empirical data, to make sure this doesn't break other cases.
honnibal committed Apr 14, 2016
1 parent 0f957dd commit 6f82065
Showing 2 changed files with 8 additions and 0 deletions.
1 change: 1 addition & 0 deletions lang_data/en/infix.txt
@@ -3,3 +3,4 @@
(?<=[a-zA-Z])-(?=[a-zA-z])
(?<=[a-zA-Z])--(?=[a-zA-z])
(?<=[0-9])-(?=[0-9])
(?<=[A-Za-z]),(?=[A-Za-z])
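
The added rule relies on a lookbehind and a lookahead, so a comma only counts as an infix when it sits directly between two letters; a comma between digits, as in 1,000, is not matched by it. A minimal sketch using Python's re module, checking the pattern in isolation rather than through spaCy's tokenizer:

import re

# The pattern added in this commit: a comma flanked by a letter on each side.
infix_comma = re.compile(r"(?<=[A-Za-z]),(?=[A-Za-z])")

assert infix_comma.search(u"Hello,world") is not None  # comma between letters: matched
assert infix_comma.search(u"1,000") is None             # comma between digits: not matched
assert infix_comma.search(u"Hello, world") is None      # comma followed by a space: not matched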
7 changes: 7 additions & 0 deletions spacy/tests/tokenizer/test_infix.py
@@ -47,3 +47,10 @@ def test_double_hyphen(en_tokenizer):
assert tokens[8].text == u'--'
assert tokens[9].text == u'people'


def test_infix_comma(en_tokenizer):
# Re issue #326
tokens = en_tokenizer(u'Hello,world')
assert tokens[0].text == u'Hello'
assert tokens[1].text == u','
assert tokens[2].text == u'world'
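
Roughly, each line of lang_data/en/infix.txt is treated as a regular expression, and the tokenizer splits a token wherever one of them matches inside it, with the matched infix becoming a token of its own. The sketch below is a simplified standalone model of that splitting step, not spaCy's actual implementation:

import re

# Patterns from lang_data/en/infix.txt, including the comma rule added here.
INFIX_PATTERNS = [
    r"(?<=[a-zA-Z])-(?=[a-zA-z])",
    r"(?<=[a-zA-Z])--(?=[a-zA-z])",
    r"(?<=[0-9])-(?=[0-9])",
    r"(?<=[A-Za-z]),(?=[A-Za-z])",
]
INFIX_RE = re.compile("|".join("(?:%s)" % p for p in INFIX_PATTERNS))

def split_infixes(string):
    # Toy model: emit the text before each infix match, then the infix itself.
    pieces = []
    start = 0
    for match in INFIX_RE.finditer(string):
        if match.start() > start:
            pieces.append(string[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(string):
        pieces.append(string[start:])
    return pieces

print(split_infixes(u"Hello,world"))  # ['Hello', ',', 'world']
print(split_infixes(u"1,000"))        # ['1,000'] -- comma between digits stays put

Running the new test with pytest exercises the same 'Hello,world' case against the real tokenizer.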
