
Tokenization issues #326

Closed
mfelice opened this issue Apr 9, 2016 · 6 comments
Labels: enhancement (Feature requests and improvements) · lang / en (English language data and models)

Comments

mfelice commented Apr 9, 2016

Tokenization seems incorrect in a number of cases:

  1. Tokens incorrectly include punctuation at the beginning or in the middle. Punctuation at the end seems to be handled correctly, though. E.g.

Hello,world is currently kept as one token but should be Hello , world
.,;:hello:!.world is currently kept as one token but should be . , ; : hello : ! . world

  2. The dot seems to cause particular problems at the beginning of a token:

.Hello world. gives .Hello world . (but should be . Hello world .).

I suppose dots are preserved as part of a token in case they make up an acronym, but they should not be allowed at the beginning. Basically, no punctuation should be allowed at the beginning, middle or end, except hyphens/dashes/en-dashes in the middle for compounds (as pointed out in #302) and dots for acronyms (in the middle or at the end).

  3. Related to the above and following up from #325, there should be some disambiguation to determine whether a dot is a full stop or part of an acronym/abbreviation when it appears at the end. Maybe check if the token has some other dot already? E.g.

a.m. > a.m.
CIA. > CIA .
K.G.B. > K.G.B.
.A. > . A .
.AB. > . AB .
.AB.C > . AB . C
.AB.C. > . AB.C.

Something like E.ON (the energy supplier) would cause trouble, but it would be a rare exception (in fact, it should be E·ON).

  4. Related to 2) and #302, you should allow any number of hyphens/dashes/en-dashes in tokens.

next-of-kin is currently next - of-kin
three-year-old is currently three - year-old
jack-in-the-box is currently jack - in-the-box

But they should be one word. The third case is particularly interesting, as it generates a token with more than one hyphen (in-the-box). Clearly, the tokenizer seems to split only on the first hyphen.

  5. The word cannot is currently tokenized as can not. Strict grammarians would say there is a difference between these two forms, so cannot should not be tokenized as can not. I understand spaCy might not want to make this distinction, in which case I wonder how I can force the tokenizer to keep cannot as one word without modifying any files. Ideally, I'd like to add this exception dynamically while/after loading spaCy.

Thank you.

honnibal added a commit that referenced this issue Apr 14, 2016
…n empirical data, to make sure this doesn't break other cases.
@honnibal (Member)

Thanks, am thinking these through.

Currently the tokenizer is fairly conservative in segmentation --- it tends to under-segment rather than over-segment. I think we should instead switch to over-segmenting more often, and then use the .merge() function to merge numeric entities, dates, emails, URLs etc. back into single tokens.

This sort of change takes some experimentation, though. It's at least partly an empirical question, because it's not easy to intuit what cases are common. I'll keep this ticket open and update when I've had a chance to experiment.
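
A minimal sketch of the merge-back approach described above, assuming a recent spaCy release that exposes the retokenizer (in the 1.x versions current at the time, the equivalent call was Span.merge()); the model name and token indices are illustrative only:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model name
doc = nlp("The meeting is on 12 / 3 / 2016 at noon.")

# Merge the over-segmented date back into a single token. The slice assumes
# the date comes out as tokens 4..8 ("12 / 3 / 2016"); check the actual
# segmentation before hard-coding indices.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[4:9])

print([t.text for t in doc])
# ['The', 'meeting', 'is', 'on', '12 / 3 / 2016', 'at', 'noon', '.']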

@henningko

Are you aware of any quick fix for (3)?

mastasky commented Aug 5, 2016

@honnibal : I found another tokenization issue yesterday that was doing my head in. Possibly it's already mentioned in the above.

Turn on the tv. = turn on the tv . (correct)
Turn on the TV. = turn on the TV. (the trailing dot is made part of POBJ)

honnibal added the enhancement and performance labels on Sep 21, 2016
ines added the lang / en label on Jan 9, 2017
ines (Member) commented Jan 9, 2017

This issue should be fixed with the recent updates to the language data.

Re 1./2./3. Hello,world and similar tokens, uppercase abbreviations (K.G.B., E.ON, TV) and common exceptions (a.m.) are now handled correctly. When it comes to unexpected input like .,;:hello:!.world or even .AB.C., we want to stay conservative in segmenting the punctuation.

Re 4. The inconsistency should now be fixed – unless an exception is added, all infix hyphens are split. By default, all tokens are handled this way. If you want to add custom tokenization rules, for example to keep next-of-kin as one token, you can whitelist specific words, or override the default rules with your own regular expressions.
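
For the second option, a rough sketch of overriding the infix rules so hyphenated compounds stay intact, assuming a recent spaCy release; filtering the defaults on the hyphen character is only a heuristic, so inspect nlp.Defaults.infixes for your version before relying on it:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Drop every default infix pattern that mentions a hyphen, then rebuild the
# infix matcher. This is a blunt filter -- unrelated patterns that happen to
# contain "-" are removed too, so review the resulting list.
custom_infixes = [p for p in nlp.Defaults.infixes if "-" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(custom_infixes).finditer

print([t.text for t in nlp("Her next-of-kin was notified.")])
# Expected: ['Her', 'next-of-kin', 'was', 'notified', '.']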

Re 5. To stay consistent with the parser training data, spaCy follows the Penn Treebank tokenization scheme, which splits cannot into two tokens. This behaviour can be modified via the tokenizer exceptions, though.
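
For Re 5, a sketch of adding the exception at runtime, as the original question asked; this assumes a spaCy version where Tokenizer.add_special_case is available and uses an illustrative model name:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Override the default exception that splits "cannot" into "can" + "not",
# keeping the word as a single token. No files need to be modified.
nlp.tokenizer.add_special_case("cannot", [{ORTH: "cannot"}])

print([t.text for t in nlp("I cannot agree.")])
# Expected: ['I', 'cannot', 'agree', '.']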

ines closed this as completed Jan 9, 2017
@nelson-liu

Sorry to bump this thread, but it seems like the special cases for English (e.g. Mr.) do not work properly in a lowercase setting.

In [1]: import spacy

In [2]: en_nlp = spacy.load('en')

In [3]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.")]
Out[3]: ['Mr.', 'Smith', 'says', 'hello', '.']

In [4]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.".lower())]
Out[4]: ['mr', '.', 'smith', 'says', 'hello', '.']

I am aware that I could just add some exceptions, but I don't think I could catch them all; I was wondering if there's any quick fix on your side.
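
A per-case workaround along those lines (not a general fix; the abbreviation list below is only an illustration and a recent spaCy version is assumed):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Register lowercase variants of a few abbreviations as extra special cases,
# since the built-in exceptions reportedly only match the capitalised forms.
for abbrev in ["mr.", "mrs.", "dr.", "prof."]:
    nlp.tokenizer.add_special_case(abbrev, [{ORTH: abbrev}])

print([t.text for t in nlp.tokenizer("mr. smith says hello.")])
# Expected: ['mr.', 'smith', 'says', 'hello', '.']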

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018