
Tokenization issues #326

Closed
mfelice opened this issue Apr 9, 2016 · 6 comments
Labels: enhancement (Feature requests and improvements) · lang / en (English language data and models)

Comments

mfelice commented Apr 9, 2016

Tokenization seems incorrect in a number of cases:

  1. Tokens incorrectly include punctuation at the beginning or in the middle. Punctuation at the end seems to be handled correctly, though. E.g.

Hello,world is currently kept as one token but should be Hello , world
.,;:hello:!.world is currently kept as one token but should be . , ; : hello : ! . world

  2. The dot seems to cause particular problems at the beginning of a token:

.Hello world. gives .Hello world . (but should be . Hello world .).

I suppose dots are preserved as part of a token in case they make up an acronym, but they should not be allowed at the beginning. Basically, no punctuation should be allowed at the beginning, middle or end, except hyphens/dashes/en-dashes in the middle for compounds (as pointed out in #302) and dots for acronyms (in the middle or at the end).

  3. Related to the above and following up from #325, there should be some disambiguation to determine whether a dot is a full stop or part of an acronym/abbreviation when it appears at the end. Maybe check if the token has some other dot already? E.g.

a.m. > a.m.
CIA. > CIA .
K.G.B. > K.G.B.
.A. > . A .
.AB. > . AB .
.AB.C > . AB . C
.AB.C. > . AB.C.

Something like E.ON (the energy supplier) would cause trouble, but it would be a rare exception (in fact, it should be E·ON).

  4. Related to 2) and #302, you should allow any number of hyphens/dashes/en-dashes in tokens.

next-of-kin is currently next - of-kin
three-year-old is currently three - year-old
jack-in-the-box is currently jack - in-the-box

But they should be one word. The third case is particularly interesting, as it generates a token with more than one hyphen (in-the-box). Clearly, the tokenizer seems to split only on the first hyphen.

  5. The word cannot is currently tokenized as can not. Strict grammarians would say there is a difference between these two forms, so cannot should not be tokenized as can not. I understand spaCy might not want to make this distinction, in which case I wonder how I can force the tokenizer to keep cannot as one word without modifying any files. Ideally, I'd like to add this exception dynamically while/after loading spaCy.

Thank you.

honnibal added a commit that referenced this issue Apr 14, 2016
…n empirical data, to make sure this doesn't break other cases.
@honnibal (Member)

Thanks, am thinking these through.

Currently the tokenizer is fairly conservative in segmentation --- it tends to under-segment rather than over-segment. I think we should instead switch to over-segmenting more often, and then use the .merge() function to merge numeric entities, dates, emails, URLs etc. back into single tokens.

This sort of change takes some experimentation, though. It's at least partly an empirical question, because it's not easy to intuit what cases are common. I'll keep this ticket open and update when I've had a chance to experiment.
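
A minimal sketch of the merge-back approach described above, assuming a recent spaCy release that exposes the retokenizer (in the 1.x versions current at the time, the equivalent call was Span.merge()); the model name and token indices are illustrative only:

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed model name
doc = nlp("The meeting is on 12 / 3 / 2016 at noon.")

# Merge the over-segmented date back into a single token. The slice assumes
# the date comes out as tokens 4..8 ("12 / 3 / 2016"); check the actual
# segmentation before hard-coding indices.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[4:9])

print([t.text for t in doc])
# ['The', 'meeting', 'is', 'on', '12 / 3 / 2016', 'at', 'noon', '.']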

@henningko

Are you aware of any quick fix for (3)?

mastasky commented Aug 5, 2016

@honnibal : I found another tokenization issue yesterday that was doing my head in. Possibly it's already mentioned in the above.

Turn on the tv. = turn on the tv . (correct)
Turn on the TV. = turn on the TV. (the trailing dot is made part of POBJ)

honnibal added the enhancement and performance labels on Sep 21, 2016
ines added the lang / en label on Jan 9, 2017
ines (Member) commented Jan 9, 2017

This issue should be fixed with the recent updates to the language data.

Re 1./2./3. Hello,world and similar tokens, uppercase abbreviations (K.G.B., E.ON, TV) and common exceptions (a.m.) are now handled correctly. When it comes to unexpected input like .,;:hello:!.world or even .AB.C., we want to stay conservative in segmenting the punctuation.

Re 4. The inconsistency should now be fixed – unless an exception is added, all infix hyphens are split. By default, all tokens are handled this way. If you want to add custom tokenization rules, for example to keep next-of-kin as one token, you can whitelist specific words, or override the default rules with your own regular expressions.
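
For the second option, a rough sketch of overriding the infix rules so hyphenated compounds stay intact, assuming a recent spaCy release; filtering the defaults on the hyphen character is only a heuristic, so inspect nlp.Defaults.infixes for your version before relying on it:

import spacy
from spacy.util import compile_infix_regex

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Drop every default infix pattern that mentions a hyphen, then rebuild the
# infix matcher. This is a blunt filter -- unrelated patterns that happen to
# contain "-" are removed too, so review the resulting list.
custom_infixes = [p for p in nlp.Defaults.infixes if "-" not in p]
nlp.tokenizer.infix_finditer = compile_infix_regex(custom_infixes).finditer

print([t.text for t in nlp("Her next-of-kin was notified.")])
# Expected: ['Her', 'next-of-kin', 'was', 'notified', '.']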

Re 5. To stay consistent with the parser training data, spaCy follows the Penn Treebank tokenization scheme, which splits cannot into two tokens. This behaviour can be modified via the tokenizer exceptions, though.
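
For Re 5, a sketch of adding the exception at runtime, as the original question asked; this assumes a spaCy version where Tokenizer.add_special_case is available and uses an illustrative model name:

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Override the default exception that splits "cannot" into "can" + "not",
# keeping the word as a single token. No files need to be modified.
nlp.tokenizer.add_special_case("cannot", [{ORTH: "cannot"}])

print([t.text for t in nlp("I cannot agree.")])
# Expected: ['I', 'cannot', 'agree', '.']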

ines closed this as completed Jan 9, 2017
@nelson-liu

Sorry to bump this thread, but it seems like the special cases for English (e.g. Mr.) do not work properly in a lowercase setting.

In [1]: import spacy

In [2]: en_nlp = spacy.load('en')

In [3]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.")]
Out[3]: ['Mr.', 'Smith', 'says', 'hello', '.']

In [4]: [str(token) for token in en_nlp.tokenizer("Mr. Smith says hello.".lower())]
Out[4]: ['mr', '.', 'smith', 'says', 'hello', '.']

I am aware that I could just add some exceptions, but I don't think I could catch them all; I was wondering if there's any quick fix on your side.
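
A per-case workaround along those lines (not a general fix; the abbreviation list below is only an illustration and a recent spaCy version is assumed):

import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_sm")  # assumed model name

# Register lowercase variants of a few abbreviations as extra special cases,
# since the built-in exceptions reportedly only match the capitalised forms.
for abbrev in ["mr.", "mrs.", "dr.", "prof."]:
    nlp.tokenizer.add_special_case(abbrev, [{ORTH: abbrev}])

print([t.text for t in nlp.tokenizer("mr. smith says hello.")])
# Expected: ['mr.', 'smith', 'says', 'hello', '.']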

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018