German: Quotation marks not correctly tokenized #596

Closed

jgontrum opened this issue Nov 1, 2016 · 13 comments

Comments


jgontrum commented Nov 1, 2016

In some cases, a quotation mark is not separated from the following token.

Example

import spacy
German = spacy.load('de')
analysis = list(German(u'"Ich mag keine Anführungszeichen."'))
print(analysis[0])

=> "Ich

Your Environment

  • Operating System: OSX 10.11.6
  • Python Version Used: 2.7.12
  • spaCy Version Used: 1.1.2
honnibal (Member) commented Nov 1, 2016

Thanks!

A work-around until this is fixed:

import spacy.de
import spacy

# Register the straight double quote as an extra tokenizer prefix
# before loading the pipeline.
spacy.de.German.Defaults.prefixes += (u'"',)

nlp = spacy.load('de')

I haven't tested this yet, but it should work. Basically, spacy.de.German.Defaults.prefixes should hold a tuple of regex-escaped strings. Those strings are then joined into a regular expression.

The source data is in spacy/de/language_data.py, in the TOKENIZER_PREFIXES variable. If you make more improvements, a pull request would be awesome :)
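
To make the mechanism concrete, here is a minimal sketch of how a prefix tuple can be joined into a single regular expression and used to peel characters off the front of a token. The names, entries, and alternation pattern here are assumptions for illustration only, not spaCy's actual tokenizer code:

import re

# Minimal sketch, not spaCy's implementation: the prefix entries
# (already regex-escaped strings) are joined into one alternation.
prefixes = (u'\\(', u'\\[', u'"', u'#')
prefix_re = re.compile(u'^(?:%s)' % u'|'.join(prefixes))

def split_prefix(token):
    # Peel one prefix off the front of the token, if any.
    match = prefix_re.match(token)
    if match:
        return token[:match.end()], token[match.end():]
    return None, token

print(split_prefix(u'"Ich'))  # => (u'"', u'Ich')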

jgontrum (Author) commented Nov 2, 2016

Thank you for your quick reply!

Adding the quotation mark to the prefixes solved my problem.

However, I still believe there is something wrong with the TOKENIZER_PREFIXES, since the quotation mark is already on this list, so it should work by default.

Greetings!

honnibal (Member) commented Nov 2, 2016

Hmm, I assumed it was a different quotation mark that just looked similar. Is it really the same character? If so, then yes, something's wrong.

jgontrum (Author) commented Nov 2, 2016

It is. I also tested it with the octothorpe ('#'), which is definitely on the list, and "#Ich" is likewise tokenized as one token.

honnibal (Member) commented Nov 2, 2016

Hm! Think I see the problem.

honnibal added a commit that referenced this issue Nov 2, 2016
…up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
honnibal closed this as completed Nov 2, 2016
jgontrum (Author) commented Nov 3, 2016

Argh, Python 2 and its encoding issues... Thanks for fixing it so quickly!
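
For context, here is a guess at the class of bug the commit message hints at: under Python 2, a regex built from byte strings can silently fail to match non-ASCII characters in unicode text. This is an illustration of the failure mode, not the actual spaCy code:

# -*- coding: utf-8 -*-
import re

# Under Python 2, a literal '„' in source is a multi-byte UTF-8 byte
# string; compiled as a byte pattern, it never matches the single
# unicode code point in the input text.
byte_pattern = re.compile('„')       # bytes '\xe2\x80\x9e'
print(byte_pattern.match(u'„Ich'))   # None: bytes vs. unicode mismatch

# Building the pattern from unicode strings avoids the mismatch.
uni_pattern = re.compile(u'„')
print(uni_pattern.match(u'„Ich'))    # a match object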

schlichtanders commented:

I ran into very similar issues with the parentheses and brackets (, [, and {.

import spacy

nlp = spacy.load("de")
# Note below that '(Unsere' comes out as a single token.
for w in nlp(u'Die Ausstellung (Unsere heimischen Kastanienarten) war wirklich schön.'):
    print(w, w.tag_)

gives:

Die ART
Ausstellung NN
(Unsere ADJA
heimischen ADJA
Kastanienarten NN
) $.
war VAFIN
wirklich ADJD
schön ADJD
. $.

whereas if I add spacy.de.German.Defaults.prefixes += ('(',) I get the correct tokenization:

Die ART
Ausstellung NN
( $(
Unsere PPOSAT
heimischen ADJA
Kastanienarten NN
) $(
war VAFIN
wirklich ADJD
schön ADJD
. $.

Looking at the default prefixes and where they come from (https://github.com/explosion/spaCy/blob/master/spacy/language_data/punctuation.py#L39), it seems like the escaping of the parentheses (and probably more?) causes this problem.
I suppose they were escaped for use within Python's re module, but maybe somewhere else they are used as raw text, which would lead to these problems?
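
A quick sketch of the mismatch he suspects, assuming the escaped entries end up compared as raw text somewhere (this is an illustration, not spaCy's internals):

import re

entry = re.escape('(')              # '\\(' -- the escaped form stored in punctuation.py
print('(Unsere'.startswith(entry))  # False: the escaped form never matches raw text
print('(Unsere'.startswith('('))    # True
print(re.match(entry, '(Unsere'))   # a match object: works when used as a regex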

ines (Member) commented Jan 18, 2017

@schlichtanders Thanks for the report and your analysis – this makes a lot of sense. I'll add a test and take care of it!

schlichtanders commented:

Thanks for the immediate reaction. I am looking forward to it!

ines (Member) commented Jan 18, 2017

@schlichtanders Hmm, so I haven't managed to reproduce this error. Tested it with both the German model and just the tokenizer, and it's tokenized correctly:

[u'Die', u'Ausstellung', u'(', u'Unsere', u'heimischen', u'Kastanienarten', u')', u'war', u'wirklich', u'sch\xf6n', u'.']
[u'Die', u'Ausstellung', u'[', u'Unsere', u'heimischen', u'Kastanienarten', u']', u'war', u'wirklich', u'sch\xf6n', u'.']
[u'Die', u'Ausstellung', u'{', u'Unsere', u'heimischen', u'Kastanienarten', u'}', u'war', u'wirklich', u'sch\xf6n', u'.']

Which version of spaCy are you using?

Btw, this doesn't actually seem like a problem in your case, but just so you know:
By adding a ( to the prefix defaults, you might accidentally be triggering a temporary hack that automatically escapes all prefixes. In older versions, prefixes were re-escaped by the tokenizer, while the suffixes and infixes (which contained regular expressions) weren't and had to be escaped in the tokenizer exceptions. This behaviour was a little confusing and inconsistent, so it was changed to require all escaping in the tokenizer exceptions. In order to still support old unescaped prefix data, we're currently using a little hack that checks whether the prefixes contain an unescaped (. If so, it assumes the data is old and escapes all prefixes.
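
A hypothetical reconstruction of that heuristic, just to make the described behaviour concrete (the function name and details are assumptions, not spaCy's actual source):

import re

def compile_prefix_regex(prefixes):
    # If a bare '(' appears among the entries, assume old-style
    # unescaped data and escape every entry before building the pattern.
    if '(' in prefixes:
        prefixes = [re.escape(p) for p in prefixes]
    return re.compile(u'^(?:%s)' % u'|'.join(prefixes))

print(compile_prefix_regex(['(', '[', '"']).pattern)  # all entries get escaped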

schlichtanders commented:

I am very sorry. I now checked my version, and it is indeed spaCy 1.2. On PyPI there is already spaCy 1.6, which I could install (after closing my running spaCy interactive sessions and switching to the Windows command line instead of PowerShell, though this might be a local issue on my PC), and it does indeed work out of the box now.

Thanks for coming back to me so quickly!

ines (Member) commented Jan 18, 2017

No worries – glad to hear it's working! 👍

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018