German: Quotation marks not correctly tokenized #596

Closed

jgontrum opened this issue Nov 1, 2016 · 13 comments

Comments


jgontrum commented Nov 1, 2016

In some cases, a quotation mark is not separated from the following token.

Example

import spacy
German = spacy.load('de')
analysis = list(German(u'"Ich mag keine Anführungszeichen."'))
print(analysis[0])

=> "Ich

Your Environment

  • Operating System: OSX 10.11.6
  • Python Version Used: 2.7.12
  • spaCy Version Used: 1.1.2
honnibal (Member) commented Nov 1, 2016

Thanks!

A work-around until this is fixed:

import spacy.de
import spacy

# Register the straight double quote as an extra tokenizer prefix
# before loading the pipeline.
spacy.de.German.Defaults.prefixes += (u'"',)

nlp = spacy.load('de')

I haven't tested this yet, but it should work. Basically, spacy.de.German.Defaults.prefixes should hold a tuple of regex-escaped strings. Those strings are then joined into a regular expression.

The source data is in spacy/de/language_data.py, in the TOKENIZER_PREFIXES variable. If you make more improvements, a pull request would be awesome :)
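
To make the mechanism concrete, here is a minimal sketch of how a prefix tuple can be joined into a single regular expression and used to peel characters off the front of a token. The names, entries, and alternation pattern here are assumptions for illustration only, not spaCy's actual tokenizer code:

import re

# Minimal sketch, not spaCy's implementation: the prefix entries
# (already regex-escaped strings) are joined into one alternation.
prefixes = (u'\\(', u'\\[', u'"', u'#')
prefix_re = re.compile(u'^(?:%s)' % u'|'.join(prefixes))

def split_prefix(token):
    # Peel one prefix off the front of the token, if any.
    match = prefix_re.match(token)
    if match:
        return token[:match.end()], token[match.end():]
    return None, token

print(split_prefix(u'"Ich'))  # => (u'"', u'Ich')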

jgontrum (Author) commented Nov 2, 2016

Thank you for your quick reply!

Adding the quotation mark to the prefixes solved my problem.

However, I still believe there is something wrong with the TOKENIZER_PREFIXES, since the quotation mark is already on this list, so it should work by default.

Greetings!

honnibal (Member) commented Nov 2, 2016

Hmm, I assumed it was a different quotation mark that just looked similar. Is it really the same character? If so, then yes, something's wrong.

jgontrum (Author) commented Nov 2, 2016

It is. I also tested it with the octothorpe ('#'), which is definitely on the list, and "#Ich" is likewise tokenized as one token.

honnibal (Member) commented Nov 2, 2016

Hm! Think I see the problem.

honnibal added a commit that referenced this issue Nov 2, 2016
…up the tokenizer regex for non-ascii characters in Python 2. Re Issue #596
honnibal closed this as completed Nov 2, 2016
jgontrum (Author) commented Nov 3, 2016

Argh, Python 2 and its encoding issues... Thanks for fixing it so quickly!
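
For context, here is a guess at the class of bug the commit message hints at: under Python 2, a regex built from byte strings can silently fail to match non-ASCII characters in unicode text. This is an illustration of the failure mode, not the actual spaCy code:

# -*- coding: utf-8 -*-
import re

# Under Python 2, a literal '„' in source is a multi-byte UTF-8 byte
# string; compiled as a byte pattern, it never matches the single
# unicode code point in the input text.
byte_pattern = re.compile('„')       # bytes '\xe2\x80\x9e'
print(byte_pattern.match(u'„Ich'))   # None: bytes vs. unicode mismatch

# Building the pattern from unicode strings avoids the mismatch.
uni_pattern = re.compile(u'„')
print(uni_pattern.match(u'„Ich'))    # a match object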

schlichtanders commented:

I ran into very similar issues with the parentheses and brackets (, [, and {.

import spacy

nlp = spacy.load("de")
# Note below that '(Unsere' comes out as a single token.
for w in nlp(u'Die Ausstellung (Unsere heimischen Kastanienarten) war wirklich schön.'):
    print(w, w.tag_)

gives:

Die ART
Ausstellung NN
(Unsere ADJA
heimischen ADJA
Kastanienarten NN
) $.
war VAFIN
wirklich ADJD
schön ADJD
. $.

whereas if I add spacy.de.German.Defaults.prefixes += ('(',) I get the correct tokenization:

Die ART
Ausstellung NN
( $(
Unsere PPOSAT
heimischen ADJA
Kastanienarten NN
) $(
war VAFIN
wirklich ADJD
schön ADJD
. $.

Looking at the default prefixes and where they come from (https://github.com/explosion/spaCy/blob/master/spacy/language_data/punctuation.py#L39), it seems like the escaping of the parentheses (and probably more?) causes this problem.
I suppose they were escaped for use within Python's re module, but maybe somewhere else they are used as raw text, which would lead to these problems?
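
A quick sketch of the mismatch he suspects, assuming the escaped entries end up compared as raw text somewhere (this is an illustration, not spaCy's internals):

import re

entry = re.escape('(')              # '\\(' -- the escaped form stored in punctuation.py
print('(Unsere'.startswith(entry))  # False: the escaped form never matches raw text
print('(Unsere'.startswith('('))    # True
print(re.match(entry, '(Unsere'))   # a match object: works when used as a regex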

ines (Member) commented Jan 18, 2017

@schlichtanders Thanks for the report and your analysis – this makes a lot of sense. I'll add a test and take care of it!

schlichtanders commented:

Thanks for the immediate reaction. I am looking forward to it!

ines (Member) commented Jan 18, 2017

@schlichtanders Hmm, so I haven't managed to reproduce this error. Tested it with both the German model and just the tokenizer, and it's tokenized correctly:

[u'Die', u'Ausstellung', u'(', u'Unsere', u'heimischen', u'Kastanienarten', u')', u'war', u'wirklich', u'sch\xf6n', u'.']
[u'Die', u'Ausstellung', u'[', u'Unsere', u'heimischen', u'Kastanienarten', u']', u'war', u'wirklich', u'sch\xf6n', u'.']
[u'Die', u'Ausstellung', u'{', u'Unsere', u'heimischen', u'Kastanienarten', u'}', u'war', u'wirklich', u'sch\xf6n', u'.']

Which version of spaCy are you using?

Btw, this doesn't actually seem like a problem in your case, but just so you know:
By adding a ( to the prefix defaults, you might accidentally be triggering a temporary hack that automatically escapes all prefixes. In older versions, prefixes were re-escaped by the tokenizer, while the suffixes and infixes (which contained regular expressions) weren't and had to be escaped in the tokenizer exceptions. This behaviour was a little confusing and inconsistent, so it was changed to require all escaping in the tokenizer exceptions. In order to still support old unescaped prefix data, we're currently using a little hack that checks whether the prefixes contain an unescaped (. If so, it assumes the data is old and escapes all prefixes.
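
A hypothetical reconstruction of that heuristic, just to make the described behaviour concrete (the function name and details are assumptions, not spaCy's actual source):

import re

def compile_prefix_regex(prefixes):
    # If a bare '(' appears among the entries, assume old-style
    # unescaped data and escape every entry before building the pattern.
    if '(' in prefixes:
        prefixes = [re.escape(p) for p in prefixes]
    return re.compile(u'^(?:%s)' % u'|'.join(prefixes))

print(compile_prefix_regex(['(', '[', '"']).pattern)  # all entries get escaped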

schlichtanders commented:

I am very sorry. I now checked my version, and it is indeed spaCy 1.2. On PyPI there is already spaCy 1.6, which I could install (after closing my running spaCy interactive sessions and switching to the Windows command line instead of PowerShell, though this might be a local issue on my PC), and it does indeed work out of the box now.

Thanks for coming back to me so quickly!

ines (Member) commented Jan 18, 2017

No worries – glad to hear it's working! 👍

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators May 9, 2018