stopwords #649

rajhans · 2016-11-22T09:00:18Z

I have observed that spaCy treats many common verbs, such as 'call', as stopwords (as indicated by IS_STOP), which is a little out of the ordinary. Is there any information that describes how spaCy determines stopwords? Is there a way to change the stopword criteria?

ines · 2016-11-22T17:08:41Z

Thanks! I'm actually in the process of finally reorganising the language data, so there will be an update soon that fixes this problem, among other things.

We're not very happy with the current stopword lists (or most other standard stopword lists that are available tbh). They're outdated and full of pre-processing artifacts, custom hacks and other stuff that's not relevant for spaCy (like don for "don't" etc.)

It's probably okay for information extraction, but not very useful for Machine Learning at the moment. So we want to use a slightly different and non-standard approach to determine what spaCy considers a stopword and how the language data is organised in the codebase.

We're always happy about input and suggestions – although there obviously won't be a 100% perfect solution, because in the end, it's always sort of arbitrary.

In the meantime, here's how you can customise the stopword behaviour. You can set attributes in the vocabulary, and tokens will inherit these attributes:

lex = nlp.vocab[u'call']
lex.is_stop = False
doc = nlp(u'Call me!')
[(w.text, w.is_stop) for w in doc]
# [(u'Call', False), (u'me', True), (u'!', False)]

fmailhot · 2016-11-22T19:10:28Z

It would be helpful if the docs eventually included an explanation of the decision-making that went into whichever words end up being considered stopwords.

In my experience, it's better to err on the side of fewer stopwords rather than more, and to get a linguist's input (the NLTK list is actually a pretty decent starting place, notwithstanding some of its flaws). You've shown that it's easy to customise stopword behaviour, so stopword-ifying e.g. very frequent words should be straightforward.
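As a rough illustration of "stopword-ifying" very frequent words, here is a minimal sketch that only uses the Python standard library. The function name `frequent_word_candidates`, the toy corpus, and the naive whitespace tokenisation are all assumptions made for this example, not anything spaCy provides.

```python
from collections import Counter


def frequent_word_candidates(texts, top_n=3):
    """Return the top_n most frequent lowercased tokens across texts."""
    counts = Counter()
    for text in texts:
        # Naive whitespace tokenisation; strip surrounding punctuation.
        for raw in text.lower().split():
            token = raw.strip(".,!?;:\"'()")
            if token:
                counts[token] += 1
    return [word for word, _ in counts.most_common(top_n)]


# Toy corpus, purely illustrative.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
    "A cat and a dog met on the mat.",
]
candidates = frequent_word_candidates(corpus, top_n=2)
print(candidates)

# Each reviewed candidate could then be flagged the same way as in the
# snippet earlier in this thread, assuming a loaded `nlp` object:
#     nlp.vocab[word].is_stop = True
```

On this toy corpus the second candidate is a content word ('cat'), which is one reason a frequency list probably wants a human review before anything is flagged.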

rajhans · 2016-11-22T21:17:24Z

+1 to fmailhot's comment. An explanation of the stopword decisions would be helpful. IMO (or at least in my case) it is probably better to err on the conservative side when labeling stopwords: for most applications it is easier for users to explicitly label what they consider stopwords (e.g. company names in a corpus of company documents) than to explicitly 'unlist' words from the stopword list.
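To make the conservative approach concrete, here is a small sketch in plain Python: start from a deliberately short base list and add domain-specific terms on top. The set names, the company names, and the `remove_stopwords` helper are hypothetical and only serve the illustration.

```python
# Deliberately small base list (illustrative, not spaCy's or NLTK's list).
BASE_STOPWORDS = {"a", "an", "the", "of", "and"}

# Hypothetical company names a user might add for a company corpus.
DOMAIN_STOPWORDS = {"acme", "globex"}


def remove_stopwords(tokens, stopwords):
    """Keep only tokens whose lowercase form is not in stopwords."""
    return [t for t in tokens if t.lower() not in stopwords]


stopwords = BASE_STOPWORDS | DOMAIN_STOPWORDS
tokens = "Acme will call the Globex office".split()
kept = remove_stopwords(tokens, stopwords)
print(kept)
```

Because the base list is short, a verb like 'call' survives by default, and the user only has to add terms rather than hunt for words to unlist.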

nateGeorge · 2017-11-06T02:56:17Z

So where is the explanation/justification for the stopword list? This got closed so I assume the explanation was written somewhere. There are some words in there that don't make sense like 'call' and 'well'. I think it could use some improvement.

rcdilorenzo · 2018-03-17T16:13:28Z

I am also interested, since the list seems to be several times the size of the NLTK list. I'm not sure where the explanation is, but I thought it might be helpful to link to the English stop words list currently employed.

rcdilorenzo · 2018-03-17T16:20:49Z

@nateGeorge Actually, after digging through the git history, it looks like the list may have come from Stone, Dennis, Kwantes (2010), as seen in this line from the repository.

lock · 2018-05-07T21:52:43Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the performance label Nov 22, 2016

ines mentioned this issue Nov 22, 2016

stop_words assigned but not used? #639

Closed

ines added this to the Reorganise language data milestone Nov 24, 2016

ines added the 🌙 nightly label Dec 7, 2016

ines removed the 🌙 nightly label Dec 18, 2016

ines closed this as completed Dec 18, 2016

lock bot locked as resolved and limited conversation to collaborators May 7, 2018
