stopwords #649

Closed
rajhans opened this issue Nov 22, 2016 · 7 comments

Comments

@rajhans

rajhans commented Nov 22, 2016

I have observed that spaCy treats many common verbs such as 'call' as stopwords (as indicated by IS_STOP), which is a little out of the ordinary. Is there any documentation that describes how spaCy determines stopwords? Is there a way to change the stopword criteria?

@ines
Member

ines commented Nov 22, 2016

Thanks! I'm actually in the process of finally reorganising the language data, so there will be an update soon that fixes this problem, among other things.

We're not very happy with the current stopword lists (or most other standard stopword lists that are available tbh). They're outdated and full of pre-processing artifacts, custom hacks and other stuff that's not relevant for spaCy (like don for "don't" etc.)

It's probably okay for information extraction, but not very useful for Machine Learning at the moment. So we want to use a slightly different and non-standard approach to determine what spaCy considers a stopword and how the language data is organised in the codebase.

We're always happy about input and suggestions – although there obviously won't be a 100% perfect solution, because in the end, it's always sort of arbitrary.

In the meantime, here's how you can customise the stopword behaviour. You can set attributes in the vocabulary, and tokens will inherit these attributes:

# Assumes `nlp` is an already-loaded English pipeline.
lex = nlp.vocab[u'call']
lex.is_stop = False  # override the stopword flag on the lexeme
doc = nlp(u'Call me!')
[(w.text, w.is_stop) for w in doc]
# [(u'Call', False), (u'me', True), (u'!', False)]
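
For readers on a more recent spaCy release, the same idea can be sketched in a self-contained form. This is only an illustration under stated assumptions: it uses `spacy.blank("en")` (a tokenizer-only English pipeline) rather than a trained model, and `set_stop` is a hypothetical helper, not part of spaCy. It sets the flag on several casings because each surface form ("call", "Call") has its own lexeme in the vocabulary.

```python
import spacy

# Assumption: a blank English pipeline is enough here, since the stopword
# flag lives on the vocabulary and no trained model is needed.
nlp = spacy.blank("en")


def set_stop(nlp, word, value):
    """Hypothetical helper: set is_stop on common casings of `word`."""
    # Each casing is a separate lexeme, so update all of them explicitly.
    for form in {word.lower(), word.title(), word.upper()}:
        nlp.vocab[form].is_stop = value


set_stop(nlp, "call", False)

doc = nlp("Call me!")
flags = [(token.text, token.is_stop) for token in doc]
print(flags)
```

Calling `set_stop(nlp, "acme", True)` would work the same way in the other direction, marking a word that is not on the default list as a stopword.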

@fmailhot

It would be helpful if the docs eventually included an explanation of the decision-making that went into whichever words end up being considered stopwords.

In my experience, it's better to err on the side of fewer rather than more stopwords, and to get a linguist's input (the NLTK list is actually a pretty decent starting place, notwithstanding some of its flaws). You've shown that it's easy to customise stopword behaviour, so stopword-ifying e.g. very frequent words should be straightforward.
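
As a rough illustration of "stopword-ifying very frequent words", here is a library-independent sketch that picks candidates by document frequency. The function name `frequent_words`, the `min_doc_share` parameter, the simple regex tokenisation and the toy corpus are all assumptions made for this example; none of them come from spaCy or NLTK.

```python
import re
from collections import Counter


def frequent_words(texts, min_doc_share=0.8):
    """Return words that occur in at least `min_doc_share` of the texts."""
    doc_freq = Counter()
    for text in texts:
        # Count each word once per text (document frequency, not raw count).
        doc_freq.update(set(re.findall(r"[a-z']+", text.lower())))
    threshold = min_doc_share * len(texts)
    return {word for word, n in doc_freq.items() if n >= threshold}


# Toy corpus, purely illustrative.
corpus = [
    "Call the office and ask for the report.",
    "The report is on the desk.",
    "Please call me when the meeting ends.",
]

candidates = frequent_words(corpus, min_doc_share=1.0)
print(sorted(candidates))
```

The resulting candidates could then be reviewed by hand and flagged through the vocabulary attribute shown earlier in the thread. Lowering `min_doc_share` makes the list grow quickly and starts to pull in content words, which is one argument for keeping the threshold strict.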

@rajhans
Author

rajhans commented Nov 22, 2016

+1 to fmailhot's comment. An explanation of the stopword decisions would be helpful. IMO (or at least for my use case), it is probably better to err on the conservative side when labeling stopwords: for most applications, it is easier for users to explicitly add what they consider stopwords (e.g. company names in a company corpus) than to explicitly 'unlist' words from the stopword list.

@ines ines added this to the Reorganise language data milestone Nov 24, 2016
@ines ines added the 🌙 nightly Discussion and contributions related to nightly builds label Dec 7, 2016
@ines ines removed the 🌙 nightly Discussion and contributions related to nightly builds label Dec 18, 2016
@ines ines closed this as completed Dec 18, 2016
@nateGeorge

nateGeorge commented Nov 6, 2017

So where is the explanation/justification for the stopword list? This got closed so I assume the explanation was written somewhere. There are some words in there that don't make sense like 'call' and 'well'. I think it could use some improvement.

@rcdilorenzo

I am also interested, since the list seems to be several times the size of the NLTK list. I'm not sure where the explanation is, but I thought it might be helpful to link to the English stop words list currently employed.

@rcdilorenzo

@nateGeorge Actually, after digging through the git history, it looks like the list may have come from Stone, Dennis, Kwantes (2010), as seen in this line from the repository.

@lock

lock bot commented May 7, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 7, 2018