Add German Stopwords #638

Merged 2 commits on Nov 20, 2016
Conversation

Contributor

@souravsingh commented Nov 19, 2016

Adds a list of stopwords for the German language.

Fixes Issue #364

Member

@ines left a comment

This is great, thank you!

One question: I noticed that the list includes "a" and "b" – are these pre-processing artifacts, or are they there for a reason? I guess "a" makes sense as an alternate spelling of "á" (which should probably be in there as well). Not sure about "b", though.

I totally hadn't realised this list was missing from the German data, btw. I have a few more ideas, so will be adding those later! 👍

@souravsingh
Contributor Author

@ines Thanks for the review. The "a" and "b" in the list are the result of my mistake. I apologize for that.

@ExplodingCabbage
Contributor

@souravsingh, how did you compile this list? Is it taken from a single source, or multiple? To what extent have you personally tweaked that data by adding words you noticed were missing or removing words you thought were inappropriate? This list looks similar, but not identical, to https://github.com/wgpsutherland/stopwords/blob/master/dist/de.json; I couldn't find any other plausible-looking sources on Google.

(Note that I'm not a project maintainer, just a random guy from the internet, but if I were in @ines's place I'd personally want to know what the source of the data was before merging.)

@souravsingh
Contributor Author

@ExplodingCabbage The list was taken from multiple sources: the website at http://codingwiththomas.blogspot.in/2012/01/german-stop-words.html and the stopwords list from Apache Lucene. I have some knowledge of German, so identifying which ones are plausible stopwords wasn't really difficult.
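
For readers wondering how a list like this might be combined and sanity-checked, here is a minimal sketch in Python. It is not the script used for this PR: the function names and the sample words are hypothetical, and the two input lists are small stand-ins, not the actual contents of the blog post or the Apache Lucene list. It merges several sources into one set and flags single-character entries (such as the stray "a" and "b" mentioned above) for manual review.

```python
def merge_stopwords(*sources):
    """Union several stopword lists, normalising case and whitespace."""
    merged = set()
    for source in sources:
        for word in source:
            cleaned = word.strip().lower()
            if cleaned:  # skip empty strings left over from preprocessing
                merged.add(cleaned)
    return merged


def find_suspect_entries(stopwords, allowed_single_chars=frozenset()):
    """Return single-character entries that are not explicitly allowed."""
    return sorted(
        w for w in stopwords if len(w) == 1 and w not in allowed_single_chars
    )


# Hypothetical sample data, NOT the real lists from either source.
blog_list = ["aber", "Als", "a", "b", "und "]
lucene_list = ["aber", "auch", "und", "der"]

STOP_WORDS = merge_stopwords(blog_list, lucene_list)
suspects = find_suspect_entries(STOP_WORDS)

# Drop the flagged entries; a reviewer could instead whitelist some of them
# via allowed_single_chars.
STOP_WORDS -= set(suspects)
```

Keeping the check separate from the merge means a reviewer can decide case by case whether a one-letter entry is an artifact or belongs in the list.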

@ines
Member

ines commented Nov 20, 2016

Thanks for the info! Will merge this now and make a few edits and additions.

Btw, @honnibal and I went over the current state of the language data earlier and it could definitely use some better organisation. Starting with basic formatting, but also more complex stuff – for example, having a global module for emoticons that can be imported across languages (instead of having the same data live in each language).

So I'll be taking this on over the next week or so 😃 English and German will be easy (native German speaker here), but we might post a call for native speakers for the other languages soon, just to have another pair of eyes making sure it's all good.

We have such a great community from all over the world, so we should be able to do this! 💪
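
The idea of a global emoticons module shared across languages could look roughly like the sketch below. This is only an illustration under assumptions: the names `EMOTICONS` and `build_language_data`, the dictionary layout, and the sample entries are all hypothetical and do not reflect the project's actual module structure. In a real package the shared set would live in its own module and be imported by each language.

```python
# Shared data that would live in one global module (hypothetical layout).
EMOTICONS = frozenset([":)", ":-)", ":(", ";)"])


def build_language_data(stop_words, extra_emoticons=()):
    """Assemble per-language data, reusing the shared emoticon set."""
    return {
        "stop_words": frozenset(stop_words),
        # Each language extends the shared set instead of copying it.
        "emoticons": EMOTICONS | frozenset(extra_emoticons),
    }


# Hypothetical per-language data with placeholder stopwords.
GERMAN = build_language_data(["aber", "und"])
ENGLISH = build_language_data(["but", "and"], extra_emoticons=["<3"])
```

With this layout, adding an emoticon to the shared set updates every language at once, while language-specific additions stay local.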

@ines merged commit d24aaad into explosion:master Nov 20, 2016
@souravsingh deleted the add-stopwords branch November 20, 2016 15:10
3 participants