Character n-grams #40

Open
rth opened this issue Apr 29, 2019 · 4 comments
Labels
new feature This doesn't seem right

Comments

@rth
Owner

rth commented Apr 29, 2019

Allowing documents to be tokenized into character n-grams would be useful.

@rth
Owner Author

rth commented May 1, 2019

Partially addressed in #45

rth added the "new feature" label May 3, 2019
@joshlk
Collaborator

joshlk commented Jun 10, 2020

I could look into implementing an n-gram and skip-gram iterator, similar to the util functions in NLTK (http://www.nltk.org/_modules/nltk/util.html#ngrams), for both characters and words (#2)?
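
To make the proposal concrete, here is a minimal Python sketch of what such iterators could look like. It is only an illustration under stated assumptions: the names `ngrams` and `skipgrams` and the parameters `n` and `k` are hypothetical, not this project's API; no padding is applied; and the skip-gram semantics (up to `k` skipped items in total per gram) are modelled loosely on the NLTK utilities rather than copied from them.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple, TypeVar

T = TypeVar("T")


def ngrams(seq: Sequence[T], n: int) -> Iterator[Tuple[T, ...]]:
    """Yield every contiguous window of n items, without padding."""
    if n < 1:
        raise ValueError("n must be >= 1")
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


def skipgrams(seq: Sequence[T], n: int, k: int) -> Iterator[Tuple[T, ...]]:
    """Yield n-item tuples that may skip up to k items in total (assumed semantics)."""
    if n < 1 or k < 0:
        raise ValueError("n must be >= 1 and k must be >= 0")
    for i in range(len(seq)):
        head = seq[i]
        # Items that may follow `head` when at most k items are skipped.
        window = seq[i + 1:i + n + k]
        # combinations() keeps the original order and yields nothing
        # when the window is too short to complete an n-item gram.
        for tail in combinations(window, n - 1):
            yield (head, *tail)


# Works on a string (characters) or on a list of tokens (words).
print(list(ngrams("abc", 2)))        # [('a', 'b'), ('b', 'c')]
print(list(skipgrams("abcd", 2, 1)))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

Because both functions accept any sequence, the same code covers the character and word cases; with `k=0` the skip-gram iterator reduces to plain n-grams.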

@rth
Owner Author

rth commented Jun 10, 2020

Thanks @joshlk, that would be very useful! Maybe without the rightpad/leftpad options for a start? It would also be interesting to have something that works with an ngram_range parameter, as in scikit-learn's CountVectorizer:

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams.

How this parameter would extend to skip-grams is not clear, though.
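
As a rough illustration of the ngram_range behaviour for plain character n-grams, a Python sketch could look like the following. The function name `char_ngrams` is hypothetical, and this is not scikit-learn's implementation: it does no whitespace normalization or padding and simply emits all n-grams for each n in the inclusive range.

```python
from typing import Iterator, Tuple


def char_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 1)) -> Iterator[str]:
    """Yield character n-grams for every n with min_n <= n <= max_n."""
    min_n, max_n = ngram_range
    if not 1 <= min_n <= max_n:
        raise ValueError("ngram_range must satisfy 1 <= min_n <= max_n")
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the text; texts shorter
        # than n produce no n-grams for that n.
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


print(list(char_ngrams("abc", (1, 2))))  # ['a', 'b', 'c', 'ab', 'bc']
print(list(char_ngrams("abc", (2, 2))))  # ['ab', 'bc']
```

A single (min_n, max_n) pair maps naturally onto contiguous n-grams; skip-grams would need at least one more parameter (the number of allowed skips), which is why the extension is not obvious.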

There is also the question of how to chain tokenization and n-gram iterators (#21).

@joshlk
Collaborator

joshlk commented Jul 6, 2020

PR: #82

Please take a look when you get a chance
