
Feature/kskip-ngram #82

Open

joshlk wants to merge 24 commits into main

Conversation

@joshlk (Collaborator) commented Jul 6, 2020

Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.

Provides the same output as the equivalent nltk functions, although nltk sometimes generates duplicates, which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.

Currently the iterator outputs Vec<Vec<&str>>. I'm unsure whether this is desirable or whether the function should join the items into a string to output Vec<String>. E.g. currently it does vec![vec!["One", "Two"], vec!["Two", "Three"] ...] but it might be desirable to have vec!["One Two", "Two Three"].

The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This, however, considerably complicated the code. I have tried to refactor it to be as readable as possible, but it would be good to get a second pair of eyes on it.

There is also still the question of how to chain the different components together (which links to #21). Currently the transform method takes an input iterator and provides an iterator as output which can be consumed by the user.

Todo:

  • Add Python interface
  • Benchmark against nltk
  • Add function for character ngrams
  • Further documentation

@joshlk requested a review from rth July 6, 2020 15:59
@joshlk mentioned this pull request Jul 6, 2020
@joshlk (Collaborator, Author) commented Jul 7, 2020

Some more information to provide some background...

Features:

  • "ngram" - for example if n=2 it produces bigrams for example [["One", "Two"], ["Two", "Three"] ...]
  • "everygram" - all ngrams within a min_n and max_n range. e.g. min_n=1, max_n=2 [["One"], ["One", "Two"], ["Two"], ["Two", "Three"] ...]
  • "skipgram" - skip a max total of k items. e.g. if n=2, k=2: [["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Two", "Four"], ["Two", "Five"] ...]
  • padding - pad left or right items. e.g. if n=2, k=2, pad_left="<s>": [["<s>", "One"], ["<s>", "Two"], ["<s>", "Three"] ...]
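
To make the skipgram example above concrete, here is a naive standalone sketch of k-skip-n-gram enumeration. It is my own illustration, not the implementation or API in this PR; the kskip_ngrams name and the use of the itertools crate are assumptions. For each start position it picks the remaining n-1 items from the next n-1+k positions and keeps only picks that skip at most k items in total:

```
// Illustrative only: a naive k-skip-n-gram generator over an in-memory token
// slice, not the single-pass KSkipNGramsIter from this PR. Assumes the
// itertools crate for combinations().
use itertools::Itertools;

fn kskip_ngrams<'a>(tokens: &[&'a str], n: usize, k: usize) -> Vec<Vec<&'a str>> {
    assert!(n >= 1);
    let mut grams = Vec::new();
    for start in 0..tokens.len() {
        // A gram starting at `start` can only reach n - 1 + k positions ahead.
        let window_end = (start + n + k).min(tokens.len());
        if window_end - start < n {
            continue;
        }
        for tail in (start + 1..window_end).combinations(n - 1) {
            // Number of skipped positions = span covered by the gram minus n.
            let span = tail.last().copied().unwrap_or(start) - start + 1;
            if span - n <= k {
                let mut gram = vec![tokens[start]];
                gram.extend(tail.iter().map(|&i| tokens[i]));
                grams.push(gram);
            }
        }
    }
    grams
}

fn main() {
    let tokens = ["One", "Two", "Three", "Four", "Five"];
    // n=2, k=2 reproduces the skipgram example above.
    for gram in kskip_ngrams(&tokens, 2, 2) {
        println!("{:?}", gram);
    }
}
```

Unlike the iterator in this PR, this sketch holds the whole token sequence in memory and makes no attempt at a single pass.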

An overview of how KSkipNGramsIter works:

  • On initialisation of the struct build_window populates the first few items into a window of items
  • The KSkipNGramsIter then has 4 modes, which it executes in order, exhausting each one.
  • The first mode is PadLeft, which generates the left-padded grams from the pre-populated window. It essentially iterates through a number of parameters which generate different samples of items from the window:
    • n - the number of items, i.e. 1, 2, 3. e.g. [["One"], ["One", "Two"], ["One", "Two", "Three"] ...] The range is defined by min_n and max_n
    • p - the amount of padding. For example the gram ["<s>", "One", "Two"] has padding=1, n=3. The range is between 0 and max_n-1
    • sample_iter - this is a separate iterator which is initialised for each n,p combination. When there is no skipping (max_k=0) it just produces the indexes of the first elements, i.e. ["One", "Two"]. But when skipping is enabled (k>0) it iterates through all possible combinations. For example if k=2 (n=2, p=0) it would produce the indices of [["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Three", "Four"]]
  • The second mode is Main. For each step of the window it iterates through all possible n's and gram skips (sample_iter, i.e. k>0). Then the window is stepped forward by one.
  • The third mode is MainEnd. This samples from the last window (once the input iterator has been exhausted). It is only necessary when n has a range (i.e. min_n != max_n). Here an offset parameter is incremented, which simulates a smaller window.
  • The fourth mode is PadRight, which is similar to PadLeft, as the name suggests. (A simplified sketch of the Main-mode windowing follows after this list.)
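
For readers who want a feel for the windowed, single-pass idea, here is a heavily simplified sketch of the Main mode only. It is my own simplification, not the PR's code: no padding, no skipping, and no MainEnd handling, so grams in the final partial window are dropped; names are illustrative.

```
// A minimal sketch of the window idea behind the Main mode: fill a window of
// max_n tokens, emit the grams of length min_n..=max_n starting at the front,
// then step the window forward by one, consuming the input only once.
use std::collections::VecDeque;

struct NGramWindow<I: Iterator<Item = String>> {
    input: I,
    window: VecDeque<String>,
    min_n: usize,
    max_n: usize,
    next_n: usize, // next gram length to emit at the current window position
}

impl<I: Iterator<Item = String>> NGramWindow<I> {
    fn new(input: I, min_n: usize, max_n: usize) -> Self {
        NGramWindow { input, window: VecDeque::new(), min_n, max_n, next_n: min_n }
    }
}

impl<I: Iterator<Item = String>> Iterator for NGramWindow<I> {
    type Item = Vec<String>;

    fn next(&mut self) -> Option<Vec<String>> {
        loop {
            // Fill the window up to max_n tokens (the PR's build_window step).
            while self.window.len() < self.max_n {
                match self.input.next() {
                    Some(tok) => self.window.push_back(tok),
                    None => return None, // a real MainEnd mode would drain the tail here
                }
            }
            // Emit the grams that start at the front of the window: min_n..=max_n.
            if self.next_n <= self.max_n {
                let n = self.next_n;
                self.next_n += 1;
                return Some(self.window.iter().take(n).cloned().collect());
            }
            // All n exhausted at this position: step the window forward by one.
            self.window.pop_front();
            self.next_n = self.min_n;
        }
    }
}

fn main() {
    let tokens: Vec<String> = vec!["One", "Two", "Three", "Four"]
        .into_iter()
        .map(String::from)
        .collect();
    for gram in NGramWindow::new(tokens.into_iter(), 1, 2) {
        println!("{:?}", gram);
    }
}
```

With min_n=1, max_n=2 this emits ["One"], ["One", "Two"], ["Two"], ["Two", "Three"], ..., matching the everygram ordering described above.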

Let me know if you require further clarification or information. Happy to help and take feedback 😃

@rth (Owner) left a comment

Thanks for working on this! A very superficial review below.

I'm unsure if this is desirable or the function should join the items into a string to output Vec<String>

I'm also unsure about it. On one side, with joined tokens it would be easier to use them later. E.g. when counting tokens, whether it's unigrams or skip-grams the code would work identically. On the other, it would probably have some performance cost, as one would not be able to use &str and would need to create new strings for the aggregated tokens. It would also make it difficult to recover tokens from the ngram (scikit-learn/scikit-learn#13733). I think the current approach is fine.
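
To make the trade-off concrete, joining would mean one extra String allocation per gram; a trivial sketch (illustrative only, join_grams is not part of the PR):

```
fn join_grams(grams: Vec<Vec<&str>>) -> Vec<String> {
    // Each join allocates a fresh String; keeping &str output avoids this cost.
    grams.into_iter().map(|g| g.join(" ")).collect()
}

fn main() {
    let grams = vec![vec!["One", "Two"], vec!["Two", "Three"]];
    assert_eq!(join_grams(grams), vec!["One Two", "Two Three"]);
}
```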

I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code.

Do we know the performance impact of making multiple passes over the iterator, e.g. with everygrams? I understand the issue with multiple passes over the data, as one would then have the disk I/O cost, but for a single document the cost likely wouldn't be too high? Also, realistically, when using this estimator in a pipeline with a tokenizer, if the run time for this step is much smaller than the tokenizer's anyway, optimizing it too much might not be very noticeable.

It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.

```
/// assert_eq!(grams, vec![vec!["One", "Two", "Three"], vec!["Two", "Three", "Four"]]);
/// ```
pub fn new_trigram() -> KSkipNGrams {
    KSkipNGramsParams::new(3, 3, 0).build()
}
```

@rth (Owner)

Do you think we need these methods? For instance scikit-learn has only a ngram_range parameter that corresponds to the first two numbers and users seem to find it fine. The additional max_k parameter would be only a small complexity on top of it, in terms of API.

@joshlk (Collaborator, Author)

I added them as NLTK has them as convenience methods, but happy to remove them. They are useful for demonstrating the different examples.

(Outdated review comments on src/ngram_utils/mod.rs and src/lib.rs resolved)
@joshlk (Collaborator, Author) commented Jul 15, 2020

It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.

Fair point. I will implement a simpler version and then we can compare performance. You would also lose the ordering of tokens, but this isn't such a big problem.

@joshlk (Collaborator, Author) commented Jul 16, 2020

I have changed the approach and it now chains ngram or skipgram iterators together. Overall it's much simpler to understand. The only issue is that you lose the ordering of the tokens, as said before, but that's not such a big issue.

I did some simple benchmarking, which isn't included in this pull request, and the performance is actually 2x faster, which was surprising as the input iterator is split using tee and multiple caches are used. So overall much better. Let me know what you think.
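
For context, here is a rough sketch of what chaining fixed-n passes over a tee'd iterator can look like. This is my own illustration, under the assumption that the itertools crate is available; the ngrams helper is hypothetical, not the PR's API.

```
// Sketch: build everygrams by splitting the token iterator with tee() and
// running one fixed-n pass per n, then chaining the outputs together.
use itertools::Itertools;
use std::collections::VecDeque;

fn ngrams<'a>(tokens: impl Iterator<Item = &'a str>, n: usize) -> Vec<Vec<&'a str>> {
    let mut window: VecDeque<&'a str> = VecDeque::with_capacity(n);
    let mut out = Vec::new();
    for tok in tokens {
        window.push_back(tok);
        if window.len() > n {
            window.pop_front();
        }
        if window.len() == n {
            out.push(window.iter().copied().collect());
        }
    }
    out
}

fn main() {
    let tokens = "One Two Three Four".split(' ');

    // Split the input iterator once; each copy feeds one fixed-n pass.
    let (t1, t2) = tokens.tee();

    // everygrams for min_n=1, max_n=2: unigrams followed by bigrams.
    let everygrams: Vec<Vec<&str>> = ngrams(t1, 1).into_iter().chain(ngrams(t2, 2)).collect();
    println!("{:?}", everygrams);
}
```

Note that the output is concatenated (all unigrams, then all bigrams), which is exactly the loss of interleaved ordering mentioned above.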

@joshlk (Collaborator, Author) commented Jul 16, 2020

I have added the Python API. Currently it takes a list and outputs a list. Do we want the Python interface to take an iterator as input and produce an iterator?

@rth (Owner) commented Jul 17, 2020

Thanks a lot @joshlk! I'll review in more detail tomorrow, but after a superficial glance this version looks significantly less complex.

Do we want the Python interface to take an iterator as input and produce an iterator?

I think a list is fine. That's what we are doing for tokenizers. If we change that later we should do it everywhere, but I suspect that lists won't have much performance impact and are easier to manipulate.

There is just a linting issue in the CI.

@rth (Owner) left a comment

Thank you for the thorough work @joshlk! A very minor comment, otherwise LGTM. Would you mind also running cargo clippy and checking that there are no new warnings in the code? We should probably add clippy to CI.

(Outdated review comment on python/vtext/tests/test_token_processing.py resolved)
@rth (Owner) commented Jul 21, 2020

I merged one minor suggestion in the test and then had to fix the issues it introduced, sorry!

There are indeed a few clippy warnings and errors that it would be good to fix:

```
warning: unneeded `return` statement
   --> src/token_processing/mod.rs:339:9
    |
339 |         return Some(Vec::from(self.window.clone()));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: remove `return`: `Some(Vec::from(self.window.clone()))`
    |
    = note: `#[warn(clippy::needless_return)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return

warning: unneeded `return` statement
   --> src/token_processing/mod.rs:446:9
    |
446 | /         return match next_sample {
447 | |             // Generate and return samples using self.sample_iter
448 | |             Some(sample_idx) => {
449 | |                 let mut sample = Vec::with_capacity(sample_idx.len());
...   |
489 | |             }
490 | |         };
    | |__________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return
help: remove `return`
    |
446 |         match next_sample {
447 |             // Generate and return samples using self.sample_iter
448 |             Some(sample_idx) => {
449 |                 let mut sample = Vec::with_capacity(sample_idx.len());
450 |                 for idx in sample_idx.iter() {
451 |                     sample.push(self.window[*idx].clone());
  ...

error: using `clone` on a double-reference; this will copy the reference instead of cloning the inner type
   --> src/token_processing/mod.rs:451:33
    |
451 |                     sample.push(self.window[*idx].clone());
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[deny(clippy::clone_double_ref)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref
help: try dereferencing it
    |
451 |                     sample.push(&(*self.window[*idx]).clone());
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: or try being explicit if you are sure, that you want to clone a reference
    |
451 |                     sample.push(<&str>::clone(self.window[*idx]));
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Otherwise we can merge, unless you wanted to add other things?

@rth (Owner) commented Jul 21, 2020

Though in terms of API, it's maybe not ideal that if there are not enough tokens (e.g. fewer than 2 here for KSkipNGrams(min_n=1, max_n=1, max_k=1)) it would error. It's a very typical situation to have few or zero tokens for some documents, and it might be better to return an empty list.

Also I'm actually somewhat confused as to why KSkipNGrams(min_n=1, max_n=1, max_k=k) returns the input as is (with an extra list container) for any k. I would have expected skip grams to only make sense for max_n >=2 and raise an error otherwise, or am I missing something?

@joshlk (Collaborator, Author) commented Jul 22, 2020

Though in terms of API, it's maybe not ideal that if there are not enough tokens (e.g. fewer than 2 here for KSkipNGrams(min_n=1, max_n=1, max_k=1)) it would error. It's a very typical situation to have few or zero tokens for some documents, and it might be better to return an empty list.

Agreed. I have changed it so that empty input, or input smaller than n, gives an empty output. I have added tests for these cases. On a side note, strangely "".split(" ") does not give an empty list in Rust but instead outputs [""]. This could be one to watch out for.
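
To illustrate that gotcha (plain Rust, nothing specific to this PR):

```
fn main() {
    // str::split on an empty string yields a single empty token, not nothing.
    let tokens: Vec<&str> = "".split(' ').collect();
    assert_eq!(tokens, vec![""]);

    // split_whitespace, by contrast, yields nothing for an empty string.
    let none: Vec<&str> = "".split_whitespace().collect();
    assert!(none.is_empty());
}
```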

Also I'm actually somewhat confused as to why KSkipNGrams(min_n=1, max_n=1, max_k=k) returns the input as is (with an extra list container) for any k. I would have expected skip grams to only make sense for max_n >=2 and raise an error otherwise, or am I missing something?

Because k can equal 0, which doesn't require max_n >= 2.

cargo clippy

I went through the cargo clippy suggestions, which were very useful. Thanks for bringing my attention to that tool. It might be worth adding it to the CI toolchain. There were a few suggestions for other files in the package but I left those alone.

Otherwise we can merge, unless you wanted to add other things?

Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?

Also, do we want to include the convenience methods that are included in NLTK in Python and Rust? (e.g. bigram, everygram ...)

benchmark

I have created some benchmarks in Python and the vtext functions are currently considerably slower than the nltk equivalents. Not sure why this is the case at this point. Maybe it's to do with the conversion of strings from Rust to Python. This was the output:

```
# Testing 19924 documents
vtext: everygram: 133.95s [0.7 MB/s, 54 kWPS]
nltk: everygram: 5.06s [18.0 MB/s, 1419 kWPS]
vtext: skipgram: 498.58s [0.2 MB/s, 14 kWPS]
nltk: skipgram: 18.01s [5.1 MB/s, 399 kWPS]
```

@rth (Owner) commented Jul 22, 2020

This was the output:

Are you sure you ran setup.py install and not develop, so that you're in release mode?

@rth (Owner) commented Jul 22, 2020

Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?

Good point. Let's keep it for now and consider that in a follow up PR as this one is already getting quite large.

Also, do we want to include the convenience methods that are included in NLTK in python and rust? (e.g. bigram, everygram ...)

Well, NLTK doesn't really have a very homogeneous and consistent API across modules, as far as I can tell, so I would be careful not to adapt it too closely. But we certainly need to think about how to make all these parts work smoothly together.

vtext functions are currently considerably slower than the nltk equivalents.

Hmm, yes I can confirm with setup.py install. We would need to profile (e.g. with flamegraph) to understand what's going on. I would say it shouldn't be the Rust->Python conversions, since we are doing that for character ngrams in CharacterTokenizer and that's working fine in Python. I can look into profiling as well in the next few days.

Another unrelated thing I was wondering about: wouldn't it be faster and easier to map tokens to an index at the beginning, pass those usize indices around, and finally recover the list of n-gram string tokens at the end? Technically &str should be equivalent I guess, but we would have fewer issues with lifetimes and potentially with the cost of making copies.
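
A small sketch of that interning idea (made-up names, independent of the PR's code): map each token to a usize once, run the n-gram logic on indices, and only look the strings up again at the end.

```
use std::collections::HashMap;

fn main() {
    let tokens = ["One", "Two", "One", "Three"];

    // Intern: token -> index, plus the reverse table for recovery at the end.
    let mut vocab: HashMap<&str, usize> = HashMap::new();
    let mut rev: Vec<&str> = Vec::new();
    let mut ids: Vec<usize> = Vec::with_capacity(tokens.len());
    for &t in &tokens {
        let id = *vocab.entry(t).or_insert_with(|| {
            rev.push(t);
            rev.len() - 1
        });
        ids.push(id);
    }

    // Any n-gram logic can now pass plain usize values around (no lifetimes)...
    let bigram_ids: Vec<Vec<usize>> = ids.windows(2).map(|w| w.to_vec()).collect();

    // ...and the string tokens are only recovered when producing the final output.
    let bigrams: Vec<Vec<&str>> = bigram_ids
        .iter()
        .map(|g| g.iter().map(|&i| rev[i]).collect())
        .collect();
    println!("{:?}", bigrams);
}
```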
