
Feature/kskip-ngram #82

Open

joshlk wants to merge 24 commits into main

Conversation

@joshlk (Collaborator) commented Jul 6, 2020

Implemented k-skip-n-grams. Has convenience functions for bigram, trigram, ngrams, everygrams and skipgrams.

Provides the same output as the equivalent nltk functions, although nltk sometimes generates duplicates, which are omitted here. The iterator consumes the input iterator only once and holds a window of items to generate the grams. The window is stepped forward as it consumes the input. It also correctly generates left or right padding if specified.

Currently the iterator outputs Vec<Vec<&str>>. I'm unsure whether this is desirable or whether the function should join the items into a string to output Vec<String>. E.g. currently it does vec![vec!["One", "Two"], vec!["Two", "Three"] ...] but it might be desirable to have vec!["One Two", "Two Three"].

The nltk implementations tend to consume the input sequence multiple times and concatenate the output (for example everygram). I wanted the struct to consume the input iterator only once and provide an output. This, however, considerably complicated the code. I have tried to refactor it to be as readable as possible, but it would be good to get a second pair of eyes on it.

There is also still the question of how to chain the different components together (which links to #21). Currently the transform method takes an input iterator and provides an iterator as output which can be consumed by the user.

Todo:

  • Add Python interface
  • Benchmark against nltk
  • Add function for character ngrams
  • Further documentation

@joshlk requested a review from rth July 6, 2020 15:59
@joshlk mentioned this pull request Jul 6, 2020
@joshlk (Collaborator, Author) commented Jul 7, 2020

Some more information to provide some background...

Features:

  • "ngram" - for example if n=2 it produces bigrams for example [["One", "Two"], ["Two", "Three"] ...]
  • "everygram" - all ngrams within a min_n and max_n range. e.g. min_n=1, max_n=2 [["One"], ["One", "Two"], ["Two"], ["Two", "Three"] ...]
  • "skipgram" - skip a max total of k items. e.g. if n=2, k=2: [["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Two", "Four"], ["Two", "Five"] ...]
  • padding - pad left or right items. e.g. if n=2, k=2, pad_left="<s>": [["<s>", "One"], ["<s>", "Two"], ["<s>", "Three"] ...]
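
To make the skipgram example above concrete, here is a naive standalone sketch of k-skip-n-gram enumeration. It is my own illustration, not the implementation or API in this PR; the kskip_ngrams name and the use of the itertools crate are assumptions. For each start position it picks the remaining n-1 items from the next n-1+k positions and keeps only picks that skip at most k items in total:

```
// Illustrative only: a naive k-skip-n-gram generator over an in-memory token
// slice, not the single-pass KSkipNGramsIter from this PR. Assumes the
// itertools crate for combinations().
use itertools::Itertools;

fn kskip_ngrams<'a>(tokens: &[&'a str], n: usize, k: usize) -> Vec<Vec<&'a str>> {
    assert!(n >= 1);
    let mut grams = Vec::new();
    for start in 0..tokens.len() {
        // A gram starting at `start` can only reach n - 1 + k positions ahead.
        let window_end = (start + n + k).min(tokens.len());
        if window_end - start < n {
            continue;
        }
        for tail in (start + 1..window_end).combinations(n - 1) {
            // Number of skipped positions = span covered by the gram minus n.
            let span = tail.last().copied().unwrap_or(start) - start + 1;
            if span - n <= k {
                let mut gram = vec![tokens[start]];
                gram.extend(tail.iter().map(|&i| tokens[i]));
                grams.push(gram);
            }
        }
    }
    grams
}

fn main() {
    let tokens = ["One", "Two", "Three", "Four", "Five"];
    // n=2, k=2 reproduces the skipgram example above.
    for gram in kskip_ngrams(&tokens, 2, 2) {
        println!("{:?}", gram);
    }
}
```

Unlike the iterator in this PR, this sketch holds the whole token sequence in memory and makes no attempt at a single pass.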

An overview of how KSkipNGramsIter works:

  • On initialisation of the struct build_window populates the first few items into a window of items
  • The KSkipNGramsIter then has 4 modes, which it executes in order, exhausting each one.
  • The first mode is PadLeft, which generates the left-padded grams from the pre-populated window. It essentially iterates through a number of parameters which generate different samples of items from the window:
    • n - the number of items, i.e. 1, 2, 3. e.g. [["One"], ["One", "Two"], ["One", "Two", "Three"] ...] The range is defined by min_n and max_n
    • p - the amount of padding. For example the gram ["<s>", "One", "Two"] has padding=1, n=3. The range is between 0 and max_n-1
    • sample_iter - this is a separate iterator which is initialised for each n,p combination. When there is no skipping (max_k=0) it just produces the indexes of the first elements, i.e. ["One", "Two"]. But when skipping is enabled (k>0) it iterates through all possible combinations. For example if k=2 (n=2, p=0) it would produce the indices of [["One", "Two"], ["One", "Three"], ["One", "Four"], ["Two", "Three"], ["Three", "Four"]]
  • The second mode is Main. For each step of the window it iterates through all possible n's and gram skips (sample_iter, i.e. k>0). Then the window is stepped forward by one.
  • The third mode is MainEnd. This samples from the last window (once the input iterator has been exhausted). It is only necessary when n has a range (i.e. min_n != max_n). Here an offset parameter is incremented, which simulates a smaller window.
  • The fourth mode is PadRight, which is similar to PadLeft, as the name suggests. (A simplified sketch of the Main-mode windowing follows after this list.)
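
For readers who want a feel for the windowed, single-pass idea, here is a heavily simplified sketch of the Main mode only. It is my own simplification, not the PR's code: no padding, no skipping, and no MainEnd handling, so grams in the final partial window are dropped; names are illustrative.

```
// A minimal sketch of the window idea behind the Main mode: fill a window of
// max_n tokens, emit the grams of length min_n..=max_n starting at the front,
// then step the window forward by one, consuming the input only once.
use std::collections::VecDeque;

struct NGramWindow<I: Iterator<Item = String>> {
    input: I,
    window: VecDeque<String>,
    min_n: usize,
    max_n: usize,
    next_n: usize, // next gram length to emit at the current window position
}

impl<I: Iterator<Item = String>> NGramWindow<I> {
    fn new(input: I, min_n: usize, max_n: usize) -> Self {
        NGramWindow { input, window: VecDeque::new(), min_n, max_n, next_n: min_n }
    }
}

impl<I: Iterator<Item = String>> Iterator for NGramWindow<I> {
    type Item = Vec<String>;

    fn next(&mut self) -> Option<Vec<String>> {
        loop {
            // Fill the window up to max_n tokens (the PR's build_window step).
            while self.window.len() < self.max_n {
                match self.input.next() {
                    Some(tok) => self.window.push_back(tok),
                    None => return None, // a real MainEnd mode would drain the tail here
                }
            }
            // Emit the grams that start at the front of the window: min_n..=max_n.
            if self.next_n <= self.max_n {
                let n = self.next_n;
                self.next_n += 1;
                return Some(self.window.iter().take(n).cloned().collect());
            }
            // All n exhausted at this position: step the window forward by one.
            self.window.pop_front();
            self.next_n = self.min_n;
        }
    }
}

fn main() {
    let tokens: Vec<String> = vec!["One", "Two", "Three", "Four"]
        .into_iter()
        .map(String::from)
        .collect();
    for gram in NGramWindow::new(tokens.into_iter(), 1, 2) {
        println!("{:?}", gram);
    }
}
```

With min_n=1, max_n=2 this emits ["One"], ["One", "Two"], ["Two"], ["Two", "Three"], ..., matching the everygram ordering described above.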

Let me know if you require further clarification or information. Happy to help and take feedback 😃

@rth (Owner) left a comment

Thanks for working on this! A very superficial review below.

I'm unsure if this is desirable or the function should join the items into a string to output Vec<String>

I'm also unsure about it. On one side, with joined tokens it would be easier to use them later. E.g. when counting tokens, whether it's unigrams or skip-grams the code would work identically. On the other, it would probably have some performance cost, as one would not be able to use &str and would need to create new strings for the aggregated tokens. It would also make it difficult to recover tokens from the ngram (scikit-learn/scikit-learn#13733). I think the current approach is fine.
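
To make the trade-off concrete, joining would mean one extra String allocation per gram; a trivial sketch (illustrative only, join_grams is not part of the PR):

```
fn join_grams(grams: Vec<Vec<&str>>) -> Vec<String> {
    // Each join allocates a fresh String; keeping &str output avoids this cost.
    grams.into_iter().map(|g| g.join(" ")).collect()
}

fn main() {
    let grams = vec![vec!["One", "Two"], vec!["Two", "Three"]];
    assert_eq!(join_grams(grams), vec!["One Two", "Two Three"]);
}
```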

I wanted the struct to consume the input iterator only once and provide an output. This however considerably complicated the code.

Do we know the performance impact of making multiple passes over the iterator, e.g. with everygrams? I understand the issue with multiple passes over the data, as one would then have the disk I/O cost, but for a single document the cost likely wouldn't be too high? Also, realistically, when using this estimator in a pipeline with a tokenizer, if the run time for this step is much smaller than the tokenizer's anyway, optimizing it too much might not be very noticeable.

It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.

```
/// assert_eq!(grams, vec![vec!["One", "Two", "Three"], vec!["Two", "Three", "Four"]]);
/// ```
pub fn new_trigram() -> KSkipNGrams {
    KSkipNGramsParams::new(3, 3, 0).build()
}
```

@rth (Owner)

Do you think we need these methods? For instance scikit-learn has only a ngram_range parameter that corresponds to the first two numbers and users seem to find it fine. The additional max_k parameter would be only a small complexity on top of it, in terms of API.

@joshlk (Collaborator, Author)

I added them as NLTK has them as convenience methods, but happy to remove them. They are useful for demonstrating the different examples.

(Outdated review comments on src/ngram_utils/mod.rs and src/lib.rs resolved)
@joshlk (Collaborator, Author) commented Jul 15, 2020

It's very nice work, I'm just a bit concerned about the complexity of the implementation for future maintenance.

Fair point. I will implement a simpler version and then we can compare performance. You would also lose the ordering of tokens, but this isn't such a big problem.

@joshlk (Collaborator, Author) commented Jul 16, 2020

I have changed the approach and it now chains ngram or skipgram iterators together. Overall it's much simpler to understand. The only issue is that you lose the ordering of the tokens, as said before, but that's not such a big issue.

I did some simple benchmarking, which isn't included in this pull request, and the performance is actually 2x faster, which was surprising as the input iterator is split using tee and multiple caches are used. So overall much better. Let me know what you think.
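
For context, here is a rough sketch of what chaining fixed-n passes over a tee'd iterator can look like. This is my own illustration, under the assumption that the itertools crate is available; the ngrams helper is hypothetical, not the PR's API.

```
// Sketch: build everygrams by splitting the token iterator with tee() and
// running one fixed-n pass per n, then chaining the outputs together.
use itertools::Itertools;
use std::collections::VecDeque;

fn ngrams<'a>(tokens: impl Iterator<Item = &'a str>, n: usize) -> Vec<Vec<&'a str>> {
    let mut window: VecDeque<&'a str> = VecDeque::with_capacity(n);
    let mut out = Vec::new();
    for tok in tokens {
        window.push_back(tok);
        if window.len() > n {
            window.pop_front();
        }
        if window.len() == n {
            out.push(window.iter().copied().collect());
        }
    }
    out
}

fn main() {
    let tokens = "One Two Three Four".split(' ');

    // Split the input iterator once; each copy feeds one fixed-n pass.
    let (t1, t2) = tokens.tee();

    // everygrams for min_n=1, max_n=2: unigrams followed by bigrams.
    let everygrams: Vec<Vec<&str>> = ngrams(t1, 1).into_iter().chain(ngrams(t2, 2)).collect();
    println!("{:?}", everygrams);
}
```

Note that the output is concatenated (all unigrams, then all bigrams), which is exactly the loss of interleaved ordering mentioned above.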

@joshlk (Collaborator, Author) commented Jul 16, 2020

I have added the Python API. Currently it takes a list and outputs a list. Do we want the Python interface to take an iterator as input and produce an iterator?

@rth (Owner) commented Jul 17, 2020

Thanks a lot @joshlk! I'll review in more detail tomorrow, but after a superficial glance this version looks significantly less complex.

Do we want the Python interface to take an iterator as input and produce an iterator?

I think a list is fine. That's what we are doing for tokenizers. If we change that later we should do it everywhere, but I suspect that lists won't have much performance impact and are easier to manipulate.

There is just a linting issue in the CI.

@rth (Owner) left a comment

Thank you for the thorough work @joshlk! A very minor comment, otherwise LGTM. Would you mind also running cargo clippy and checking that there are no new warnings in the code? We should probably add clippy to CI.

(Outdated review comment on python/vtext/tests/test_token_processing.py resolved)
@rth (Owner) commented Jul 21, 2020

I merged one minor suggestion in the test and then had to fix the issues it introduced, sorry!

There are indeed a few clippy warnings and errors that it would be good to fix:

```
warning: unneeded `return` statement
   --> src/token_processing/mod.rs:339:9
    |
339 |         return Some(Vec::from(self.window.clone()));
    |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ help: remove `return`: `Some(Vec::from(self.window.clone()))`
    |
    = note: `#[warn(clippy::needless_return)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return

warning: unneeded `return` statement
   --> src/token_processing/mod.rs:446:9
    |
446 | /         return match next_sample {
447 | |             // Generate and return samples using self.sample_iter
448 | |             Some(sample_idx) => {
449 | |                 let mut sample = Vec::with_capacity(sample_idx.len());
...   |
489 | |             }
490 | |         };
    | |__________^
    |
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#needless_return
help: remove `return`
    |
446 |         match next_sample {
447 |             // Generate and return samples using self.sample_iter
448 |             Some(sample_idx) => {
449 |                 let mut sample = Vec::with_capacity(sample_idx.len());
450 |                 for idx in sample_idx.iter() {
451 |                     sample.push(self.window[*idx].clone());
  ...

error: using `clone` on a double-reference; this will copy the reference instead of cloning the inner type
   --> src/token_processing/mod.rs:451:33
    |
451 |                     sample.push(self.window[*idx].clone());
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^
    |
    = note: `#[deny(clippy::clone_double_ref)]` on by default
    = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#clone_double_ref
help: try dereferencing it
    |
451 |                     sample.push(&(*self.window[*idx]).clone());
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
help: or try being explicit if you are sure, that you want to clone a reference
    |
451 |                     sample.push(<&str>::clone(self.window[*idx]));
    |                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


Otherwise we can merge, unless you wanted to add other things?

@rth (Owner) commented Jul 21, 2020

Though in terms of API, it's maybe not ideal that if there are not enough tokens (e.g. fewer than 2 here for KSkipNGrams(min_n=1, max_n=1, max_k=1)) it would error. It's a very typical situation to have few or zero tokens for some documents, and it might be better to return an empty list.

Also I'm actually somewhat confused as to why KSkipNGrams(min_n=1, max_n=1, max_k=k) returns the input as is (with an extra list container) for any k. I would have expected skip grams to only make sense for max_n >=2 and raise an error otherwise, or am I missing something?

@joshlk (Collaborator, Author) commented Jul 22, 2020

Though in terms of API, it's maybe not ideal that if there are not enough tokens (e.g. fewer than 2 here for KSkipNGrams(min_n=1, max_n=1, max_k=1)) it would error. It's a very typical situation to have few or zero tokens for some documents, and it might be better to return an empty list.

Agreed. I have changed it so that empty input, or input smaller than n, gives an empty output. I have added tests for these cases. On a side note, strangely "".split(" ") does not give an empty list in Rust but instead outputs [""]. This could be one to watch out for.
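
To illustrate that gotcha (plain Rust, nothing specific to this PR):

```
fn main() {
    // str::split on an empty string yields a single empty token, not nothing.
    let tokens: Vec<&str> = "".split(' ').collect();
    assert_eq!(tokens, vec![""]);

    // split_whitespace, by contrast, yields nothing for an empty string.
    let none: Vec<&str> = "".split_whitespace().collect();
    assert!(none.is_empty());
}
```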

Also I'm actually somewhat confused as to why KSkipNGrams(min_n=1, max_n=1, max_k=k) returns the input as is (with an extra list container) for any k. I would have expected skip grams to only make sense for max_n >=2 and raise an error otherwise, or am I missing something?

Because k can equal 0, which doesn't require max_n >= 2.

cargo clippy

I went through the cargo clippy suggestions, which were very useful. Thanks for bringing my attention to that tool. It might be worth adding it to the CI toolchain. There were a few suggestions for other files in the package but I left those alone.

Otherwise we can merge, unless you wanted to add other things?

Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?

Also, do we want to include the convenience methods that are included in NLTK in Python and Rust? (e.g. bigram, everygram ...)

benchmark

I have created some benchmarks in Python and the vtext functions are currently considerably slower than the nltk equivalents. Not sure why this is the case at this point. Maybe it's to do with the conversion of strings from Rust to Python. This was the output:

```
# Testing 19924 documents
vtext: everygram: 133.95s [0.7 MB/s, 54 kWPS]
nltk: everygram: 5.06s [18.0 MB/s, 1419 kWPS]
vtext: skipgram: 498.58s [0.2 MB/s, 14 kWPS]
nltk: skipgram: 18.01s [5.1 MB/s, 399 kWPS]
```

@rth (Owner) commented Jul 22, 2020

This was the output:

Are you sure you ran setup.py install and not develop, so that you're in release mode?

@rth (Owner) commented Jul 22, 2020

Do we want to change CharacterTokenizer so it uses this n-gram interface as the backend? I also noticed it uses char_indices to split the string into characters, would it be better to use grapheme_indices from the unicode segmentation package?

Good point. Let's keep it for now and consider that in a follow up PR as this one is already getting quite large.

Also, do we want to include the convenience methods that are included in NLTK in python and rust? (e.g. bigram, everygram ...)

Well, NLTK doesn't really have a very homogeneous and consistent API across modules, as far as I can tell, so I would be careful not to adapt it too closely. But we certainly need to think about how to make all these parts work smoothly together.

vtext functions are currently considerably slower than the nltk equivalents.

Hmm, yes I can confirm with setup.py install. We would need to profile (e.g. with flamegraph) to understand what's going on. I would say it shouldn't be the Rust->Python conversions, since we are doing that for character ngrams in CharacterTokenizer and that's working fine in Python. I can look into profiling as well in the next few days.

Another unrelated thing I was wondering about: wouldn't it be faster and easier to map tokens to an index at the beginning, pass those usize indices around, and finally recover the list of n-gram string tokens at the end? Technically &str should be equivalent I guess, but we would have fewer issues with lifetimes and potentially with the cost of making copies.
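
A small sketch of that interning idea (made-up names, independent of the PR's code): map each token to a usize once, run the n-gram logic on indices, and only look the strings up again at the end.

```
use std::collections::HashMap;

fn main() {
    let tokens = ["One", "Two", "One", "Three"];

    // Intern: token -> index, plus the reverse table for recovery at the end.
    let mut vocab: HashMap<&str, usize> = HashMap::new();
    let mut rev: Vec<&str> = Vec::new();
    let mut ids: Vec<usize> = Vec::with_capacity(tokens.len());
    for &t in &tokens {
        let id = *vocab.entry(t).or_insert_with(|| {
            rev.push(t);
            rev.len() - 1
        });
        ids.push(id);
    }

    // Any n-gram logic can now pass plain usize values around (no lifetimes)...
    let bigram_ids: Vec<Vec<usize>> = ids.windows(2).map(|w| w.to_vec()).collect();

    // ...and the string tokens are only recovered when producing the final output.
    let bigrams: Vec<Vec<&str>> = bigram_ids
        .iter()
        .map(|g| g.iter().map(|&i| rev[i]).collect())
        .collect();
    println!("{:?}", bigrams);
}
```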
