
feat: Updated EntityExtractor to handle long texts and added better postprocessing #3154

Merged — 91 commits into main from issue-2969, Oct 17, 2022

Conversation


@sjrl sjrl commented Sep 5, 2022

Related Issues

Proposed Changes:

This PR updates a number of aspects to the EntityExtractor node in haystack.

  • Updated the default NER model to one of similar size but with better metrics: https://huggingface.co/elastic/distilbert-base-cased-finetuned-conll03-english
  • Added the aggregation_strategy option (default: first), which mitigates some of the issues identified in issue #1706 (EntityExtractor can't deal well with out-of-vocabulary words); more explanation is provided in that issue.
  • Moved away from using the TokenClassificationPipeline provided by Hugging Face. This resulted in the largest changes, because the HF pipeline silently truncated all document texts passed to it. Instead, taking inspiration from the Reader node, I added functionality to split long texts and feed each split into the model individually (or in batches). Afterward, I recombine the splits, grouped by the document they originally came from.
  • Added the option flatten_entities_in_meta_data so the entities can be stored in the metadata in a manner that can be used by the OpenSearch document store.
  • Added a test for using the EntityExtractor in an indexing pipeline.
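The split-and-recombine approach described above can be illustrated with a minimal pure-Python sketch. The names split_text and recombine are illustrative only (not Haystack's actual API), and a real implementation splits on tokens rather than characters:

```python
# Minimal sketch of splitting long texts and recombining predictions.
# Function names are illustrative; the real node splits on tokens, not
# characters, and runs the model over the splits in batches.

def split_text(text: str, max_chars: int):
    """Split a document into chunks, remembering each chunk's start offset."""
    return [(start, text[start:start + max_chars])
            for start in range(0, len(text), max_chars)]

def recombine(chunks, predictions_per_chunk):
    """Shift entity spans from chunk-local back to document-level offsets."""
    entities = []
    for (offset, _), preds in zip(chunks, predictions_per_chunk):
        for ent in preds:
            entities.append({**ent,
                             "start": ent["start"] + offset,
                             "end": ent["end"] + offset})
    return entities

text = "Alice met Bob in Berlin."
chunks = split_text(text, max_chars=10)  # three chunks
# Suppose the model tagged "Bob" at local span (0, 3) in the second chunk:
entities = recombine(chunks, [[], [{"word": "Bob", "start": 0, "end": 3}], []])
```

The recombined span (10, 13) points at "Bob" in the original document, which is the invariant the post-processing has to preserve.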

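To illustrate what flattening entities for a flat metadata schema might look like (the key names below are assumptions for this sketch, not the node's actual field names), the idea is parallel primitive-typed lists instead of a list of nested dicts, which document stores like OpenSearch can index directly:

```python
# Hypothetical sketch of flatten_entities_in_meta_data. Key names such as
# "entity_words" are illustrative assumptions, not Haystack's actual fields.

def flatten_entities(entities):
    flat = {"entity_words": [], "entity_labels": [],
            "entity_starts": [], "entity_ends": []}
    for ent in entities:
        flat["entity_words"].append(ent["word"])
        flat["entity_labels"].append(ent["entity_group"])
        flat["entity_starts"].append(ent["start"])
        flat["entity_ends"].append(ent["end"])
    return flat

meta = flatten_entities([
    {"word": "Bob", "entity_group": "PER", "start": 10, "end": 13},
    {"word": "Berlin", "entity_group": "LOC", "start": 17, "end": 23},
])
```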
Update:

  • Added a new option pre_split_text so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines but is not supported out of the box by Hugging Face. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words: we determine word boundaries using the self.tokenizer.word_ids method and update the character spans of the detected entities to correctly map back to the original (unsplit) text.
  • Added a new optional parameter ignore_labels to let users specify which labels they would like to ignore.
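The offset bookkeeping for pre-split text can be emulated in pure Python. This sketch only mirrors the idea; the PR itself derives word boundaries from the fast tokenizer's word_ids() rather than a regex:

```python
# Illustrative sketch of remapping word-level predictions back to the
# original, unsplit text. In the PR this relies on word_ids() from the
# fast tokenizer; here we emulate only the offset bookkeeping.
import re

def word_offsets(text: str):
    """Whitespace-split words with their (start, end) spans in `text`."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\S+", text)]

def entity_span(text: str, first_word: int, last_word: int):
    """Character span in the original text covering the given word range."""
    words = word_offsets(text)
    return words[first_word][1], words[last_word][2]

# Runs of whitespace are collapsed by pre-splitting, so the spans must be
# computed against the original string, not the joined word list:
text = "Jean-Baptiste   lives in  Paris"
start, end = entity_span(text, 3, 3)  # the word "Paris"
```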

How did you test it?

  • I made sure that the existing tests for the EntityExtractor node still pass.
  • I also added new tests covering the new functionality.

Notes for the reviewer

  • Need to test EntityExtractor node in an indexing pipeline
  • Add test when using pre_split_text option
  • Add a test to check that the original text for unknown tokens is found

Checklist

sjrl added 30 commits August 25, 2022 16:03
…ndle documents with lengths longer than model_max_length
…by splitting up the document into smaller samples similar to how the reader works
…ist of docs where each doc can be longer than the model max length. Does not work with batch size larger than 1 right now.
@sjrl sjrl marked this pull request as ready for review October 7, 2022 11:17

sjrl commented Oct 7, 2022

Hey, @vblagoje I would appreciate a review of the new code since your last review. I wrote a description of the new changes under the Update: header in the PR description (also reproduced below).

Update:

  • Added a new option pre_split_text so the user can optionally split all text by whitespace before passing it to the token classification model. This is common practice for NER pipelines but is not supported out of the box by Hugging Face. As a result, more functionality was added to handle the post-processing of the model predictions when the text is pre-split into words: we determine word boundaries using the self.tokenizer.word_ids method and update the character spans of the detected entities to correctly map back to the original (unsplit) text.
  • Added a new optional parameter ignore_labels to let users specify which labels they would like to ignore.


sjrl commented Oct 12, 2022

When running the NER node (and the unit tests) this warning from HF appears

Process finished with exit code 0
PASSED [100%]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Extracting entities: 100%|██████████| 3/3 [00:00<00:00, 35.96it/s]

This stems from HF warning that we call the fast tokenizer before creating a PyTorch DataLoader.
According to the second answer in this Stack Overflow post, this should be fine as long as the fast tokenizer is not called from within the DataLoader (excerpt from the answer):

Alternatively convert your data to tokens beforehand and store them in a dict. Then your dataset should not use the tokenizer at all but during runtime simply calls the dict(key) where key is the index. This way you avoid conflict. The warning still comes but you simply dont use tokeniser during training any more (note for such scenarios to save space, avoid padding during tokenise and add later with collate_fn)

As recommended, we store the tokens in a dict and pass that to the DataLoader.
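The pattern can be sketched without any HF dependency. Here fake_tokenize stands in for a fast-tokenizer call; a real implementation would store the tokenizer's encodings and hand the dataset to a torch DataLoader:

```python
# Sketch of the workaround: tokenize everything up front so that
# __getitem__ never touches the (fork-sensitive) fast tokenizer.
# fake_tokenize is a stand-in for a Hugging Face fast-tokenizer call.

def fake_tokenize(text: str):
    return [len(tok) for tok in text.split()]

class PreTokenizedDataset:
    def __init__(self, texts):
        # All tokenizer work happens here, before any DataLoader forks.
        self.encodings = {i: fake_tokenize(t) for i, t in enumerate(texts)}

    def __len__(self):
        return len(self.encodings)

    def __getitem__(self, idx):
        # Runtime access is a plain dict lookup -- no tokenizer involved,
        # so forked worker processes never trigger the parallelism warning.
        return self.encodings[idx]

dataset = PreTokenizedDataset(["Alice met Bob", "Paris"])
```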

@vblagoje
Member

@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, to capture the driving force behind this PR? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.


sjrl commented Oct 12, 2022

@sjrl Why not add a unit test for a document whose length is larger than the tokenizer's max_seq_len, to capture the driving force behind this PR? Not sure if we already have such a doc somewhere in the unit tests, but if not we can provide one in the unit test itself.

Ahh, I did do this by setting max_seq_len=6 in a few of the unit tests. So instead of a long doc, I just shortened the amount of text the model can ingest at each step.
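The trick can be illustrated with a toy test: rather than crafting a very long document, shrink max_seq_len so even a short text must be split. split_into_windows is a stand-in for the node's internal splitting logic, not Haystack's actual API:

```python
# Toy illustration of the testing trick: a tiny max_seq_len forces the
# splitting code path even for a short document. split_into_windows is an
# illustrative stand-in for the node's internal splitting.

def split_into_windows(tokens, max_seq_len):
    return [tokens[i:i + max_seq_len] for i in range(0, len(tokens), max_seq_len)]

def test_short_max_seq_len_forces_splitting():
    tokens = "My name is Clara and I live in Berkeley California".split()
    windows = split_into_windows(tokens, max_seq_len=6)
    assert len(windows) == 2           # splitting kicked in
    assert sum(windows, []) == tokens  # nothing is lost when recombining
```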


sjrl commented Oct 12, 2022

  • Added a unit test using Jean-Baptiste/camembert-ner to verify that the "first" aggregation strategy now works for it (it did not before).
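As a rough illustration of the "first" strategy: all subword pieces of a word inherit the label predicted for the word's first piece. The tuples below are simplified stand-ins, not the HF pipeline's actual output format:

```python
# Simplified sketch of the "first" aggregation strategy: every subword of
# a word takes the label of the word's first subword. The (word_id, piece,
# label) tuples are stand-ins for real token-classification output.

def aggregate_first(tokens):
    words = {}
    for word_id, piece, label in tokens:
        if word_id not in words:
            words[word_id] = {"word": piece, "entity": label}  # first piece wins
        else:
            words[word_id]["word"] += piece.lstrip("#")  # merge WordPiece pieces
    return list(words.values())

# "Johannes" is split into two pieces; only the first piece's label counts:
tokens = [(0, "Joha", "B-PER"), (0, "##nnes", "I-PER"), (1, "sings", "O")]
entities = aggregate_first(tokens)
```

This avoids the contradictory per-subword labels that out-of-vocabulary words can produce when no aggregation is applied.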


@vblagoje vblagoje left a comment


LGTM

@sjrl sjrl merged commit 15a59fd into main Oct 17, 2022
@sjrl sjrl deleted the issue-2969 branch October 17, 2022 19:26