NER workflow improvement #33579

Open
ain-soph opened this issue Sep 18, 2024 · 0 comments
Labels
Feature request Request for a new feature

Feature request

  1. run_ner.py in examples requires pre-tokenized, word-level data such as the conll2003 dataset, where texts have already been split into words. However, the TokenClassificationPipeline performs no such pre-tokenization and simply aggregates pre-entities by deciding is_subword from whitespace separators.
    This may work reasonably well for English, but it fails whenever it is inconsistent with the pre-tokenization used in the training data (producing multiple B- labels across the sub-tokens of a single pre-token).
    My solution: in the TokenClassificationPipeline, use the pre-tokenizer to mark is_subword for every non-first sub-token so that they are aggregated in gather_pre_entities (see the first sketch after this list).
  2. run_ner.py should support chunking during training. For example, XLM-RoBERTa-Base has a maximum sequence length of 512, and for large models we may have to limit the sequence length during training because of memory constraints. In that case it would be better to split long training examples into overlapping chunks so that all of the text is used, instead of truncating it (see the second sketch after this list).
  3. The label_all_tokens argument in run_ner.py currently gives us 2 options for sub-token labelling: first and all. It would be better to also provide a last option to better support decoder-only models (see the third sketch after this list). See Fine Tuned GPT2 model performs very poorly on token classification task #15389 (comment)
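
A rough sketch of item 1, assuming a fast tokenizer so that word_ids() is available. mark_subwords is a hypothetical helper (not the actual pipeline code) that only illustrates deriving is_subword from the tokenizer's own pre-tokenization instead of from whitespace:

```python
from transformers import AutoTokenizer

def mark_subwords(text: str, tokenizer) -> list[dict]:
    # Hypothetical helper: derive is_subword from the fast tokenizer's pre-tokenization
    # (word_ids) rather than from whitespace heuristics.
    encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
    word_ids = encoding.word_ids()  # one word id per sub-token

    pre_entities = []
    previous_word_id = None
    for token_index, word_id in enumerate(word_ids):
        start, end = encoding["offset_mapping"][token_index]
        pre_entities.append(
            {
                "word": text[start:end],
                "start": start,
                "end": end,
                "index": token_index,
                # A token is a subword iff it continues the same pre-token as its predecessor.
                "is_subword": word_id is not None and word_id == previous_word_id,
            }
        )
        previous_word_id = word_id
    return pre_entities

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
for entity in mark_subwords("Unbelievable results in Ljubljana", tokenizer):
    print(entity["word"], entity["is_subword"])
```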
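
For item 2, a minimal sketch of overlapping chunks with a fast tokenizer, using return_overflowing_tokens and stride; the word/label lists and the chunk sizes below are illustrative only:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative conll2003-style example, repeated so it exceeds max_length.
words = ["EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", "."] * 100
labels = [3, 0, 7, 0, 0, 0, 7, 0, 0] * 100

encoded = tokenizer(
    words,
    is_split_into_words=True,
    truncation=True,
    max_length=128,                  # instead of dropping everything past max_length...
    stride=32,                       # ...keep a 32-token overlap between consecutive chunks
    return_overflowing_tokens=True,
)

# Each chunk keeps its own word_ids, so word-level labels can be re-aligned per chunk.
for chunk_index in range(len(encoded["input_ids"])):
    word_ids = encoded.word_ids(batch_index=chunk_index)
    chunk_labels = [-100 if word_id is None else labels[word_id] for word_id in word_ids]
    print(chunk_index, len(word_ids), chunk_labels[:5])
```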
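
For item 3, a hypothetical align_labels helper sketching the three-way first/all/last choice; "last" labels the final sub-token of each word, which suits decoder-only models where the last position has seen the whole word. (The existing all handling in run_ner.py also maps B- to I- for trailing sub-tokens, which is omitted here for brevity.)

```python
def align_labels(word_ids, word_labels, strategy="first"):
    # word_ids: per-sub-token word index (None for special tokens), as from word_ids().
    # word_labels: one label id per word. Positions labelled -100 are ignored by the loss.
    aligned = []
    for position, word_id in enumerate(word_ids):
        if word_id is None:
            aligned.append(-100)
            continue
        is_first = position == 0 or word_ids[position - 1] != word_id
        is_last = position == len(word_ids) - 1 or word_ids[position + 1] != word_id
        if strategy == "all":
            aligned.append(word_labels[word_id])
        elif strategy == "first":
            aligned.append(word_labels[word_id] if is_first else -100)
        elif strategy == "last":
            aligned.append(word_labels[word_id] if is_last else -100)
        else:
            raise ValueError(f"unknown strategy: {strategy}")
    return aligned
```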

Motivation

Included in the Feature request section above.

Your contribution

I have already implemented parts of this in private, confidential code at work. I would be happy to help if the repo maintainers think it is worth doing.
