Feature request

run_ner.py in examples requires data of pre-tokenized words, like the conll2003 dataset, where texts have already been split into words. However, TokenClassificationPipeline performs no such pre-tokenization; it simply aggregates pre-entities by marking is_subword based on a whitespace separator. This might work reasonably for English, but it fails when it is inconsistent with the pre-tokenization used in the training data (which yields multiple B- labels across the sub-tokens of a single pre-token). My proposed solution: in TokenClassificationPipeline, use the pre-tokenizer to mark is_subword for all non-first sub-tokens so that they are aggregated in gather_pre_entities.
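A minimal sketch of what I have in mind, assuming a fast tokenizer so that word_ids() already reflects the pre-tokenizer. The mark_subwords helper and the xlm-roberta-base checkpoint are only for illustration; this is not the current pipeline code:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any fast tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def mark_subwords(text):
    """Flag every non-first sub-token of a pre-token as a sub-word."""
    encoding = tokenizer(text, add_special_tokens=True)
    word_ids = encoding.word_ids()  # one word index per token, None for special tokens
    is_subword = []
    previous_word_id = None
    for word_id in word_ids:
        # A token is a sub-word if it belongs to the same pre-token as the token before it.
        is_subword.append(word_id is not None and word_id == previous_word_id)
        previous_word_id = word_id
    return encoding.tokens(), is_subword

for token, flag in zip(*mark_subwords("Pre-tokenization matters for NER")):
    print(token, flag)
```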
run_ner.py should also support chunking during training. For example, XLM-Roberta-Base supports a maximum sequence length of 512, yet for large models memory constraints may force us to limit the sequence length during training. In this case, it would be better to split long training examples into chunks (with some overlap) so that all of the information is used, instead of truncating.
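For the chunking part, the existing stride / return_overflowing_tokens options of the fast tokenizers could probably be reused. A rough sketch, with arbitrary max_length and stride values and a hypothetical chunk_example helper:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def chunk_example(words, max_length=128, stride=32):
    """Tokenize one pre-tokenized example into overlapping chunks."""
    encoding = tokenizer(
        words,
        is_split_into_words=True,      # matches the pre-tokenized input run_ner.py expects
        truncation=True,
        max_length=max_length,         # per-chunk length, e.g. chosen to fit memory
        stride=stride,                 # number of overlapping tokens between chunks
        return_overflowing_tokens=True,
    )
    # Each chunk keeps its own word_ids, so word-level labels can be re-aligned per chunk.
    return [encoding.word_ids(i) for i in range(len(encoding["input_ids"]))]

chunks = chunk_example(["token"] * 300)
print(f"{len(chunks)} chunks")
```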
According to the label_all_tokens argument in run_ner.py, we currently have only two options for sub-token labelling: first and all. It might be better to also provide a last option to better support decoder-only models; see Fine Tuned GPT2 model performs very poorly on token classification task #15389 (comment). A sketch of such a strategy is below.
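A hypothetical sketch of a last strategy, following the word_ids-based alignment pattern used in run_ner.py; the align_labels function and the label_strategy parameter are illustrative names, not existing options:

```python
def align_labels(word_ids, word_labels, label_strategy="last"):
    """Map word-level labels onto sub-tokens; -100 marks positions ignored by the loss."""
    aligned = []
    for position, word_id in enumerate(word_ids):
        if word_id is None:
            aligned.append(-100)  # special tokens
        elif label_strategy == "first":
            is_first = position == 0 or word_ids[position - 1] != word_id
            aligned.append(word_labels[word_id] if is_first else -100)
        elif label_strategy == "last":
            is_last = position == len(word_ids) - 1 or word_ids[position + 1] != word_id
            aligned.append(word_labels[word_id] if is_last else -100)
        else:  # "all": every sub-token of a word gets the word's label
            aligned.append(word_labels[word_id])
    return aligned

# Example: three words, the second of which was split into two sub-tokens.
print(align_labels([None, 0, 1, 1, 2, None], [1, 2, 3], "last"))
# -> [-100, 1, -100, 2, 3, -100]
```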
Motivation

Included in the Feature request section.
Your contribution
I have already implemented parts of this in private, confidential code at work. I would be glad to help if the repo maintainers think it is worth doing.