Feature request

run_ner.py in examples requires data of pre-tokenized words, like the conll2003 dataset, where texts have already been split into words. However, TokenClassificationPipeline performs no such pre-tokenization; it simply aggregates pre-entities by marking is_subword based on a whitespace separator. This might work reasonably for English, but it fails when it is inconsistent with the pre-tokenization used in the training data (which yields multiple B- labels across the sub-tokens of a single pre-token). My proposed solution: in TokenClassificationPipeline, use the pre-tokenizer to mark is_subword for all non-first sub-tokens so that they are aggregated in gather_pre_entities.
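A minimal sketch of what I have in mind, assuming a fast tokenizer so that word_ids() already reflects the pre-tokenizer. The mark_subwords helper and the xlm-roberta-base checkpoint are only for illustration; this is not the current pipeline code:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any fast tokenizer would do.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def mark_subwords(text):
    """Flag every non-first sub-token of a pre-token as a sub-word."""
    encoding = tokenizer(text, add_special_tokens=True)
    word_ids = encoding.word_ids()  # one word index per token, None for special tokens
    is_subword = []
    previous_word_id = None
    for word_id in word_ids:
        # A token is a sub-word if it belongs to the same pre-token as the token before it.
        is_subword.append(word_id is not None and word_id == previous_word_id)
        previous_word_id = word_id
    return encoding.tokens(), is_subword

for token, flag in zip(*mark_subwords("Pre-tokenization matters for NER")):
    print(token, flag)
```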
run_ner.py should also support chunking during training. For example, XLM-Roberta-Base supports a maximum sequence length of 512, yet for large models memory constraints may force us to limit the sequence length during training. In this case, it would be better to split long training examples into chunks (with some overlap) so that all of the information is used, instead of truncating.
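For the chunking part, the existing stride / return_overflowing_tokens options of the fast tokenizers could probably be reused. A rough sketch, with arbitrary max_length and stride values and a hypothetical chunk_example helper:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def chunk_example(words, max_length=128, stride=32):
    """Tokenize one pre-tokenized example into overlapping chunks."""
    encoding = tokenizer(
        words,
        is_split_into_words=True,      # matches the pre-tokenized input run_ner.py expects
        truncation=True,
        max_length=max_length,         # per-chunk length, e.g. chosen to fit memory
        stride=stride,                 # number of overlapping tokens between chunks
        return_overflowing_tokens=True,
    )
    # Each chunk keeps its own word_ids, so word-level labels can be re-aligned per chunk.
    return [encoding.word_ids(i) for i in range(len(encoding["input_ids"]))]

chunks = chunk_example(["token"] * 300)
print(f"{len(chunks)} chunks")
```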
According to the label_all_tokens argument in run_ner.py, we currently have only two options for sub-token labelling: first and all. It might be better to also provide a last option to better support decoder-only models; see Fine Tuned GPT2 model performs very poorly on token classification task #15389 (comment). A sketch of such a strategy is below.
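A hypothetical sketch of a last strategy, following the word_ids-based alignment pattern used in run_ner.py; the align_labels function and the label_strategy parameter are illustrative names, not existing options:

```python
def align_labels(word_ids, word_labels, label_strategy="last"):
    """Map word-level labels onto sub-tokens; -100 marks positions ignored by the loss."""
    aligned = []
    for position, word_id in enumerate(word_ids):
        if word_id is None:
            aligned.append(-100)  # special tokens
        elif label_strategy == "first":
            is_first = position == 0 or word_ids[position - 1] != word_id
            aligned.append(word_labels[word_id] if is_first else -100)
        elif label_strategy == "last":
            is_last = position == len(word_ids) - 1 or word_ids[position + 1] != word_id
            aligned.append(word_labels[word_id] if is_last else -100)
        else:  # "all": every sub-token of a word gets the word's label
            aligned.append(word_labels[word_id])
    return aligned

# Example: three words, the second of which was split into two sub-tokens.
print(align_labels([None, 0, 1, 1, 2, None], [1, 2, 3], "last"))
# -> [-100, 1, -100, 2, 3, -100]
```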
Motivation

Included in the Feature request section.
Your contribution
I have already implemented parts of this in private, confidential code at work. I would be glad to help if the repo maintainers think it is worth doing.