
Difference between slow and fast GPT2 tokenizers #1363

Closed · goerch opened this issue Oct 10, 2023 · 9 comments

goerch commented Oct 10, 2023

Please see this comment by @jploski.

goerch changed the title from "Difference between slow and fast HF tokenizers" to "Difference between slow and fast HF GPT2 tokenizers" on Oct 10, 2023
goerch changed the title from "Difference between slow and fast HF GPT2 tokenizers" to "Difference between slow and fast GPT2 tokenizers" on Oct 10, 2023
@ArthurZucker (Collaborator)

Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

jploski commented Oct 10, 2023

> Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

https://github.com/jploski/llama.cpp/tree/hf-issue-1363/tests/hf-issue-1363

(env output included in README.md)
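
A minimal sketch of this kind of slow-vs-fast comparison (not the linked reproducer itself; the model id and sample text here are placeholders):

```python
# Hedged sketch: compare slow vs. fast tokenizer output for the same checkpoint.
# "gpt2" and the sample text are placeholders, not the linked reproducer's inputs.
from transformers import AutoTokenizer

model_id = "gpt2"
text = "Hello world!\n\t indented line with  double spaces"

slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)

slow_ids = slow.encode(text)
fast_ids = fast.encode(text)

print("slow:", slow_ids)
print("fast:", fast_ids)
print("identical:", slow_ids == fast_ids)
```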

goerch commented Oct 10, 2023

This comment explains the problem. Not sure if I should close this issue then?

@ArthurZucker (Collaborator)

Probably yes! Unless the changes need to be propagated to the slow tokenizer?

goerch commented Oct 10, 2023

Thanks!

goerch closed this as completed on Oct 10, 2023

cebtenzzre commented Oct 10, 2023

Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)

goerch commented Oct 10, 2023

> If only the fast tokenizer is correct, this should be documented somewhere.

I have been thinking about that too, but I accepted that the fast GPT2 tokenizer offers more features than the original one and that Falcon used them (unfortunately for us). The remark about the documentation is correct (and I would certainly like to ask a lot more questions about HF's serialization formats ;).
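
(Not from the thread, but as a hedged illustration of those "more features": a fast tokenizer exposes a backend tokenizers.Tokenizer whose pipeline components (normalizer, pre-tokenizer, post-processor, decoder) have no direct counterpart in the slow Python implementation. "gpt2" below is just a placeholder model id.)

```python
# Hedged sketch: list the pipeline components carried by a fast tokenizer's
# backend. "gpt2" is a placeholder; any *TokenizerFast exposes the same backend.
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("gpt2")
backend = fast.backend_tokenizer  # the underlying `tokenizers.Tokenizer`

for name in ("normalizer", "pre_tokenizer", "post_processor", "decoder"):
    component = getattr(backend, name)
    print(name, "->", type(component).__name__ if component is not None else None)
```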

jploski commented Oct 10, 2023

> Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)

I agree, these differences in functionality are confusing and deserve at least a mention in the documentation (https://huggingface.co/docs/transformers/main_classes/tokenizer). I note that for the *TokenizerFast classes, even the tokenizer_file parameter is currently entirely undocumented.

@ArthurZucker (Collaborator)

We are trying to get the same results for fast and slow tokenizers, but in this specific case Falcon used GPT2TokenizerFast with a custom template processor. We could add a FalconTokenizer in transformers to make the two equivalent, but I believe the best we can do for now is:

  • add support for template processing in transformers (using jinja?)
  • update the documentation where you feel it's needed.

Would any of you like to open a PR to update the documentation as you see fit? You can ping me for review 🤗
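
For illustration only (this is not Falcon's actual configuration), a rough sketch of what a custom template processor on a fast tokenizer can look like; the special token name and templates below are invented:

```python
# Hedged sketch of a fast-tokenizer post-processor built with TemplateProcessing.
# The ">>SPECIAL<<" token and the templates are made up for illustration.
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tok = Tokenizer.from_pretrained("gpt2")
tok.add_special_tokens([">>SPECIAL<<"])
special_id = tok.token_to_id(">>SPECIAL<<")

# Prepend the special token to single inputs and to sentence pairs.
tok.post_processor = TemplateProcessing(
    single=">>SPECIAL<< $A",
    pair=">>SPECIAL<< $A $B",
    special_tokens=[(">>SPECIAL<<", special_id)],
)

print(tok.encode("hello world").tokens)  # first token is '>>SPECIAL<<'
```

This is the kind of fast-only behavior that the slow, pure-Python GPT2 tokenizer cannot express without dedicated code, which is why the two can disagree.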
