
Difference between slow and fast GPT2 tokenizers #1363

Closed · goerch opened this issue Oct 10, 2023 · 9 comments

goerch commented Oct 10, 2023

Please see this comment by @jploski.

goerch changed the title from "Difference between slow and fast HF tokenizers" to "Difference between slow and fast HF GPT2 tokenizers" on Oct 10, 2023
goerch changed the title from "Difference between slow and fast HF GPT2 tokenizers" to "Difference between slow and fast GPT2 tokenizers" on Oct 10, 2023
@ArthurZucker (Collaborator)

Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

jploski commented Oct 10, 2023

> Hey! Could you share a full reproducer? Would help me a lot to have the output of transformers-cli env as well. 🤗

https://github.com/jploski/llama.cpp/tree/hf-issue-1363/tests/hf-issue-1363

(env output included in README.md)
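
A minimal sketch of this kind of slow-vs-fast comparison (not the linked reproducer itself; the model id and sample text here are placeholders):

```python
# Hedged sketch: compare slow vs. fast tokenizer output for the same checkpoint.
# "gpt2" and the sample text are placeholders, not the linked reproducer's inputs.
from transformers import AutoTokenizer

model_id = "gpt2"
text = "Hello world!\n\t indented line with  double spaces"

slow = AutoTokenizer.from_pretrained(model_id, use_fast=False)
fast = AutoTokenizer.from_pretrained(model_id, use_fast=True)

slow_ids = slow.encode(text)
fast_ids = fast.encode(text)

print("slow:", slow_ids)
print("fast:", fast_ids)
print("identical:", slow_ids == fast_ids)
```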

goerch commented Oct 10, 2023

This comment explains the problem. Not sure if I should close this issue then?

@ArthurZucker (Collaborator)

Probably yes! Unless the changes need to be propagated to the slow tokenizer?

goerch commented Oct 10, 2023

Thanks!

goerch closed this as completed on Oct 10, 2023

cebtenzzre commented Oct 10, 2023

Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)

goerch commented Oct 10, 2023

> If only the fast tokenizer is correct, this should be documented somewhere.

I have been thinking about that too, but I accepted that the fast GPT2 tokenizer offers more features than the original one and that Falcon used them (unfortunately for us). The remark about the documentation is correct (and I would certainly like to ask a lot more questions about HF's serialization formats ;).
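
(Not from the thread, but as a hedged illustration of those "more features": a fast tokenizer exposes a backend tokenizers.Tokenizer whose pipeline components (normalizer, pre-tokenizer, post-processor, decoder) have no direct counterpart in the slow Python implementation. "gpt2" below is just a placeholder model id.)

```python
# Hedged sketch: list the pipeline components carried by a fast tokenizer's
# backend. "gpt2" is a placeholder; any *TokenizerFast exposes the same backend.
from transformers import AutoTokenizer

fast = AutoTokenizer.from_pretrained("gpt2")
backend = fast.backend_tokenizer  # the underlying `tokenizers.Tokenizer`

for name in ("normalizer", "pre_tokenizer", "post_processor", "decoder"):
    component = getattr(backend, name)
    print(name, "->", type(component).__name__ if component is not None else None)
```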

jploski commented Oct 10, 2023

> Hm, as a transformers user I would normally assume that the slow and fast tokenizers are both correct. If only the fast tokenizer is correct, this should be documented somewhere. (Maybe it is, and I'm just not aware.)

I agree, these differences in functionality are confusing and deserve at least a mention in the documentation (https://huggingface.co/docs/transformers/main_classes/tokenizer). I note that for the *TokenizerFast classes, even the tokenizer_file parameter is currently entirely undocumented.

@ArthurZucker (Collaborator)

We are trying to get the same results for fast and slow tokenizers, but in this specific case Falcon used GPT2TokenizerFast with a custom template processor. We could add a FalconTokenizer in transformers to make the two equivalent, but I believe the best we can do for now is:

  • add support for template processing in transformers (using jinja?)
  • update the documentation where you feel it's needed.

Would any of you like to open a PR to update the documentation as you see fit? You can ping me for review 🤗
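
For illustration only (this is not Falcon's actual configuration), a rough sketch of what a custom template processor on a fast tokenizer can look like; the special token name and templates below are invented:

```python
# Hedged sketch of a fast-tokenizer post-processor built with TemplateProcessing.
# The ">>SPECIAL<<" token and the templates are made up for illustration.
from tokenizers import Tokenizer
from tokenizers.processors import TemplateProcessing

tok = Tokenizer.from_pretrained("gpt2")
tok.add_special_tokens([">>SPECIAL<<"])
special_id = tok.token_to_id(">>SPECIAL<<")

# Prepend the special token to single inputs and to sentence pairs.
tok.post_processor = TemplateProcessing(
    single=">>SPECIAL<< $A",
    pair=">>SPECIAL<< $A $B",
    special_tokens=[(">>SPECIAL<<", special_id)],
)

print(tok.encode("hello world").tokens)  # first token is '>>SPECIAL<<'
```

This is the kind of fast-only behavior that the slow, pure-Python GPT2 tokenizer cannot express without dedicated code, which is why the two can disagree.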
