LlamaTokenizer has no pad, bos or eos token, leading to failure during batch-tokenization #22312
I don't see padding token set anywhere:
A bunch of LLaMa libraries seem to be setting the IDs from the sentencepiece model. For me, running the following yields:
...which makes me believe the original tokenizer does not have a pad token? This is confirmed by the following:
Additional confirmation:
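A minimal check along these lines (not the original snippet; the checkpoint name follows the decapoda-research/llama-7b-hf used later in this thread) shows the missing pad token:

>>> from transformers import LlamaTokenizer
>>> tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
>>> print(tokenizer.pad_token)     # no pad token is defined on the tokenizer
None
>>> print(tokenizer.pad_token_id)
None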
|
Hey, indeed the original sentencepiece model does not have a padding token. You can probably pad using the |
Yes, I don't think the original model has a padding token. The same code with GPT-2 will fail, you need to add the pad token yourself as indicated by the error message. |
So attempting to set the PAD token as the EOS token (i.e.
Error:
|
Can you share a link on how GPT2 does it? |
I can confirm that the following works:
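A fix commonly used at the time looked roughly like the following (not the original snippet; the "[PAD]" string and the checkpoint name are illustrative): register a dedicated pad token, then resize the model embeddings to match.

from transformers import LlamaTokenizer, LlamaForCausalLM

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
model = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf")

# Register a dedicated pad token and grow the embedding matrix to match
tokenizer.add_special_tokens({"pad_token": "[PAD]"})
model.resize_token_embeddings(len(tokenizer))

# Batched tokenization with padding now succeeds
batch = tokenizer(["short", "a somewhat longer sentence"], padding=True, return_tensors="pt")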
|
Glad that it's now working. As an explanation: the error arising when using
In the working example, I think the second print of the pad token should show: |
Note that the EOS token returned by |
There is also a weird issue of an increase in vocab size depending on how we add the pad token.
Method 1:
Method 2:
Why is there this discrepancy between the two methods? Downstream issues: |
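Judging from the REPL output quoted further down in this thread, the two methods are most likely a plain attribute assignment versus add_special_tokens; a sketch under that assumption (exact behaviour can differ across transformers versions):

from transformers import LlamaTokenizer

# Method 1 (assumed): plain attribute assignment -- no new piece is added
tok1 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok1.pad_token = "[PAD]"
print(len(tok1))            # 32000 -- vocabulary unchanged
print(tok1.pad_token_id)    # 0, i.e. it silently resolves to the unk token id

# Method 2 (assumed): register a brand-new [PAD] piece
tok2 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok2.add_special_tokens({"pad_token": "[PAD]"})
print(len(tok2))            # 32001 -- the vocabulary grew by one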
I think #22402 should fix this? |
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored. |
I am sorry if it is a wrong question, but don't we need a padding token to train a model with bs > 1, or are they concatenating sentences together, separated by an eos token, while training? |
My general understanding is that for bs > 1, we need to pad during fine-tuning. However, in pretraining the input text is set to max length -- you can think of a sliding window over a large text corpus. |
Exactly! This was fixed in #22402 so keeping it closed! |
Seems as if this discrepancy is done intentionally. With
Output
What's the reasoning behind the distinction of the two methods? |
Hey @christoukmaji , this kind of question should be asked on the forum. |
I think that |
So what is the difference between the two and what would be the appropriate practice between the two? |
Method 1 does not really work if you want to have a different token for padding and
>>> from transformers import LlamaTokenizer, LlamaForCausalLM
>>> tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
>>> tokenizer.pad_token = '[PAD]'
>>> tokenizer.pad_token
'[PAD]'
>>> tokenizer.pad_token_id
0
>>> tokenizer.unk_token_id
0
The pad token was not |
the solution suggested here doesn't work afaik if the model doesn't have that token, right? |
Given the recent release of Llama 2, and in light of the fact that resizing from 32K to 32K+1 can make inference and training slower, will support padding_index=-1. I'll be working on this soon! |
Curious: what does padding_index=-1 mean, and how does it solve the problem?
|
If you set the padding index of the token embedding layer to -1, you don't need to change the size of the vocab, either for the model or for the tokenizer. The embedding layer will output zeros when it sees the padding token, as it is supposed to and as it is implemented in the original Llama codebase! |
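For illustration only -- this is plain PyTorch, not the code added to transformers, and the exact index the library ends up wiring in is not shown here -- an embedding layer built with a padding_idx returns an all-zero vector for that index and excludes it from gradient updates, so an existing id can act as padding without resizing anything:

import torch
import torch.nn as nn

vocab_size, hidden = 32000, 16
# Reuse an existing id (0, Llama's unk) as the padding index; no extra row needed
emb = nn.Embedding(vocab_size, hidden, padding_idx=0)

ids = torch.tensor([[5, 7, 0, 0]])     # the last two positions are padding
out = emb(ids)
print(out[0, 2].abs().sum())           # tensor(0.) -- the padding row is all zeros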
If you want to follow the advances: #25088 |
@ArthurZucker is the padding problem solved? How do we have to set the pad token? |
Hey! The PR is not merged yet, should be by the end of the week! |
Great, thank you |
@ArthurZucker looks like it's merged now — thanks for fixing this! The PR seems to add |
Yes! The idea is that depending on your hardware, you should choose a |
I guess what's unclear is how
I thought the problem here was that we can't add a padding token without going to 32K+1, and using an existing token such as |
The whole reason for having
|
Hi, just discovered this today and I would like to ask how I should set my
FYI, I'm currently using a single RTX 4090.
|
Not really, as |
@ArthurZucker I still don't understand why setting |
Well, that is because sometimes you need the |
@ArthurZucker |
Yes. But when you set the padding token to be eos, the attention mask will still be 1 for the eos token and 0 for the padding tokens. It's like somehow the tokenizer knows the padding tokens and ignores them regardless of what value they are set to. |
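A small illustration of that point (the checkpoint name is only an example, not taken from the comment above): the attention mask is derived from which positions the padding step appended, not from token ids, so padded positions get 0 even though they carry the same id as eos.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token      # reuse eos as the padding token

batch = tokenizer(["hi", "a much longer input sentence"], padding=True)
print(batch["input_ids"])        # the shorter row is padded with the eos id
print(batch["attention_mask"])   # padded positions are 0, real tokens are 1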
Can anyone summarise the best practice so far for batched inference (like for a batched chat completion)? 1) How to set the padding token? Whether to set it to a new |
System Info
main branch, installed via pip install git+https://github.com/huggingface/transformers on 22nd March 2023

Who can help?
@ArthurZucker @sgugger @zphang

Information
Tasks
examples folder (such as GLUE/SQuAD, ...)

Reproduction
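A minimal batched-tokenization call of the kind described (not the original snippet) would look roughly like this:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
# Padding a batch without any pad token defined raises a ValueError along the
# lines of "Asking to pad but the tokenizer does not have a padding token"
batch = tokenizer(["first sentence", "a second, longer sentence"], padding=True)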
The above statement raises an issue:
Expected behavior
The following code should work:
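Presumably something along the lines of the same batched call succeeding out of the box (again, not the original snippet):

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
batch = tokenizer(
    ["first sentence", "a second, longer sentence"],
    padding=True,
    return_tensors="pt",
)
# Expected: padded input_ids plus an attention_mask, with no error raised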