
LlamaTokenizer has no pad token, leading to failure during batch-tokenization #22312

Closed
2 of 4 tasks
adivekar-utexas opened this issue Mar 22, 2023 · 38 comments · Fixed by #25088


System Info

  • Code: current main branch, installed via pip install git+https://github.com/huggingface/transformers on 22nd March 2023

Who can help?

@ArthurZucker @sgugger @zphang

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Code to reproduce:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
  • Where this causes an issue:
batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

The above statement raises an error:

Using pad_token, but it is not set yet.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 1
----> 1 batch = tokenizer(
      2     [
      3         "Singer Billy Joel yesterday ",
      4         "The primary use of LLaMA is research on large language "
      5     ],
      6     return_tensors="pt",
      7     padding=True
      8 )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2531, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2529     if not self._in_target_context_manager:
   2530         self._switch_to_input_mode()
-> 2531     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2532 if text_target is not None:
   2533     self._switch_to_target_mode()

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2617, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2612         raise ValueError(
   2613             f"batch length of `text`: {len(text)} does not match batch length of `text_pair`:"
   2614             f" {len(text_pair)}."
   2615         )
   2616     batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 2617     return self.batch_encode_plus(
   2618         batch_text_or_text_pairs=batch_text_or_text_pairs,
   2619         add_special_tokens=add_special_tokens,
   2620         padding=padding,
   2621         truncation=truncation,
   2622         max_length=max_length,
   2623         stride=stride,
   2624         is_split_into_words=is_split_into_words,
   2625         pad_to_multiple_of=pad_to_multiple_of,
   2626         return_tensors=return_tensors,
   2627         return_token_type_ids=return_token_type_ids,
   2628         return_attention_mask=return_attention_mask,
   2629         return_overflowing_tokens=return_overflowing_tokens,
   2630         return_special_tokens_mask=return_special_tokens_mask,
   2631         return_offsets_mapping=return_offsets_mapping,
   2632         return_length=return_length,
   2633         verbose=verbose,
   2634         **kwargs,
   2635     )
   2636 else:
   2637     return self.encode_plus(
   2638         text=text,
   2639         text_pair=text_pair,
   (...)
   2655         **kwargs,
   2656     )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2799, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2782 """
   2783 Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
   2784 
   (...)
   2795         details in `encode_plus`).
   2796 """
   2798 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 2799 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
   2800     padding=padding,
   2801     truncation=truncation,
   2802     max_length=max_length,
   2803     pad_to_multiple_of=pad_to_multiple_of,
   2804     verbose=verbose,
   2805     **kwargs,
   2806 )
   2808 return self._batch_encode_plus(
   2809     batch_text_or_text_pairs=batch_text_or_text_pairs,
   2810     add_special_tokens=add_special_tokens,
   (...)
   2825     **kwargs,
   2826 )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2436, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
   2434 # Test if we have a padding token
   2435 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
-> 2436     raise ValueError(
   2437         "Asking to pad but the tokenizer does not have a padding token. "
   2438         "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
   2439         "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."
   2440     )
   2442 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
   2443 if (
   2444     truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
   2445     and padding_strategy != PaddingStrategy.DO_NOT_PAD
   (...)
   2448     and (max_length % pad_to_multiple_of != 0)
   2449 ):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Expected behavior

The following code should work:

from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)
@adivekar-utexas changed the title from "LlamaTokenizer has no pad, bos or eos token, leading to failure during batch-tokenization" to "LlamaTokenizer has no pad token, leading to failure during batch-tokenization" on Mar 22, 2023
@adivekar-utexas (Author) commented Mar 22, 2023

  • Possible root cause:

I don't see a padding token set anywhere:

A number of LLaMA libraries seem to set these IDs from the sentencepiece tokenizer.model: https://github.com/markasoftware/llama-cpu/blob/main/llama/tokenizer.py#L24

For me, running the following (with sp_model loaded from the LLaMA tokenizer.model via sentencepiece) yields:

>>> import sentencepiece as spm
>>> sp_model = spm.SentencePieceProcessor(model_file="tokenizer.model")  # path to the LLaMA sentencepiece model file
>>> print(sp_model.bos_id(), sp_model.eos_id(), sp_model.pad_id())
1 2 -1

...which makes me believe the original tokenizer does not have a pad token? This is confirmed by the following:

sp_model.id_to_piece(1)  ## '<s>', which is the bos token for LLaMa
sp_model.id_to_piece(2)  ## '</s>', which is the eos token for LLaMa
sp_model.id_to_piece(-1)  ## Throws: IndexError: piece id is out of range.

Additional confirmation:

vocab: Dict[str, int] = {sp_model.id_to_piece(id): id for id in range(sp_model.get_piece_size())}
print(vocab['<s>'])  ##  1
print(vocab['</s>'])  ##  2
print(vocab['<unk>'])  ##  0
print(vocab['<pad>'])  ##  KeyError: '<pad>'

@ArthurZucker (Collaborator) commented:

Hey, indeed the original sentencepiece model does not have a padding token. You can probably pad using the eos_token like it is done for GPT-2; I need to check what is mentioned in the paper, but the llama code does not seem to use the pad_token.
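For reference, the usual GPT-2 recipe looks roughly like the sketch below (a minimal example; it assumes the tokenizer's eos_token is configured correctly, which is not the case for the decapoda checkpoint discussed in this issue):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token, so the eos token is commonly reused for padding
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # padded positions are masked with 0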

@sgugger (Collaborator) commented Mar 22, 2023

Yes, I don't think the original model has a padding token. The same code with GPT-2 will fail; you need to add the pad token yourself, as indicated by the error message.

@adivekar-utexas (Author) commented Mar 22, 2023

So attempting to set the PAD token as the EOS token (i.e. '') fails with the same error message:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
print()
tokenizer.pad_token = tokenizer.eos_token

print(repr(tokenizer.pad_token)) ## ''
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''


batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

Error (identical traceback to the one above):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

@adivekar-utexas (Author) commented:

Can you share a link on how GPT2 does it?

@adivekar-utexas (Author) commented:

I can confirm that the following works:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
print()
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

print(repr(tokenizer.pad_token)) ## ''
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''

batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

@amyeroberts (Collaborator) commented:

Glad that it's now working.

As an explanation: the error arises when using tokenizer.pad_token = tokenizer.eos_token because self.pad_token is set to an empty string, which evaluates as False in this check. This seems like an expected exception, as it's not possible to pad with an empty string.

In the working example, I think the second print of the pad token should show:
print(repr(tokenizer.pad_token)) ## '[PAD]'
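As a tiny illustration of why the empty-string pad token trips that check (purely illustrative, mirroring the condition quoted in the traceback above):

pad_token = ""          # what tokenizer.pad_token = tokenizer.eos_token produced here
print(not pad_token)    # True -> treated as "no padding token", so the ValueError is raised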

@sgugger (Collaborator) commented Mar 22, 2023

Note that the EOS token returned by tokenizer.eos_token is wrong in any case (this is a known issue and @ArthurZucker should fix this). The EOS token is not "" but "<s>". Once this issue is fixed, doing tokenizer.pad_token = tokenizer.eos_token will be possible.

@suhaskowshik commented Mar 24, 2023

There is also a weird issue where the vocab size increases differently depending on how we add the pad token.

Method 1:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.pad_token='[PAD]'
print(f"pad_token_id={tokenizer.pad_token_id}") #prints 0
print(f"vocab length={len(tokenizer.get_vocab())}") #prints 32000

Method 2
from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
num_spl_tokens_added=tokenizer.add_special_tokens({'pad_token': '[PAD]'}) #returns 1
print(f"pad_token_id={tokenizer.pad_token_id}") #prints 32000
print(f"vocab length={len(tokenizer.get_vocab())}") #prints 32001

Why is there this discrepancy between tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and tokenizer.pad_token = '[PAD]'?

Downstream issues:
The Stanford Alpaca model independently trained on decapoda-research/llama-7b-hf at "chavinlo/alpaca-native" uses tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and hence the model's vocab size is set to 32001.

@yukw777 commented Mar 27, 2023

I think #22402 should fix this?


@basujindal commented:

I am sorry if this is the wrong question, but don't we need a padding token to train the model with bs > 1, or are sentences concatenated together, separated by the eos token, during training?

@Axe-- commented May 21, 2023

@basujindal

My general understanding is that for bs > 1 we need to pad during finetuning. However, in pretraining the input text is set to max length -- you can think of a sliding window over a large text corpus.
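To make that concrete, here is a toy sketch of the packing idea (a hypothetical helper, not the actual pretraining code): documents are concatenated with eos separators and sliced into fixed-length blocks, so no pad token is needed:

def pack_into_blocks(token_id_lists, eos_id, block_size):
    # concatenate all documents into one stream, separating them with eos
    stream = []
    for ids in token_id_lists:
        stream.extend(ids + [eos_id])
    # slice the stream into fixed-length training examples (the remainder is dropped)
    return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]

blocks = pack_into_blocks([[5, 6, 7], [8, 9], [10, 11, 12, 13]], eos_id=2, block_size=4)
print(blocks)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]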

@ArthurZucker (Collaborator) commented:

Exactly! This was fixed in #22402 so keeping it closed!

@christoukmaji (Contributor) commented May 29, 2023

(quoting the earlier comment from @suhaskowshik about the vocab-size discrepancy between tokenizer.pad_token = '[PAD]' and tokenizer.add_special_tokens({'pad_token': '[PAD]'}))

It seems as if this discrepancy is intentional. With transformers==4.30.0.dev0:

from transformers import (
    LlamaForCausalLM, 
    LlamaTokenizer
)
tokenizer = LlamaTokenizer.from_pretrained("/root/HF_llama")
model = LlamaForCausalLM.from_pretrained("/root/HF_llama").to("cuda")

tokenized_text = tokenizer(["some text", "this will cause padding"], padding = True, return_tensors='pt').to("cuda")
model.generate(tokenized_text['input_ids'])

Output

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

What's the reasoning behind the distinction of the two methods?

@ArthurZucker (Collaborator) commented:

Hey @christoukmaji, this kind of question should be asked on the forum.
The first method will set pad_token_id to 2 while the other will give a different index.

@Sushentsev commented:

Note that the EOS token returned by tokenizer.eos_token is wrong in any case (this is a known issue and @ArthurZucker should fix this). The EOS token is not "" but "<s>". Once this issue is fixed, doing tokenizer.pad_token = tokenizer.eos_token will be possible.

I think bos_token = "<s>" and eos_token = "</s>"; you have a mistake.

@JaejinCho commented:

(quoting the earlier comment from @suhaskowshik about the vocab-size discrepancy between tokenizer.pad_token = '[PAD]' and tokenizer.add_special_tokens({'pad_token': '[PAD]'}))

So what is the difference between the two, and which would be the appropriate practice?

@ArthurZucker (Collaborator) commented:

Method 1 does not really work if you want to have a different token for padding and <unk>:

>>> from transformers import LlamaTokenizer, LlamaForCausalLM
>>> tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
>>> tokenizer.pad_token='[PAD]' 
>>> tokenizer.pad_token
'[PAD]'
>>> tokenizer.pad_token_id
0
>>> tokenizer.unk_token_id
0

The pad token was not added but just set, which means it is unknown and will always be encoded as 0.
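To see the difference concretely, a short sketch (same checkpoint as above):

from transformers import LlamaTokenizer

# Method 1: only *setting* the attribute -- "[PAD]" is not in the vocab, so it encodes as <unk> (id 0)
tok1 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok1.pad_token = "[PAD]"
print(tok1.convert_tokens_to_ids("[PAD]"))  # 0, the <unk> id

# Method 2: *adding* the token appends a new entry to the vocab
tok2 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok2.add_special_tokens({"pad_token": "[PAD]"})
print(tok2.convert_tokens_to_ids("[PAD]"))  # 32000, a genuinely new id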

@brando90 commented Jul 8, 2023

The solution suggested here doesn't work, afaik, if the model doesn't have that token, right?

see: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/76639568#76639568

@ArthurZucker (Collaborator) commented:

Given the recent release of Llama 2, and in light of the fact that resizing from 32K to 32K+1 can make inference and training slower, we will support padding_index=-1. I'll be working on this soon!

@brando90 commented Jul 25, 2023 via email

@ArthurZucker (Collaborator) commented:

If you set the padding index of the token embedding layer to -1, you don't need to change the size of the vocab, neither for the model nor for the tokenizer. The embedding layer will output zeros when it sees the padding token, as it is supposed to and as is implemented in the original Llama codebase!
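A minimal PyTorch sketch of that padding_idx behaviour (illustrative only, not the transformers implementation; id 0 stands in for the pad id here):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
ids = torch.tensor([[3, 5, 0, 0]])  # the trailing zeros play the role of padding
out = emb(ids)
print(out[0, 2])  # all zeros: the padding row is zero and receives no gradient updates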

@ArthurZucker (Collaborator) commented:

If you want to follow the progress: #25088

@kaushal-idx commented:

@ArthurZucker is the padding problem solved? How do we have to set the pad token?

@ArthurZucker (Collaborator) commented:

Hey! The PR is not merged yet; it should be by the end of the week!

@kaushal-idx commented:

Great, thank you!

@kwood commented Aug 17, 2023

@ArthurZucker looks like it's merged now — thanks for fixing this!

The PR seems to add pad_to_multiple_of — it's a little unclear to me how that fixes this issue. Will llama-2's tokenizer work with batch inference out of the box with this change, or do we need to do something to configure the padding still?

@ArthurZucker (Collaborator) commented:

Yes! The idea is that depending on your hardware, you should choose a pad_to_multiple_of value. This is for people who need performance optimisation. Otherwise, just add a padding token and resize normally. Gonna add a little bit of doc today about this!

@kwood commented Aug 18, 2023

I guess what's unclear is how pad_to_multiple_of addresses the issue you highlighted in your previous comment:

in the light of the fact that resizing from 32K to 32K+1 can make inference and training slower, will support padding_index=-1

I thought the problem here was that we can't add a padding token without going to 32K+1, and using an existing token such as eos or unk is sub-optimal because that was not how the model was trained.

@ArthurZucker (Collaborator) commented:

The whole reason for having pad_to_multiple_of is that it will not slow down inference if you pad the embedding matrix to a multiple that is optimised for your SMs. The idea is that if you have a GPU optimized for 32K, it will be just as fast on 32064 (just an example), without slowing the training down. Did you try to read the page linked in the documentation of pad_to_multiple_of?

If set will pad the embedding matrix to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128. For more
details about this, or help on choosing the correct value for resizing, refer to this guide:
https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
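Putting the two steps together, a sketch of the pattern described above (the checkpoint name and the multiple of 64 are placeholders; pick a multiple suited to your hardware):

from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

tokenizer.add_special_tokens({"pad_token": "<pad>"})
# resize the embeddings, rounding up to a hardware-friendly multiple instead of exactly 32001
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)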

@cringelord000222 commented:

(quoting the previous reply about pad_to_multiple_of and the NVIDIA Tensor Core requirements guide)

Hi, I just discovered this today and would like to ask: what should I set pad_to_multiple_of to? Should it be 16 or 8?

FYI, I'm currently using a single RTX 4090.
I usually load my model in 4-bit using load_in_4bit=True from bitsandbytes.
This is my snippet:

model_id="meta-llama/Llama-2-13b-chat-hf"
tokenizer=AutoTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token":"<pad>"})

quant_config=BitsAndBytesConfig(
    # load_in_8bit=True,
    # llm_int8_threshold=6.0,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_type=torch.bfloat16,
)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.resize_token_embeddings(len(tokenizer),pad_to_multiple_of=8)

@ArthurZucker (Collaborator) commented:

Not really, as int4 is not included in the table; I would guess 32, but it's a pure guess. Feel free to ask this on the forum, where the community will benefit from this question!

@bodasadallah (Contributor) commented:

@ArthurZucker I still don't understand why setting pad_token to any special token (for example, eos) is not optimal, since the tokenizer will generate an attention mask that ignores padding tokens.

@ArthurZucker (Collaborator) commented:

Well, that is because sometimes you need the eos to be attended to. If you train, you need to make sure the model attends to the eos, no? If you pad with eos then eos is pad and is ignored. Does that make sense?

@RaminZi commented Jan 30, 2024

@ArthurZucker
Can someone please explain this generation warning: "Setting pad_token_id to eos_token_id:2 for open-end generation."
What is the difference between pad_token_id in model.generate() and the pad_token of the tokenizer? Why do we have to define both separately? Is the generation pad_token_id used for batched inference? If there's padding in the generation phase, why, unlike the tokenizer, don't we need to set a padding side or length, only the pad_token_id?
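For what it's worth, that warning just means generate() falls back to eos for padding sequences that finish early in a batch; passing pad_token_id explicitly silences it. A hedged sketch (gpt2 is used only as a small placeholder model):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "gpt2"  # placeholder; the same idea applies to Llama checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models are usually left-padded for generation

inputs = tokenizer(["Hello", "A longer prompt here"], padding=True, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id,  # pads finished sequences; silences the warning
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))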

@bodasadallah (Contributor) commented:

Well, that is because sometimes you need the eos to be attended to. If you train, you need to make sure the model attends to the eos, no? If you pad with eos then eos is pad and is ignored. Does that make sense?

Yes. But when you set the padding token to be eos, the attention mask will still be 1 for the eos token and 0 for the padding tokens. It's as if the tokenizer knows which tokens are padding and ignores them regardless of what value they are set to.
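One place where pad == eos does bite is the common label-masking recipe in causal-LM fine-tuning (a hedged sketch of that recipe, not of anything specific to the tokenizer internals):

import torch

# toy batch: a real eos (id 2) ends the sequence, then padding that reuses eos as the pad token
input_ids = torch.tensor([[5, 6, 7, 2, 2, 2]])
pad_token_id = 2  # pad == eos

# the usual recipe masks every pad position out of the loss...
labels = input_ids.clone()
labels[labels == pad_token_id] = -100

print(labels)  # tensor([[   5,    6,    7, -100, -100, -100]])
# ...which also masks the genuine end-of-sequence token, so the model is never
# trained to emit eos. A dedicated pad token avoids this.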

@panFJCharlotte98 commented:

Can anyone summarise the best practice so far for batched inference (like batched chat completion)? 1) How should the padding token be set: add a new <PAD> token and resize the embedding layer, or just use the default EOS token (which is commonly done)? 2) How should the padding side be set: is left padding the reasonable default with decoder-only models, or should we try right padding and see which gives the better result?
