
LlamaTokenizer has no pad token, leading to failure during batch-tokenization #22312

Closed
2 of 4 tasks
adivekar-utexas opened this issue Mar 22, 2023 · 38 comments · Fixed by #25088


System Info

  • Code: current main branch, installed via pip install git+https://github.com/huggingface/transformers on 22nd March 2023

Who can help?

@ArthurZucker @sgugger @zphang

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • Code to reproduce:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
  • Where this causes an issue:
batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

The above statement raises an error:

Using pad_token, but it is not set yet.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[53], line 1
----> 1 batch = tokenizer(
      2     [
      3         "Singer Billy Joel yesterday ",
      4         "The primary use of LLaMA is research on large language "
      5     ],
      6     return_tensors="pt",
      7     padding=True
      8 )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2531, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2529     if not self._in_target_context_manager:
   2530         self._switch_to_input_mode()
-> 2531     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   2532 if text_target is not None:
   2533     self._switch_to_target_mode()

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2617, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2612         raise ValueError(
   2613             f"batch length of `text`: {len(text)} does not match batch length of `text_pair`:"
   2614             f" {len(text_pair)}."
   2615         )
   2616     batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 2617     return self.batch_encode_plus(
   2618         batch_text_or_text_pairs=batch_text_or_text_pairs,
   2619         add_special_tokens=add_special_tokens,
   2620         padding=padding,
   2621         truncation=truncation,
   2622         max_length=max_length,
   2623         stride=stride,
   2624         is_split_into_words=is_split_into_words,
   2625         pad_to_multiple_of=pad_to_multiple_of,
   2626         return_tensors=return_tensors,
   2627         return_token_type_ids=return_token_type_ids,
   2628         return_attention_mask=return_attention_mask,
   2629         return_overflowing_tokens=return_overflowing_tokens,
   2630         return_special_tokens_mask=return_special_tokens_mask,
   2631         return_offsets_mapping=return_offsets_mapping,
   2632         return_length=return_length,
   2633         verbose=verbose,
   2634         **kwargs,
   2635     )
   2636 else:
   2637     return self.encode_plus(
   2638         text=text,
   2639         text_pair=text_pair,
   (...)
   2655         **kwargs,
   2656     )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2799, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2782 """
   2783 Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
   2784 
   (...)
   2795         details in `encode_plus`).
   2796 """
   2798 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 2799 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
   2800     padding=padding,
   2801     truncation=truncation,
   2802     max_length=max_length,
   2803     pad_to_multiple_of=pad_to_multiple_of,
   2804     verbose=verbose,
   2805     **kwargs,
   2806 )
   2808 return self._batch_encode_plus(
   2809     batch_text_or_text_pairs=batch_text_or_text_pairs,
   2810     add_special_tokens=add_special_tokens,
   (...)
   2825     **kwargs,
   2826 )

File /home/ec2-user/anaconda3/envs/llm-gen/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:2436, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
   2434 # Test if we have a padding token
   2435 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
-> 2436     raise ValueError(
   2437         "Asking to pad but the tokenizer does not have a padding token. "
   2438         "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
   2439         "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."
   2440     )
   2442 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
   2443 if (
   2444     truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
   2445     and padding_strategy != PaddingStrategy.DO_NOT_PAD
   (...)
   2448     and (max_length % pad_to_multiple_of != 0)
   2449 ):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

Expected behavior

The following code should work:

from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)
@adivekar-utexas changed the title from "LlamaTokenizer has no pad, bos or eos token, leading to failure during batch-tokenization" to "LlamaTokenizer has no pad token, leading to failure during batch-tokenization" on Mar 22, 2023
@adivekar-utexas (Author) commented Mar 22, 2023

  • Possible root cause:

I don't see a padding token set anywhere:

A number of LLaMA libraries seem to set these IDs from the sentencepiece tokenizer.model: https://github.com/markasoftware/llama-cpu/blob/main/llama/tokenizer.py#L24

For me, running the following (with sp_model loaded from the LLaMA tokenizer.model via sentencepiece) yields:

>>> import sentencepiece as spm
>>> sp_model = spm.SentencePieceProcessor(model_file="tokenizer.model")  # path to the LLaMA sentencepiece model file
>>> print(sp_model.bos_id(), sp_model.eos_id(), sp_model.pad_id())
1 2 -1

...which makes me believe the original tokenizer does not have a pad token? This is confirmed by the following:

sp_model.id_to_piece(1)  ## '<s>', which is the bos token for LLaMa
sp_model.id_to_piece(2)  ## '</s>', which is the eos token for LLaMa
sp_model.id_to_piece(-1)  ## Throws: IndexError: piece id is out of range.

Additional confirmation:

vocab: Dict[str, int] = {sp_model.id_to_piece(id): id for id in range(sp_model.get_piece_size())}
print(vocab['<s>'])  ##  1
print(vocab['</s>'])  ##  2
print(vocab['<unk>'])  ##  0
print(vocab['<pad>'])  ##  KeyError: '<pad>'

@ArthurZucker (Collaborator) commented:

Hey, indeed the original sentencepiece model does not have a padding token. You can probably pad using the eos_token like it is done for GPT-2; I need to check what is mentioned in the paper, but the llama code does not seem to use the pad_token.
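For reference, the usual GPT-2 recipe looks roughly like the sketch below (a minimal example; it assumes the tokenizer's eos_token is configured correctly, which is not the case for the decapoda checkpoint discussed in this issue):

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 ships without a pad token, so the eos token is commonly reused for padding
tokenizer.pad_token = tokenizer.eos_token

batch = tokenizer(
    ["a short sentence", "a somewhat longer sentence that forces padding"],
    padding=True,
    return_tensors="pt",
)
print(batch["attention_mask"])  # padded positions are masked with 0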

@sgugger (Collaborator) commented Mar 22, 2023

Yes, I don't think the original model has a padding token. The same code with GPT-2 will fail; you need to add the pad token yourself, as indicated by the error message.

@adivekar-utexas (Author) commented Mar 22, 2023

So attempting to set the PAD token as the EOS token (i.e. '') fails with the same error message:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
print()
tokenizer.pad_token = tokenizer.eos_token

print(repr(tokenizer.pad_token)) ## ''
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''


batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

Error (identical traceback to the one above):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

@adivekar-utexas (Author) commented:

Can you share a link on how GPT2 does it?

@adivekar-utexas (Author) commented:

I can confirm that the following works:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")

print(repr(tokenizer.pad_token)) ## None
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''
print()
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

print(repr(tokenizer.pad_token)) ## ''
print(repr(tokenizer.bos_token)) ## ''
print(repr(tokenizer.eos_token)) ## ''

batch = tokenizer(
    [
        "Singer Billy Joel yesterday ",
        "The primary use of LLaMA is research on large language "
    ],
    return_tensors="pt",
    padding=True
)

@amyeroberts (Collaborator) commented:

Glad that it's now working.

As an explanation: the error arises when using tokenizer.pad_token = tokenizer.eos_token because self.pad_token is set to an empty string, which evaluates as False in this check. This seems like an expected exception, as it's not possible to pad with an empty string.

In the working example, I think the second print of the pad token should show:
print(repr(tokenizer.pad_token)) ## '[PAD]'
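As a tiny illustration of why the empty-string pad token trips that check (purely illustrative, mirroring the condition quoted in the traceback above):

pad_token = ""          # what tokenizer.pad_token = tokenizer.eos_token produced here
print(not pad_token)    # True -> treated as "no padding token", so the ValueError is raised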

@sgugger (Collaborator) commented Mar 22, 2023

Note that the EOS token returned by tokenizer.eos_token is wrong in any case (this is a known issue and @ArthurZucker should fix this). The EOS token is not "" but "<s>". Once this issue is fixed, doing tokenizer.pad_token = tokenizer.eos_token will be possible.

@suhaskowshik commented Mar 24, 2023

There is also a weird issue where the vocab size increases differently depending on how we add the pad token.

Method 1:

from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tokenizer.pad_token='[PAD]'
print(f"pad_token_id={tokenizer.pad_token_id}") #prints 0
print(f"vocab length={len(tokenizer.get_vocab())}") #prints 32000

Method 2
from transformers import LlamaTokenizer, LlamaForCausalLM
tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
num_spl_tokens_added=tokenizer.add_special_tokens({'pad_token': '[PAD]'}) #returns 1
print(f"pad_token_id={tokenizer.pad_token_id}") #prints 32000
print(f"vocab length={len(tokenizer.get_vocab())}") #prints 32001

Why is there this discrepancy between tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and tokenizer.pad_token = '[PAD]'?

Downstream issues:
The Stanford Alpaca model independently trained on decapoda-research/llama-7b-hf at "chavinlo/alpaca-native" uses tokenizer.add_special_tokens({'pad_token': '[PAD]'}) and hence the model's vocab size is set to 32001.

@yukw777 commented Mar 27, 2023

I think #22402 should fix this?


@basujindal commented:

I am sorry if this is the wrong question, but don't we need a padding token to train the model with bs > 1, or are sentences concatenated together, separated by the eos token, during training?

@Axe-- commented May 21, 2023

@basujindal

My general understanding is that for bs > 1 we need to pad during finetuning. However, in pretraining the input text is set to max length -- you can think of a sliding window over a large text corpus.
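To make that concrete, here is a toy sketch of the packing idea (a hypothetical helper, not the actual pretraining code): documents are concatenated with eos separators and sliced into fixed-length blocks, so no pad token is needed:

def pack_into_blocks(token_id_lists, eos_id, block_size):
    # concatenate all documents into one stream, separating them with eos
    stream = []
    for ids in token_id_lists:
        stream.extend(ids + [eos_id])
    # slice the stream into fixed-length training examples (the remainder is dropped)
    return [stream[i:i + block_size] for i in range(0, len(stream) - block_size + 1, block_size)]

blocks = pack_into_blocks([[5, 6, 7], [8, 9], [10, 11, 12, 13]], eos_id=2, block_size=4)
print(blocks)  # [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]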

@ArthurZucker (Collaborator) commented:

Exactly! This was fixed in #22402 so keeping it closed!

@christoukmaji (Contributor) commented May 29, 2023

(quoting the earlier comment from @suhaskowshik about the vocab-size discrepancy between tokenizer.pad_token = '[PAD]' and tokenizer.add_special_tokens({'pad_token': '[PAD]'}))

It seems as if this discrepancy is intentional. With transformers==4.30.0.dev0:

from transformers import (
    LlamaForCausalLM, 
    LlamaTokenizer
)
tokenizer = LlamaTokenizer.from_pretrained("/root/HF_llama")
model = LlamaForCausalLM.from_pretrained("/root/HF_llama").to("cuda")

tokenized_text = tokenizer(["some text", "this will cause padding"], padding = True, return_tensors='pt').to("cuda")
model.generate(tokenized_text['input_ids'])

Output

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

What's the reasoning behind the distinction of the two methods?

@ArthurZucker (Collaborator) commented:

Hey @christoukmaji, this kind of question should be asked on the forum.
The first method will set pad_token_id to 2 while the other will give a different index.

@Sushentsev commented:

Note that the EOS token returned by tokenizer.eos_token is wrong in any case (this is a known issue and @ArthurZucker should fix this). The EOS token is not "" but "<s>". Once this issue is fixed, doing tokenizer.pad_token = tokenizer.eos_token will be possible.

I think bos_token = "<s>" and eos_token = "</s>"; you have a mistake.

@JaejinCho commented:

(quoting the earlier comment from @suhaskowshik about the vocab-size discrepancy between tokenizer.pad_token = '[PAD]' and tokenizer.add_special_tokens({'pad_token': '[PAD]'}))

So what is the difference between the two, and which would be the appropriate practice?

@ArthurZucker (Collaborator) commented:

Method 1 does not really work if you want to have a different token for padding and <unk>:

>>> from transformers import LlamaTokenizer, LlamaForCausalLM
>>> tokenizer = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
>>> tokenizer.pad_token='[PAD]' 
>>> tokenizer.pad_token
'[PAD]'
>>> tokenizer.pad_token_id
0
>>> tokenizer.unk_token_id
0

The pad token was not added but just set, which means it is unknown and will always be encoded as 0.
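To see the difference concretely, a short sketch (same checkpoint as above):

from transformers import LlamaTokenizer

# Method 1: only *setting* the attribute -- "[PAD]" is not in the vocab, so it encodes as <unk> (id 0)
tok1 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok1.pad_token = "[PAD]"
print(tok1.convert_tokens_to_ids("[PAD]"))  # 0, the <unk> id

# Method 2: *adding* the token appends a new entry to the vocab
tok2 = LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf")
tok2.add_special_tokens({"pad_token": "[PAD]"})
print(tok2.convert_tokens_to_ids("[PAD]"))  # 32000, a genuinely new id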

@brando90 commented Jul 8, 2023

The solution suggested here doesn't work, afaik, if the model doesn't have that token, right?

see: https://stackoverflow.com/questions/76633368/why-does-the-falcon-qlora-tutorial-code-use-eos-token-as-pad-token/76639568#76639568

@ArthurZucker (Collaborator) commented:

Given the recent release of Llama 2, and in light of the fact that resizing from 32K to 32K+1 can make inference and training slower, we will support padding_index=-1. I'll be working on this soon!

@brando90 commented Jul 25, 2023 via email

@ArthurZucker (Collaborator) commented:

If you set the padding index of the token embedding layer to -1, you don't need to change the size of the vocab, neither for the model nor for the tokenizer. The embedding layer will output zeros when it sees the padding token, as it is supposed to and as is implemented in the original Llama codebase!
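A minimal PyTorch sketch of that padding_idx behaviour (illustrative only, not the transformers implementation; id 0 stands in for the pad id here):

import torch
import torch.nn as nn

emb = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
ids = torch.tensor([[3, 5, 0, 0]])  # the trailing zeros play the role of padding
out = emb(ids)
print(out[0, 2])  # all zeros: the padding row is zero and receives no gradient updates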

@ArthurZucker (Collaborator) commented:

If you want to follow the progress: #25088

@kaushal-idx commented:

@ArthurZucker is the padding problem solved? How do we have to set the pad token?

@ArthurZucker (Collaborator) commented:

Hey! The PR is not merged yet; it should be by the end of the week!

@kaushal-idx commented:

Great, thank you!

@kwood commented Aug 17, 2023

@ArthurZucker looks like it's merged now — thanks for fixing this!

The PR seems to add pad_to_multiple_of — it's a little unclear to me how that fixes this issue. Will llama-2's tokenizer work with batch inference out of the box with this change, or do we need to do something to configure the padding still?

@ArthurZucker (Collaborator) commented:

Yes! The idea is that depending on your hardware, you should choose a pad_to_multiple_of value. This is for people who need performance optimisation. Otherwise, just add a padding token and resize normally. Gonna add a little bit of doc today about this!

@kwood commented Aug 18, 2023

I guess what's unclear is how pad_to_multiple_of addresses the issue you highlighted in your previous comment:

in the light of the fact that resizing from 32K to 32K+1 can make inference and training slower, will support padding_index=-1

I thought the problem here was that we can't add a padding token without going to 32K+1, and using an existing token such as eos or unk is sub-optimal because that was not how the model was trained.

@ArthurZucker (Collaborator) commented:

The whole reason for having pad_to_multiple_of is that it will not slow down inference if you pad the embedding matrix to a multiple that is optimised for your SMs. The idea is that if you have a GPU optimized for 32K, it will be just as fast on 32064 (just an example), without slowing the training down. Did you try to read the page linked in the documentation of pad_to_multiple_of?

If set will pad the embedding matrix to a multiple of the provided value.
This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
>= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128. For more
details about this, or help on choosing the correct value for resizing, refer to this guide:
https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc
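Putting the two steps together, a sketch of the pattern described above (the checkpoint name and the multiple of 64 are placeholders; pick a multiple suited to your hardware):

from transformers import AutoTokenizer, LlamaForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = LlamaForCausalLM.from_pretrained(model_id)

tokenizer.add_special_tokens({"pad_token": "<pad>"})
# resize the embeddings, rounding up to a hardware-friendly multiple instead of exactly 32001
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)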

@cringelord000222 commented:

(quoting the previous reply about pad_to_multiple_of and the NVIDIA Tensor Core requirements guide)

Hi, I just discovered this today and would like to ask: what should I set pad_to_multiple_of to? Should it be 16 or 8?

FYI, I'm currently using a single RTX 4090.
I usually load my model in 4-bit using load_in_4bit=True from bitsandbytes.
This is my snippet:

model_id="meta-llama/Llama-2-13b-chat-hf"
tokenizer=AutoTokenizer.from_pretrained(model_id)
tokenizer.add_special_tokens({"pad_token":"<pad>"})

quant_config=BitsAndBytesConfig(
    # load_in_8bit=True,
    # llm_int8_threshold=6.0,
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_type=torch.bfloat16,
)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.float16,
)
model.resize_token_embeddings(len(tokenizer),pad_to_multiple_of=8)

@ArthurZucker (Collaborator) commented:

Not really, as int4 is not included in the table; I would guess 32, but it's a pure guess. Feel free to ask this on the forum, where the community will benefit from this question!

@bodasadallah (Contributor) commented:

@ArthurZucker I still don't understand why setting pad_token to any special token (for example, eos) is not optimal, since the tokenizer will generate an attention mask that ignores padding tokens.

@ArthurZucker (Collaborator) commented:

Well, that is because sometimes you need the eos to be attended to. If you train, you need to make sure the model attends to the eos, no? If you pad with eos then eos is pad and is ignored. Does that make sense?

@RaminZi commented Jan 30, 2024

@ArthurZucker
Can someone please explain this generation warning: "Setting pad_token_id to eos_token_id:2 for open-end generation."
What is the difference between pad_token_id in model.generate() and the pad_token of the tokenizer? Why do we have to define both separately? Is the generation pad_token_id used for batched inference? If there's padding in the generation phase, why, unlike the tokenizer, don't we need to set a padding side or length, only the pad_token_id?
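For what it's worth, that warning just means generate() falls back to eos for padding sequences that finish early in a batch; passing pad_token_id explicitly silences it. A hedged sketch (gpt2 is used only as a small placeholder model):

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "gpt2"  # placeholder; the same idea applies to Llama checkpoints
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"  # decoder-only models are usually left-padded for generation

inputs = tokenizer(["Hello", "A longer prompt here"], padding=True, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    pad_token_id=tokenizer.eos_token_id,  # pads finished sequences; silences the warning
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))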

@bodasadallah (Contributor) commented:

Well, that is because sometimes you need the eos to be attended to. If you train, you need to make sure the model attends to the eos, no? If you pad with eos then eos is pad and is ignored. Does that make sense?

Yes. But when you set the padding token to be eos, the attention mask will still be 1 for the eos token and 0 for the padding tokens. It's as if the tokenizer knows which tokens are padding and ignores them regardless of what value they are set to.
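One place where pad == eos does bite is the common label-masking recipe in causal-LM fine-tuning (a hedged sketch of that recipe, not of anything specific to the tokenizer internals):

import torch

# toy batch: a real eos (id 2) ends the sequence, then padding that reuses eos as the pad token
input_ids = torch.tensor([[5, 6, 7, 2, 2, 2]])
pad_token_id = 2  # pad == eos

# the usual recipe masks every pad position out of the loss...
labels = input_ids.clone()
labels[labels == pad_token_id] = -100

print(labels)  # tensor([[   5,    6,    7, -100, -100, -100]])
# ...which also masks the genuine end-of-sequence token, so the model is never
# trained to emit eos. A dedicated pad token avoids this.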

@panFJCharlotte98 commented:

Can anyone summarise the best practice so far for batched inference (like batched chat completion)? 1) How should the padding token be set: add a new <PAD> token and resize the embedding layer, or just use the default EOS token (which is commonly done)? 2) How should the padding side be set: is left padding the reasonable default with decoder-only models, or should we try right padding and see which gives the better result?
