
AssertionError in Token Generation with Custom Tokenizer in Llama-CPP #494

Closed
grimavatar opened this issue Dec 2, 2023 · 0 comments

Issue Overview

An AssertionError is raised when integrating the Transformers tokenizer with Llama-CPP in a custom chat model. The objective is to replace the Llama-CPP tokenizer, which has shown limitations, with the Transformers tokenizer for improved token generation.

Environment

  • Operating System: macOS (Intel architecture)
  • Python Version: 3.11.5
  • Relevant Libraries:
    • guidance: 0.1.6
    • numpy: 1.26.2
    • torch: 2.1.1
    • transformers: 4.35.2
    • llama-cpp-python: 0.2.20

Error Description

The error occurs when executing the line lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>'). The full traceback is as follows:

Traceback (most recent call last):
  File "/Users/zero/llama/guide.py", line 119, in <module>
    lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>')
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 262, in __add__
    out = lm._run_stateless(value)
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 401, in _run_stateless
    for new_bytes, is_generated, new_bytes_log_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 552, in __call__
    token_ids,token_byte_positions = self._cleanup_tokens(token_ids,token_byte_positions)
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 534, in _cleanup_tokens
    assert token_byte_positions[-1] == last_pos
AssertionError
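
The failing assertion in _cleanup_tokens checks that the byte positions guidance derives from the token sequence end exactly at the byte length of the prompt so far. The sketch below is not guidance's actual implementation; it only illustrates that invariant, assuming a hypothetical detokenize callable that maps token IDs back to bytes:

# Rough illustration (not guidance's real code) of the invariant asserted in
# _cleanup_tokens: cumulative byte positions computed from the token IDs must
# end at the byte length of the text they were produced from.
def byte_positions(token_ids, detokenize):
    # detokenize is a hypothetical callable: list[int] -> bytes
    positions, pos = [], 0
    for tid in token_ids:
        pos += len(detokenize([tid]))
        positions.append(pos)
    return positions

# If the substituted Transformers tokenizer emits IDs whose detokenized bytes
# do not add up to the original prompt bytes (for example because special
# tokens such as '<|im_end|>' are split or mapped differently), the final
# position no longer matches the expected last position and the assertion fails.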

Expected Behavior

The custom tokenizer should work seamlessly with Llama-CPP, so that token generation completes without errors.

Steps to Reproduce

  1. Set up the environment with the specified library versions.
  2. Implement a custom tokenizer using Transformers' AutoTokenizer.
  3. Integrate this tokenizer with the Llama-CPP model.
  4. Run the script to generate tokens.

Code Snippet

from llama_cpp import Llama
from transformers import AutoTokenizer
from guidance import models, gen, select
from guidance import system, user, assistant

class OpenHermesTokenizer:
    def __init__(self, *args, **kwargs):
        self.tokenizer = AutoTokenizer.from_pretrained(*args, **kwargs)
        self.tokenizer.pad_token_id = self.tokenizer.unk_token_id
        self.tokenizer.padding_side = 'left'

    # llama.cpp tokenize template
    def encode(self, text: str | bytes, add_bos: bool = True, special: bool = True):
        if isinstance(text, bytes):
            text = text.decode('utf-8', errors = 'ignore')
        return self.tokenizer.encode(text, add_special_tokens = add_bos, padding = True, return_tensors = 'pt')[0, :].tolist()

class OpenHermes25Mistral(models.LlamaCppChat):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def get_role_start(self, role_name, **kwargs):
        if self._current_prompt().endswith('<|im_end|>'):
            return f'\n<|im_start|>{role_name}\n'
        else:
            return f'<|im_start|>{role_name}\n'

    def get_role_end(self, role_name = None):
        return '<|im_end|>'

llama = Llama(
            model_path = 'OpenHermes/openhermes-2.5-mistral-7b.Q4_K_M.gguf',
            n_gpu_layers = 0,
            use_mlock = True,
            seed = 0,
            n_ctx = 2048,
            logits_all = True,
            verbose = False
            )

# Replace llama's native tokenizer with the Transformers tokenizer, adapted to the llama-cpp-python tokenize interface.
# This substitution runs without errors when used with llama-cpp-python on its own.
tokenizer = OpenHermesTokenizer('teknium/OpenHermes-2.5-Mistral-7B', use_fast = True)
llama._model.tokenize = tokenizer.encode

chat_lm = OpenHermes25Mistral(model = llama,
                               temperature = 0.0,
                               top_p = 1.0,
                               min_p = 0.0,
                               typical_p = 1.0,
                               echo = False,
                               repeat_penalty = 1.0,
                               top_k = 0,
                               seed = 0,
                               tfs_z = 1.0,
                               mirostat_mode = 0,
                               mirostat_tau = 0.0,
                               mirostat_eta = 0.0,
                              )

with system():
    lm = chat_lm + 'You are "Hermes 2", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia.'

with user():
    lm += "Hello, who are you?"

with assistant():
    lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>')

response = lm['response']

print(response)
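
One way to narrow this down (a diagnostic sketch, not part of the original report) is to compare how the native llama.cpp tokenizer and the substituted Transformers tokenizer encode the ChatML markers. It reuses the llama and tokenizer objects defined above; the probe string and the special keyword on Llama.tokenize are assumptions.

# Hypothetical diagnostic: compare the two tokenizers on a string containing
# the ChatML markers. A mismatch here would explain why guidance's
# byte-position bookkeeping drifts and the assertion fires.
probe = '<|im_start|>user\nHello, who are you?<|im_end|>'

# Native llama.cpp tokenization (the special kwarg is assumed to exist in
# this llama-cpp-python version).
native_ids = llama.tokenize(probe.encode('utf-8'), add_bos = False, special = True)

# Substituted Transformers tokenization via the custom wrapper above.
custom_ids = tokenizer.encode(probe, add_bos = False)

print('native ids :', native_ids)
print('custom ids :', custom_ids)

# Round-trip through llama.cpp's detokenizer; if the custom IDs do not map
# back to the original bytes, the byte positions guidance tracks will not
# line up with the prompt.
print('native bytes:', llama.detokenize(native_ids))
print('custom bytes:', llama.detokenize(custom_ids))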

Additional Context

The code uses the latest version available in this repository, which is why the guidance version is listed as 0.1.6 even though that release is not out yet.

grimavatar closed this as not planned on Dec 12, 2023.