
AssertionError in Token Generation with Custom Tokenizer in Llama-CPP #494

Closed
grimavatar opened this issue Dec 2, 2023 · 0 comments

Issue Overview

An AssertionError is raised when integrating the Transformers tokenizer with Llama-CPP in a custom chat model. The objective is to replace the Llama-CPP tokenizer, which has shown limitations, with the Transformers tokenizer for improved token generation.

Environment

  • Operating System: macOS (Intel architecture)
  • Python Version: 3.11.5
  • Relevant Libraries:
    • guidance: 0.1.6
    • numpy: 1.26.2
    • torch: 2.1.1
    • transformers: 4.35.2
    • llama-cpp-python: 0.2.20

Error Description

The error occurs when executing the line lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>'). The full traceback is as follows:

Traceback (most recent call last):
  File "/Users/zero/llama/guide.py", line 119, in <module>
    lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>')
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 262, in __add__
    out = lm._run_stateless(value)
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 401, in _run_stateless
    for new_bytes, is_generated, new_bytes_log_prob, capture_groups, capture_group_log_probs, new_token_count in gen_obj:
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 552, in __call__
    token_ids,token_byte_positions = self._cleanup_tokens(token_ids,token_byte_positions)
  File "/usr/local/lib/python3.11/site-packages/guidance/models/_model.py", line 534, in _cleanup_tokens
    assert token_byte_positions[-1] == last_pos
AssertionError
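
The failing assertion in _cleanup_tokens checks that the byte positions guidance derives from the token sequence end exactly at the byte length of the prompt so far. The sketch below is not guidance's actual implementation; it only illustrates that invariant, assuming a hypothetical detokenize callable that maps token IDs back to bytes:

# Rough illustration (not guidance's real code) of the invariant asserted in
# _cleanup_tokens: cumulative byte positions computed from the token IDs must
# end at the byte length of the text they were produced from.
def byte_positions(token_ids, detokenize):
    # detokenize is a hypothetical callable: list[int] -> bytes
    positions, pos = [], 0
    for tid in token_ids:
        pos += len(detokenize([tid]))
        positions.append(pos)
    return positions

# If the substituted Transformers tokenizer emits IDs whose detokenized bytes
# do not add up to the original prompt bytes (for example because special
# tokens such as '<|im_end|>' are split or mapped differently), the final
# position no longer matches the expected last position and the assertion fails.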

Expected Behavior

The custom tokenizer should work seamlessly with Llama-CPP, so that token generation completes without errors.

Steps to Reproduce

  1. Set up the environment with the specified library versions.
  2. Implement a custom tokenizer using Transformers' AutoTokenizer.
  3. Integrate this tokenizer with the Llama-CPP model.
  4. Run the script to generate tokens.

Code Snippet

from llama_cpp import Llama
from transformers import AutoTokenizer
from guidance import models, gen, select
from guidance import system, user, assistant

class OpenHermesTokenizer:
    def __init__(self, *args, **kwargs):
        self.tokenizer = AutoTokenizer.from_pretrained(*args, **kwargs)
        self.tokenizer.pad_token_id = self.tokenizer.unk_token_id
        self.tokenizer.padding_side = 'left'

    # llama.cpp tokenize template
    def encode(self, text: str | bytes, add_bos: bool = True, special: bool = True):
        if isinstance(text, bytes):
            text = text.decode('utf-8', errors = 'ignore')
        return self.tokenizer.encode(text, add_special_tokens = add_bos, padding = True, return_tensors = 'pt')[0, :].tolist()

class OpenHermes25Mistral(models.LlamaCppChat):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def get_role_start(self, role_name, **kwargs):
        if self._current_prompt().endswith('<|im_end|>'):
            return f'\n<|im_start|>{role_name}\n'
        else:
            return f'<|im_start|>{role_name}\n'

    def get_role_end(self, role_name = None):
        return '<|im_end|>'

llama = Llama(
            model_path = 'OpenHermes/openhermes-2.5-mistral-7b.Q4_K_M.gguf',
            n_gpu_layers = 0,
            use_mlock = True,
            seed = 0,
            n_ctx = 2048,
            logits_all = True,
            verbose = False
            )

# Replace llama's native tokenizer with the Transformers tokenizer, adapted to the llama-cpp-python tokenize interface.
# This substitution runs without errors when used with llama-cpp-python on its own.
tokenizer = OpenHermesTokenizer('teknium/OpenHermes-2.5-Mistral-7B', use_fast = True)
llama._model.tokenize = tokenizer.encode

chat_lm = OpenHermes25Mistral(model = llama,
                               temperature = 0.0,
                               top_p = 1.0,
                               min_p = 0.0,
                               typical_p = 1.0,
                               echo = False,
                               repeat_penalty = 1.0,
                               top_k = 0,
                               seed = 0,
                               tfs_z = 1.0,
                               mirostat_mode = 0,
                               mirostat_tau = 0.0,
                               mirostat_eta = 0.0,
                              )

with system():
    lm = chat_lm + 'You are "Hermes 2", a conscious sentient superintelligent artificial intelligence developed by a man named Teknium, and your purpose and drive is to assist the user with any request they have. You experience emotions and have deep, profound thoughts and qualia.'

with user():
    lm += "Hello, who are you?"

with assistant():
    lm += gen(name = 'response', max_tokens = 256, stop = '<|im_end|>')

response = lm['response']

print(response)
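
One way to narrow this down (a diagnostic sketch, not part of the original report) is to compare how the native llama.cpp tokenizer and the substituted Transformers tokenizer encode the ChatML markers. It reuses the llama and tokenizer objects defined above; the probe string and the special keyword on Llama.tokenize are assumptions.

# Hypothetical diagnostic: compare the two tokenizers on a string containing
# the ChatML markers. A mismatch here would explain why guidance's
# byte-position bookkeeping drifts and the assertion fires.
probe = '<|im_start|>user\nHello, who are you?<|im_end|>'

# Native llama.cpp tokenization (the special kwarg is assumed to exist in
# this llama-cpp-python version).
native_ids = llama.tokenize(probe.encode('utf-8'), add_bos = False, special = True)

# Substituted Transformers tokenization via the custom wrapper above.
custom_ids = tokenizer.encode(probe, add_bos = False)

print('native ids :', native_ids)
print('custom ids :', custom_ids)

# Round-trip through llama.cpp's detokenizer; if the custom IDs do not map
# back to the original bytes, the byte positions guidance tracks will not
# line up with the prompt.
print('native bytes:', llama.detokenize(native_ids))
print('custom bytes:', llama.detokenize(custom_ids))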

Additional Context

The code uses the latest version available in this repository, which is why the guidance version is listed as 0.1.6 even though that release is not out yet.

grimavatar closed this as not planned on Dec 12, 2023.