vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

Open · wants to merge 1 commit into master
Conversation

kylo5aby
Contributor

@kylo5aby kylo5aby commented Sep 12, 2024

ref #9369

This PR mainly does the following (a rough sketch of the resulting structure follows the list):

  • create the tokenizer object once, right after the llama_vocab is initialized
  • move the initialization of each tokenizer's mutable data into llama_tokenize_internal, for thread safety
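A minimal sketch of that split, using simplified placeholder types (the real llm_tokenizer and llm_tokenizer_data hold the actual vocab tables and working buffers):

#include <string>
#include <vector>

struct llm_tokenizer {
    // fixed, shared data (merge ranks, token maps, ...);
    // built once, right after the llama_vocab is initialized
};

struct llm_tokenizer_data {
    // mutable working data; one instance per tokenize call
    std::vector<int> work_buffer;
};

std::vector<int> llama_tokenize_internal(const llm_tokenizer & tokenizer, const std::string & text) {
    llm_tokenizer_data data;   // fresh per-call state, so concurrent calls do not race
    std::vector<int> output;
    // ... run the tokenization algorithm, reading from `tokenizer` and
    //     using only `data` as scratch space, appending token ids to `output` ...
    return output;
}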

@kylo5aby kylo5aby changed the title refactor tokenizer to reduce the overhead of creating multi times tokenizer vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer Sep 12, 2024
Collaborator

@ngxson ngxson left a comment


This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data stores the working data and llm_tokenizer stores the "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, nothing prevents it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be (a sketch of this layout follows the list):

  • Move the tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-lived)
  • Have tokenize() take a const llm_tokenizer * as an argument. The const makes sure that the shared object stays read-only.
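Roughly, the suggested layout (a hedged sketch; the member types mirror the ones discussed below and are for illustration only):

struct llm_tokenizer_session {
    // short-lived; owns all mutable working data
    std::vector<llm_symbol> symbols;

    // the const pointer lets the compiler enforce that the shared
    // tokenizer object is never mutated from inside tokenize()
    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output);
};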

@kylo5aby
Contributor Author

> This technically works, but IMO we could improve it a bit:
>
> Currently, llm_tokenizer_data stores the working data and llm_tokenizer stores the "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, nothing prevents it from mutating data inside llm_tokenizer (which is not desirable).
>
> A better solution would be:
>
> • Move the tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-lived)
> • Have tokenize() take a const llm_tokenizer * as an argument. The const makes sure that the shared object stays read-only.

@ngxson Thanks for your feedback. Which approach do you think is better?

  1. Declare tokenize in llm_tokenizer_session, pass a const llm_tokenizer * tokenizer, and call the tokenizer's related method internally. The logic implementation (e.g. tokenize, append) would still belong to each llm_tokenizer. For example:

struct llm_tokenizer_bpe_session : llm_tokenizer_session {
    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
        tokenizer->tokenize(text, *this, output);
    }

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;
    llm_bigram_bpe::queue work_queue;
};

  2. Keep the structures as in the current PR, and declare all methods of every llm_tokenizer as const, so the shared data in each llm_tokenizer cannot be modified. This would be the smaller change.

  3. Move all the logic implementation from each llm_tokenizer to llm_tokenizer_session; llm_tokenizer would then contain only shared data, and every use of the tokenize operations would require creating a llm_tokenizer_session first, which may be a breaking change.

@ngxson
Collaborator

ngxson commented Sep 14, 2024

The 3rd option should be the proper way. The idea is:

  • llm_tokenizer contains only fixed/shared data. It is initialized with the model and stored in model.tokenizer
  • Functions using the tokenizer, like llama_tokenize_impl, llama_token_to_piece_impl, and llama_detokenize_impl, take const llm_tokenizer & as input. They no longer take a const vocab as a param, because the vocab is already inside the const llm_tokenizer

> every use of the tokenize operations would require creating a llm_tokenizer_session first, which may be a breaking change.

I don't see why it's a breaking change compared to what you're currently doing with llm_tokenizer_data. Every time the tokenizer is used, a new llm_tokenizer_session must be created.

The constructor of the session will look like: llm_tokenizer_session(const llm_tokenizer & tokenizer)
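Put together, a sketch of what the session type could look like under this design (the members and the tokenize signature are illustrative, not the final API):

struct llm_tokenizer_session {
    llm_tokenizer_session(const llm_tokenizer & tokenizer) : tokenizer(tokenizer) {}

    const llm_tokenizer & tokenizer;   // shared, read-only
    // per-call mutable buffers (symbols, work queues, ...) live here

    // hypothetical entry point:
    // void tokenize(const std::string & text, std::vector<llama_vocab::id> & output);
};

// usage: one short-lived session per tokenize call
// llm_tokenizer_session session(*model.tokenizer);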

src/llama-vocab.h (outdated, resolved)
tokenizer = new llm_tokenizer_rwkv(vocab);
break;
default:
GGML_ABORT("fatal error");
Collaborator


Suggested change:
- GGML_ABORT("fatal error");
+ GGML_ABORT("unknown vocab type");

@@ -1530,6 +1570,32 @@ std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab,
return output;
}

llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
Collaborator


create would be a more appropriate name:

Suggested change:
- llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
+ llm_tokenizer * llama_create_tokenizer(const llama_vocab & vocab) {


switch (vocab.type) {
case LLAMA_VOCAB_TYPE_SPM:
tokenizer = new llm_tokenizer_spm(vocab);
Collaborator


Suggested change
tokenizer = new llm_tokenizer_spm(vocab);
tokenizer = new llm_tokenizer_spm(vocab);

Owner

@ggerganov ggerganov left a comment


@kylo5aby Are you interested in adding a test that accepts a vocab file (see ./models/ggml-vocab*.gguf) and tokenizes random strings in parallel on multiple threads? The test-log test can be used as a starting point. The goal is to run it through the thread sanitizer in order to guarantee thread safety of the tokenization API.
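A minimal sketch of what such a test could look like (model loading and error handling mostly elided; it assumes the llama_tokenize convenience wrapper from common.h and a vocab-only model loaded from argv[1]):

#include <string>
#include <thread>
#include <vector>

#include "common.h"

static void tokenize_loop(const llama_model * model, const std::string & text) {
    // hammer the tokenizer from this thread; under the thread sanitizer,
    // any unsynchronized mutation of shared tokenizer state shows up as a race
    for (int i = 0; i < 1000; ++i) {
        std::vector<llama_token> tokens = llama_tokenize(model, text, /*add_special*/ true);
    }
}

int main(int argc, char ** argv) {
    llama_backend_init();

    // load only the vocab, e.g. from ./models/ggml-vocab-llama-bpe.gguf
    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;
    llama_model * model = llama_load_model_from_file(argv[1], mparams);

    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t) {
        workers.emplace_back(tokenize_loop, model, "some test string " + std::to_string(t));
    }
    for (auto & w : workers) {
        w.join();
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}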

- struct llm_tokenizer_wpm {
-     llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {}
+ struct llm_tokenizer_wpm : llm_tokenizer {
+     llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
Owner


Suggested change:
- llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
+ llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}

- struct llm_tokenizer_bpe {
-     llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) {
+ struct llm_tokenizer_bpe : llm_tokenizer {
+     llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
Owner


Suggested change:
- llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
+ llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {

Comment on lines +87 to +105
# build test-tokenizer-parallel target once and add many tests
add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp)
target_link_libraries(test-tokenizer-parallel PRIVATE common)
install(TARGETS test-tokenizer-parallel RUNTIME)

llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf)

Owner


Let's improve this a bit further. Looking at your changes, I realized we don't need to create a separate test: we can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp; you just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end, which remains single-threaded.

Contributor Author


> Let's improve this a bit further. Looking at your changes, I realized we don't need to create a separate test: we can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp; you just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end, which remains single-threaded.

How about introducing a mutex so the results are printed in order?
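For reference, the store-then-print approach suggested above avoids the mutex entirely: each thread fills its own result slot, and the main thread prints everything after joining, so the output order is deterministic rather than merely serialized. A sketch with placeholder results:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

int main() {
    const int n_threads = 4;
    std::vector<std::string> results(n_threads);   // one slot per thread; no shared writes

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([t, &results]() {
            // each thread tokenizes its inputs and records what it would have printed
            results[t] = "result of thread " + std::to_string(t);
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    // printing happens single-threaded after the join, so the order is stable
    for (int t = 0; t < n_threads; ++t) {
        printf("%s\n", results[t].c_str());
    }
    return 0;
}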

@github-actions github-actions bot added the testing Everything test related label Sep 20, 2024
Labels
testing Everything test related