vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer #9449

Open · wants to merge 1 commit into master
Conversation

kylo5aby
Contributor

@kylo5aby kylo5aby commented Sep 12, 2024

ref #9369

This PR mainly does the following (a rough sketch of the resulting structure follows the list):

  • create the tokenizer object once, right after the llama_vocab is initialized
  • move the initialization of each tokenizer's mutable data into llama_tokenize_internal, for thread safety
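A minimal sketch of that split, using simplified placeholder types (the real llm_tokenizer and llm_tokenizer_data hold the actual vocab tables and working buffers):

#include <string>
#include <vector>

struct llm_tokenizer {
    // fixed, shared data (merge ranks, token maps, ...);
    // built once, right after the llama_vocab is initialized
};

struct llm_tokenizer_data {
    // mutable working data; one instance per tokenize call
    std::vector<int> work_buffer;
};

std::vector<int> llama_tokenize_internal(const llm_tokenizer & tokenizer, const std::string & text) {
    llm_tokenizer_data data;   // fresh per-call state, so concurrent calls do not race
    std::vector<int> output;
    // ... run the tokenization algorithm, reading from `tokenizer` and
    //     using only `data` as scratch space, appending token ids to `output` ...
    return output;
}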

@kylo5aby kylo5aby changed the title refactor tokenizer to reduce the overhead of creating multi times tokenizer vocab: refactor tokenizer to reduce the overhead of creating multi times tokenizer Sep 12, 2024
Collaborator

@ngxson ngxson left a comment


This technically works, but IMO we could improve it a bit:

Currently, llm_tokenizer_data stores the working data and llm_tokenizer stores the "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, nothing prevents it from mutating data inside llm_tokenizer (which is not desirable).

A better solution would be (a sketch of this layout follows the list):

  • Move the tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-lived)
  • Have tokenize() take a const llm_tokenizer * as an argument. The const makes sure that the shared object stays read-only.
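Roughly, the suggested layout (a hedged sketch; the member types mirror the ones discussed below and are for illustration only):

struct llm_tokenizer_session {
    // short-lived; owns all mutable working data
    std::vector<llm_symbol> symbols;

    // the const pointer lets the compiler enforce that the shared
    // tokenizer object is never mutated from inside tokenize()
    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output);
};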

@kylo5aby
Contributor Author

> This technically works, but IMO we could improve it a bit:
>
> Currently, llm_tokenizer_data stores the working data and llm_tokenizer stores the "fixed" shared data. However, because tokenize() is still inside llm_tokenizer, nothing prevents it from mutating data inside llm_tokenizer (which is not desirable).
>
> A better solution would be:
>
> • Move the tokenize() function to llm_tokenizer_data (maybe change the class name to llm_tokenizer_session to reflect that the object is short-lived)
> • Have tokenize() take a const llm_tokenizer * as an argument. The const makes sure that the shared object stays read-only.

@ngxson Thanks for your feedback. Which approach do you think is better?

  1. Declare tokenize in llm_tokenizer_session, pass a const llm_tokenizer * tokenizer, and call the tokenizer's related method internally. The logic implementation (e.g. tokenize, append) would still belong to each llm_tokenizer. For example:

struct llm_tokenizer_bpe_session : llm_tokenizer_session {
    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
        tokenizer->tokenize(text, *this, output);
    }

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;
    llm_bigram_bpe::queue work_queue;
};

  2. Keep the structures as in the current PR, and declare all methods of every llm_tokenizer as const, so the shared data in each llm_tokenizer cannot be modified. This would be the smaller change.

  3. Move all the logic implementation from each llm_tokenizer to llm_tokenizer_session; llm_tokenizer would then contain only shared data, and every use of the tokenize operations would require creating a llm_tokenizer_session first, which may be a breaking change.

@ngxson
Collaborator

ngxson commented Sep 14, 2024

The 3rd option should be the proper way. The idea is:

  • llm_tokenizer contains only fixed/shared data. It is initialized with the model and stored in model.tokenizer
  • Functions using the tokenizer, like llama_tokenize_impl, llama_token_to_piece_impl, and llama_detokenize_impl, take const llm_tokenizer & as input. They no longer take a const vocab as a param, because the vocab is already inside the const llm_tokenizer

> every use of the tokenize operations would require creating a llm_tokenizer_session first, which may be a breaking change.

I don't see why it's a breaking change compared to what you're currently doing with llm_tokenizer_data. Every time the tokenizer is used, a new llm_tokenizer_session must be created.

The constructor of the session will look like: llm_tokenizer_session(const llm_tokenizer & tokenizer)
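Put together, a sketch of what the session type could look like under this design (the members and the tokenize signature are illustrative, not the final API):

struct llm_tokenizer_session {
    llm_tokenizer_session(const llm_tokenizer & tokenizer) : tokenizer(tokenizer) {}

    const llm_tokenizer & tokenizer;   // shared, read-only
    // per-call mutable buffers (symbols, work queues, ...) live here

    // hypothetical entry point:
    // void tokenize(const std::string & text, std::vector<llama_vocab::id> & output);
};

// usage: one short-lived session per tokenize call
// llm_tokenizer_session session(*model.tokenizer);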

src/llama-vocab.h (outdated, resolved)
tokenizer = new llm_tokenizer_rwkv(vocab);
break;
default:
GGML_ABORT("fatal error");
Collaborator


Suggested change:
- GGML_ABORT("fatal error");
+ GGML_ABORT("unknown vocab type");

@@ -1530,6 +1570,32 @@ std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab,
return output;
}

llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
Collaborator


create would be a more appropriate name:

Suggested change:
- llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
+ llm_tokenizer * llama_create_tokenizer(const llama_vocab & vocab) {


switch (vocab.type) {
case LLAMA_VOCAB_TYPE_SPM:
tokenizer = new llm_tokenizer_spm(vocab);
Collaborator


Suggested change
tokenizer = new llm_tokenizer_spm(vocab);
tokenizer = new llm_tokenizer_spm(vocab);

Owner

@ggerganov ggerganov left a comment


@kylo5aby Are you interested in adding a test that accepts a vocab file (see ./models/ggml-vocab*.gguf) and tokenizes random strings in parallel on multiple threads? The test-log test can be used as a starting point. The goal is to run it through the thread sanitizer in order to guarantee thread safety of the tokenization API.
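A minimal sketch of what such a test could look like (model loading and error handling mostly elided; it assumes the llama_tokenize convenience wrapper from common.h and a vocab-only model loaded from argv[1]):

#include <string>
#include <thread>
#include <vector>

#include "common.h"

static void tokenize_loop(const llama_model * model, const std::string & text) {
    // hammer the tokenizer from this thread; under the thread sanitizer,
    // any unsynchronized mutation of shared tokenizer state shows up as a race
    for (int i = 0; i < 1000; ++i) {
        std::vector<llama_token> tokens = llama_tokenize(model, text, /*add_special*/ true);
    }
}

int main(int argc, char ** argv) {
    llama_backend_init();

    // load only the vocab, e.g. from ./models/ggml-vocab-llama-bpe.gguf
    llama_model_params mparams = llama_model_default_params();
    mparams.vocab_only = true;
    llama_model * model = llama_load_model_from_file(argv[1], mparams);

    std::vector<std::thread> workers;
    for (int t = 0; t < 8; ++t) {
        workers.emplace_back(tokenize_loop, model, "some test string " + std::to_string(t));
    }
    for (auto & w : workers) {
        w.join();
    }

    llama_free_model(model);
    llama_backend_free();
    return 0;
}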

- struct llm_tokenizer_wpm {
-     llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {}
+ struct llm_tokenizer_wpm : llm_tokenizer {
+     llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
Owner


Suggested change:
- llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
+ llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}

- struct llm_tokenizer_bpe {
-     llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) {
+ struct llm_tokenizer_bpe : llm_tokenizer {
+     llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
Owner


Suggested change:
- llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
+ llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {

Comment on lines +87 to +105
# build test-tokenizer-parallel target once and add many tests
add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp)
target_link_libraries(test-tokenizer-parallel PRIVATE common)
install(TARGETS test-tokenizer-parallel RUNTIME)

llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf)

Owner


Let's improve this a bit further. Looking at your changes, I realized we don't need to create a separate test: we can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp; you just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end, which remains single-threaded.

Contributor Author


> Let's improve this a bit further. Looking at your changes, I realized we don't need to create a separate test: we can simply extend the existing test-tokenizer-0 to become multi-threaded. You've pretty much done it in test-tokenizer-parallel.cpp; you just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end, which remains single-threaded.

How about introducing a mutex so the results are printed in order?
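For reference, the store-then-print approach suggested above avoids the mutex entirely: each thread fills its own result slot, and the main thread prints everything after joining, so the output order is deterministic rather than merely serialized. A sketch with placeholder results:

#include <cstdio>
#include <string>
#include <thread>
#include <vector>

int main() {
    const int n_threads = 4;
    std::vector<std::string> results(n_threads);   // one slot per thread; no shared writes

    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([t, &results]() {
            // each thread tokenizes its inputs and records what it would have printed
            results[t] = "result of thread " + std::to_string(t);
        });
    }
    for (auto & w : workers) {
        w.join();
    }

    // printing happens single-threaded after the join, so the order is stable
    for (int t = 0; t < n_threads; ++t) {
        printf("%s\n", results[t].c_str());
    }
    return 0;
}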

@github-actions github-actions bot added the testing Everything test related label Sep 20, 2024
Labels
testing Everything test related