vocab: refactor tokenizer to reduce the overhead of creating the tokenizer multiple times #9449
base: master
Conversation
Force-pushed from `5447152` to `9ce57e0`
This technically works, but IMO we could improve it a bit:

Currently, `llm_tokenizer_data` is used to store working data and `llm_tokenizer` to store "fixed" shared data. However, because `tokenize()` is still inside `llm_tokenizer`, there is nothing preventing it from mutating data inside `llm_tokenizer` (which is not desirable).

A better solution would be:
- Move the `tokenize()` function to `llm_tokenizer_data` (maybe change the class name to `llm_tokenizer_session` to reflect that the object is short-lived)
- Have `tokenize()` take `const llm_tokenizer *` as an argument. The `const` is to make sure that the shared object is read-only.
@ngxson Thanks for your feedback. Which approach do you think is better:

```cpp
struct llm_tokenizer_bpe_session : llm_tokenizer_session {
    void tokenize(const llm_tokenizer * tokenizer, const std::string & text, std::vector<llama_vocab::id> & output) {
        tokenizer->tokenize(text, *this, output);
    }

    std::vector<llm_symbol> symbols;
    std::vector<llm_symbol> symbols_final;
    llm_bigram_bpe::queue work_queue;
};
```
The 3rd option should be the proper way. The idea is:

I don't see why it's a breaking change compared to what you're currently doing with

The constructor of the session will look like:
Force-pushed from `9ce57e0` to `0166d83`

Force-pushed from `0166d83` to `5abee9c`
```cpp
            tokenizer = new llm_tokenizer_rwkv(vocab);
            break;
        default:
            GGML_ABORT("fatal error");
```
```diff
-            GGML_ABORT("fatal error");
+            GGML_ABORT("unknown vocab type");
```
`src/llama-vocab.cpp` (outdated)

```diff
@@ -1530,6 +1570,32 @@ std::vector<llama_vocab::id> llama_tokenize_internal(const llama_vocab & vocab,
     return output;
 }

+llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
```
`create` would be a more appropriate name:

```diff
-llm_tokenizer * llama_get_tokenizer(const llama_vocab & vocab) {
+llm_tokenizer * llama_create_tokenizer(const llama_vocab & vocab) {
```
`src/llama-vocab.cpp` (outdated)

```cpp
    switch (vocab.type) {
        case LLAMA_VOCAB_TYPE_SPM:
            tokenizer = new llm_tokenizer_spm(vocab);
```
@kylo5aby Are you interested in adding a test that accepts a vocab file (see `./models/ggml-vocab*.gguf`) and tokenizes random strings in parallel on multiple threads? The `test-log` test can be used as a starting point. The goal is to run it through thread sanitizer in order to guarantee thread-safety of the tokenization API.
src/llama-vocab.cpp
Outdated
struct llm_tokenizer_wpm { | ||
llm_tokenizer_wpm(const llama_vocab & vocab): vocab(vocab) {} | ||
struct llm_tokenizer_wpm : llm_tokenizer { | ||
llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {} |
```diff
-    llm_tokenizer_wpm(const llama_vocab & vocab): llm_tokenizer(vocab) {}
+    llm_tokenizer_wpm(const llama_vocab & vocab) : llm_tokenizer(vocab) {}
```
src/llama-vocab.cpp
Outdated
struct llm_tokenizer_bpe { | ||
llm_tokenizer_bpe(const llama_vocab & vocab): vocab(vocab) { | ||
struct llm_tokenizer_bpe : llm_tokenizer { | ||
llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) { |
```diff
-    llm_tokenizer_bpe(const llama_vocab & vocab): llm_tokenizer(vocab) {
+    llm_tokenizer_bpe(const llama_vocab & vocab) : llm_tokenizer(vocab) {
```
Force-pushed from `5abee9c` to `d949c58`
```cmake
# build test-tokenizer-parallel target once and add many tests
add_executable(test-tokenizer-parallel test-tokenizer-parallel.cpp)
target_link_libraries(test-tokenizer-parallel PRIVATE common)
install(TARGETS test-tokenizer-parallel RUNTIME)

llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-bert-bge ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-bert-bge.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-command-r ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-command-r.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-coder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-coder.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-deepseek-llm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-deepseek-llm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-falcon ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-falcon.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-gpt-2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-gpt-2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-bpe ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-bpe.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-llama-spm ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-llama-spm.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-mpt ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-mpt.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-phi-3 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-phi-3.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-qwen2 ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-qwen2.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-refact ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-refact.gguf)
llama_test(test-tokenizer-parallel NAME test-tokenizer-parallel-starcoder ARGS ${CMAKE_CURRENT_SOURCE_DIR}/../models/ggml-vocab-starcoder.gguf)
```
Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing `test-tokenizer-0` to become multi-threaded. You've pretty much done it in `test-tokenizer-parallel.cpp`, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.
> Let's improve this a bit further - looking at your changes I realized we don't need to create a separate test. We can simply extend the existing `test-tokenizer-0` to become multi-threaded. You've pretty much done it in `test-tokenizer-parallel.cpp`, but just need to store the results and print them to stdout/stderr after joining the threads. We also want to keep support for optional file tokenization at the end - this remains single-threaded.
How about introducing a `mutex` to make the printing of results orderly?
ref #9369

The PR mainly does:
- refactor `llama_vocab`
- make `llama_tokenize_internal` thread-safe