
Deepseek coder merge #5464

Closed · wants to merge 2 commits

Conversation

jaggzh (Contributor) commented Feb 12, 2024

...the long, drawn-out PR at:
#4070

"Update gpt2 preprocess and add deepseek coder preprocess"

I went ahead and merged it, fixing their whitespace issue (I think) that was holding up acceptance of the PR, and manually resolving the conflicts resulting from their fork being over 400 commits behind master. I tested it (just running magicoder -- a model needing the deepseek_coder tokenizer) and "it works", but... I hope I did everything right in the merge. :)

@@ -211,6 +213,59 @@ def from_model_architecture(model_architecture):
            return MiniCPMModel
        if model_architecture == "BertModel":
            return BertModel

    @staticmethod
    def from_model_name(model_name: str):
Collaborator:

Was this ever used? And why is this function duplicated below?

jaggzh (Contributor, Author):

> Was this ever used? And why is this function duplicated below?

I'm not sure. It looks like a convenient mapping function if someone needs it in future convert-hf-to-gguf work. (The duplication is likely my fault, of course.)
Should I get rid of it, or comment it out? (Remember, I'm mostly just merging over their contributed deepseek/hf tokenizer code blindly -- although it does work, and it resolves that out-of-range error too.) @ggerganov

Let me know -- I'll also have to rebase and resubmit.

cebtenzzre (Collaborator) commented Mar 1, 2024

I just checked - it looks like ggerganov accidentally dropped this in d24da31 (#4070). It's apparently used for forcing the model used via the command line? This will really be out of place after #5825; it should probably just be removed.

cebtenzzre (Collaborator) commented Mar 1, 2024

I see now - DeepseekCoderModel and DeepseekLLMModel can't be disambiguated from the model architecture alone. This should be changed so that they use a single class that derives tokenizer_model from either the model's config, or the command-line arguments if it really is a user choice.

It's honestly not clear to me why LlamaForCausalLM is referenced at all in convert-hf-to-gguf.py - convert.py is already capable of dealing with a llama model with a non-SPM tokenizer, and has superior memory management (so it's faster).

Comment on lines +3233 to +3242
} else if (tokenizer_name == "bert") {
    vocab.type = LLAMA_VOCAB_TYPE_WPM;

    // default special tokens
    vocab.special_bos_id = 101;
    vocab.special_eos_id = 102;
    vocab.special_unk_id = 100;
    vocab.special_sep_id = -1;
    vocab.special_pad_id = -1;
    vocab.add_space_prefix = false;
Collaborator:

tabs -> spaces

jaggzh (Contributor, Author) commented Mar 2, 2024

Looking further into it, the whitespace and the extra function are minor issues compared to the rewrite of the tokenizer (e.g. @ggerganov's point: "This is a big change in the tokenizer and I'm a little worried it might break something.")
I'm not really the person to assist much in evaluating that, either.

ggerganov (Owner):

It would be great to wrap this up and add the Deepseek models, but someone has to look carefully into the regex preprocessing changes.

IIUC, this change takes a more general approach to tokenization pre-processing by applying regexes with std::regex. The regex data is generated via some 3rd-party tool (see #4070). This is in contrast to the old way, where we had a custom implementation of one specific BPE regex. The advantage of the latter is that it is fast; however, it is hard to extend to other regexes.
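
For concreteness, here is a minimal sketch (not the PR's actual code) of what std::regex-based pre-tokenization looks like: the text is split into chunks by repeatedly matching a pre-tokenization pattern, and each chunk is then BPE-encoded on its own. The pattern below is a simplified ASCII-only stand-in, since std::regex has no Unicode property classes; the real expressions cover Unicode categories and are generated offline (see #4070).

```cpp
#include <cstdio>
#include <regex>
#include <string>
#include <vector>

// Sketch only: split text into pre-tokenization chunks with std::regex.
// Simplified ASCII stand-in for a BPE pre-tokenization pattern:
// contractions, optionally space-prefixed word/number runs, punctuation, whitespace.
static std::vector<std::string> regex_pretokenize(const std::string & text) {
    static const std::regex pattern(
        "'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\\s]+|\\s+");

    std::vector<std::string> chunks;
    for (auto it = std::sregex_iterator(text.begin(), text.end(), pattern);
         it != std::sregex_iterator(); ++it) {
        chunks.push_back(it->str()); // each chunk is later BPE-merged separately
    }
    return chunks;
}

int main() {
    for (const auto & chunk : regex_pretokenize("def main():\n    print('hello 123')")) {
        printf("[%s]\n", chunk.c_str());
    }
}
```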

There is also some tokenization work done in #5613 which should be considered.

I think the best thing to do is:

  • Refactor unicode.h (and potentially unicode.cpp) so that it supports std::regex as proposed here, but can fall back to custom optimized implementations when available (see the sketch after this list)
  • Add tools and/or instructions in the repo for re-creating the regexes using the https://github.com/Genivia/RE-flex tool (or something else if appropriate)
  • Make it clear in the model loading logs which regex is being used
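
To illustrate the first bullet, here is a hypothetical sketch of the dispatch (the names are illustrative, not the actual unicode.h API): look up a hand-optimized splitter for the requested regex, and fall back to a generic std::regex path when none is registered.

```cpp
#include <functional>
#include <map>
#include <regex>
#include <string>
#include <vector>

using split_fn = std::function<std::vector<std::string>(const std::string &)>;

// Hand-written, fast splitters keyed by a regex identifier (illustrative only).
static const std::map<std::string, split_fn> k_custom_splitters = {
    // {"gpt2", &bpe_gpt2_preprocess}, // optimized custom path, when available
};

static std::vector<std::string> unicode_regex_split(
        const std::string & text, const std::string & regex_id, const std::string & regex_expr) {
    auto it = k_custom_splitters.find(regex_id);
    if (it != k_custom_splitters.end()) {
        return it->second(text); // fast, custom implementation
    }
    // Generic fallback: std::regex over the generated expression
    // (this is also where the loading logs could report which path is taken).
    std::vector<std::string> chunks;
    const std::regex re(regex_expr);
    for (auto m = std::sregex_iterator(text.begin(), text.end(), re);
         m != std::sregex_iterator(); ++m) {
        chunks.push_back(m->str());
    }
    return chunks;
}
```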

Since this might take more time to implement, we can revert bpe_gpt2_preprocess() to the original implementation before this PR in order to keep things as they are (even if not 100% correct in certain cases). After we merge this PR, we can start working on the above.

Anyone interested in helping out with this?

ggerganov added the "help wanted" (Extra attention is needed) and "good first issue" (Good for newcomers) labels on Mar 2, 2024
DOGEwbx commented Mar 4, 2024

I'm really happy to see everyone working hard on implementing the pretokenize mechanism. I apologize for not addressing the related issues sooner; I've been busy with other matters recently. One issue I'd like to mention is that in my original implementation #4070, I used wcregex to speed up regex matching. However, the dependency on wchar, which has different default data types for compilers on Unix and Mac/Windows, remains unresolved, so I think it only works correctly on Unix right now. This is mainly because I lack experience with cross-platform C++ compilation. I'm hoping someone can help out with this.

jaggzh (Contributor, Author) commented Mar 4, 2024 via email

ggerganov (Owner):

Adding an option will become too messy.

The old tokenizer is not necessarily wrong; it's just that it implements one of the many different BPE pre-tokenization regexes:

https://github.com/openai/tiktoken/blob/main/tiktoken_ext/openai_public.py

At least this is how I understand it.
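
For reference, the GPT-2 pre-tokenization pattern (paraphrased from memory from the original GPT-2 encoder, so treat the exact form as approximate) looks roughly like the constant below. The \p{L}/\p{N} Unicode property classes are precisely what std::regex cannot express directly, which is why the PR relies on generated expressions.

```cpp
// Approximate GPT-2 BPE pre-tokenization pattern: contractions, optionally
// space-prefixed letter runs and digit runs, punctuation runs, and whitespace.
static const char * k_gpt2_pretok_regex =
    R"('s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+)";
```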

Regarding the compatibility with Mac/Windows - I don't know enough, but if it can be fixed by going to the slower std::regex version (the one that was proposed before wcregex) then we can do that. But again, we want to do that just for the new Deepseek models.

jaggzh (Contributor, Author) commented Mar 5, 2024

@DOGEwbx
So, all I did was merge your older work by hand (due to all the conflicts from it being so far behind master). Are you up to modifying it (either this PR, or redoing the merge in yours -- it wasn't that complex) to add the automatic use of the newer tokenization/regex for the Deepseek models?

Note: I have no say in this project, so don't take that as me implying it's a path to getting the patch finalized and approved. :}}

ggerganov (Owner):

Superseded by #6920

ggerganov closed this on May 7, 2024
Labels: good first issue, help wanted
4 participants