
llama : improve infill support #9798

Open · wants to merge 2 commits into master

Conversation

ggerganov (Owner) commented on Oct 9, 2024

Separating some of the changes from #9787 into this PR for clarity:

  • libllama now auto-detects FIM tokens based on their text pieces
  • add PAD, REPO and SEP FIM tokens
  • deprecate the old llama_token_prefix, llama_token_suffix and llama_token_middle API in favor of the new llama_token_fim_ API (a usage sketch follows after this list)
  • llama-server now correctly tokenizes the prefix and suffix prompts by not parsing special tokens (/infill endpoint)
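
For illustration, a minimal sketch of how an application could query the new getters. The exact signatures and the "negative value means unknown" convention are assumptions modeled on the existing llama_token_* accessors, not taken from the PR diff:

```cpp
#include "llama.h"

#include <cstdio>

// Sketch only: assumes the llama_token_fim_* getters follow the pattern of the
// existing llama_token_* accessors (take the model, return a llama_token) and
// that a negative return value means the token is not known for this model.
static bool model_has_fim_tokens(const struct llama_model * model) {
    const llama_token fim_pre = llama_token_fim_pre(model);
    const llama_token fim_suf = llama_token_fim_suf(model);
    const llama_token fim_mid = llama_token_fim_mid(model);

    if (fim_pre < 0 || fim_suf < 0 || fim_mid < 0) {
        fprintf(stderr, "model does not expose the FIM prefix/suffix/middle tokens\n");
        return false;
    }

    fprintf(stderr, "FIM tokens: pre=%d suf=%d mid=%d\n", fim_pre, fim_suf, fim_mid);
    return true;
}
```

Presumably the deprecated llama_token_prefix/suffix/middle calls keep working during a transition period and resolve to the same underlying tokens.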

Example

```bash
# serve FIM-compatible model
./llama-server \
    -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf \
    --port 8012 --ctx-size 32768 -ngl 99 -np 1 -fa

# sample FIM request
curl \
    --silent --no-buffer --request POST --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_prefix": "#include <cstdio>\n\nint main() {\n    printf(", "input_suffix": ");\n    return 0;\n}\n", "temperature": 0.1, "n_predict": 64, "stream": false, "top_k": 5, "prompt": ""}' | jq
```

Comment on lines +477 to +479
```cpp
{ LLM_KV_TOKENIZER_FIM_PRE_ID, "tokenizer.ggml.fim_pre_token_id" },
{ LLM_KV_TOKENIZER_FIM_SUF_ID, "tokenizer.ggml.fim_suf_token_id" },
{ LLM_KV_TOKENIZER_FIM_MID_ID, "tokenizer.ggml.fim_mid_token_id" },
```

Contributor commented:

Changing the names of the metadata keys for these does not seem like a great idea: the tokens themselves are still the same, and there are many GGUFs out there with them. Changing this will only cause headaches...

ggerganov (Owner, PR author) commented:

I thought it would be OK because the convert script does not export FIM tokens for any model. Initially, the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened, and there is little hope that it will in the future.

The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
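
For illustration only, a rough sketch of what text-piece-based detection could look like; the piece strings and data structures below are hypothetical examples, not the actual libllama implementation:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of FIM token auto-detection at model load time: scan the
// vocabulary for known sentinel spellings. The list below is illustrative only.
enum fim_type { FIM_PRE, FIM_SUF, FIM_MID };

static void detect_fim_tokens(
        const std::unordered_map<std::string, int> & token_to_id, // piece -> token id
        std::unordered_map<int, int>               & fim_tokens)  // fim_type -> token id
{
    static const std::pair<const char *, fim_type> known_pieces[] = {
        { "<|fim_prefix|>", FIM_PRE }, // spelling used by e.g. Qwen2.5-Coder / CodeGemma
        { "<|fim_suffix|>", FIM_SUF },
        { "<|fim_middle|>", FIM_MID },
        { "<fim_prefix>",   FIM_PRE }, // StarCoder-style spelling
        { "<fim_suffix>",   FIM_SUF },
        { "<fim_middle>",   FIM_MID },
    };

    for (const auto & [piece, type] : known_pieces) {
        const auto it = token_to_id.find(piece);
        if (it != token_to_id.end() && fim_tokens.count(type) == 0) {
            fim_tokens[type] = it->second; // first match wins
        }
    }
}
```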

Still, let me know if you think I'm missing something.

Contributor commented:

> I thought it would be OK because the convert script does not export FIM tokens for any model. Initially, the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened, and there is little hope that it will in the future.

That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

> The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.

While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them: quite a few models either perform poorly or not at all if you attempt to use them (DeepSeek v1 instruct models, for example, will spew endless garbage). In these cases it's usually much better to just drop the suffix, feed the model the prefix text, and let it generate the rest (this is in fact what llama-cpp-python will do if there are no prefix/suffix tokens).

> Still, let me know if you think I'm missing something.

Fill-in-Middle is a tricky subject. Even if you ignore the whole PSM/SPM aspect, just look at Yi-Coder-9B; unfortunately, I don't think we can just assume it's possible to handle this automatically.
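
For readers unfamiliar with the terms, PSM and SPM refer to the order in which the prefix and suffix are laid out in the FIM prompt. A heavily simplified sketch with placeholder sentinels; real models use their own spellings and some place the sentinels differently, so this is not a drop-in template:

```cpp
#include <string>

// Illustrative only: the two common FIM prompt orderings.
// "<PRE>", "<SUF>", "<MID>" are placeholder sentinels, not real token strings.
std::string fim_prompt_psm(const std::string & prefix, const std::string & suffix) {
    // PSM (prefix-suffix-middle): the model generates the middle after <MID>
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>";
}

std::string fim_prompt_spm(const std::string & prefix, const std::string & suffix) {
    // SPM (suffix-prefix-middle): the suffix is presented before the prefix
    return "<SUF>" + suffix + "<PRE>" + prefix + "<MID>";
}
```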

ggerganov (Owner, PR author) commented:

> While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them

That's fine; this change does not force the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code-completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

> quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)

Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

> Fill-in-Middle is a tricky subject

Yes, I agree. My idea for these changes is the following (in the context of llama-server):

  • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
  • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error (see the sketch after this list).
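
A minimal sketch of what such a check could look like in the /infill handler, reusing the assumed llama_token_fim_* getters from above; this is not the actual llama-server code:

```cpp
#include "llama.h"

#include <string>

// Hypothetical /infill pre-check. Assumes llama_token_fim_pre/suf/mid return a
// negative value when the corresponding token is not known for the loaded model.
static bool infill_supported(const struct llama_model * model, std::string & error) {
    if (llama_token_fim_pre(model) < 0 ||
        llama_token_fim_suf(model) < 0 ||
        llama_token_fim_mid(model) < 0) {
        error = "the loaded model does not have the FIM tokens required by /infill";
        return false;
    }
    return true;
}
```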

IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.

> That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Contributor commented:

> That's fine; this change does not force the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code-completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

Well, it shifts the burden of deciding which tokens are acceptable to use from the model metadata to the user/application (which may or may not be fine, but application maintainers will need to be made aware of this and adapt).

> > quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)
>
> Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

The instruct model? IIRC I tested this and could only get the 33B one to give somewhat reasonable results. It could, however, just have been other fine-tunes (like OpenCodeInterpreter)...

> > Fill-in-Middle is a tricky subject
>
> Yes, I agree. My idea for these changes is the following (in the context of llama-server):
>
> • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
> • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.

This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

> IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.

For sure, I use it daily. It is somewhat of a pet peeve of mine when GGUFs are made without the necessary metadata, so anything that can make this work more smoothly is definitely welcome! :)

> > That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
>
> I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Yes, though ironically not DeepSeek, because they use different tokens that you have not added. :)

Comment on lines +480 to +482
```cpp
{ LLM_KV_TOKENIZER_FIM_PAD_ID, "tokenizer.ggml.fim_pad_token_id" },
{ LLM_KV_TOKENIZER_FIM_REP_ID, "tokenizer.ggml.fim_rep_token_id" },
{ LLM_KV_TOKENIZER_FIM_SEP_ID, "tokenizer.ggml.fim_sep_token_id" },
```

Contributor commented:

These need to be added to gguf-py/gguf/constants.py too.
