
llama : improve infill support #9798

Open · wants to merge 2 commits into master

Conversation

ggerganov (Owner) commented on Oct 9, 2024

Separating some of the changes from #9787 into this PR for clarity:

  • libllama now auto-detects FIM tokens based on their text pieces
  • add PAD, REPO and SEP FIM tokens
  • deprecate the old llama_token_prefix, llama_token_suffix and llama_token_middle API in favor of the new llama_token_fim_ API (a usage sketch follows after this list)
  • llama-server now correctly tokenizes the prefix and suffix prompts by not parsing special tokens (/infill endpoint)
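
For illustration, a minimal sketch of how an application could query the new getters. The exact signatures and the "negative value means unknown" convention are assumptions modeled on the existing llama_token_* accessors, not taken from the PR diff:

```cpp
#include "llama.h"

#include <cstdio>

// Sketch only: assumes the llama_token_fim_* getters follow the pattern of the
// existing llama_token_* accessors (take the model, return a llama_token) and
// that a negative return value means the token is not known for this model.
static bool model_has_fim_tokens(const struct llama_model * model) {
    const llama_token fim_pre = llama_token_fim_pre(model);
    const llama_token fim_suf = llama_token_fim_suf(model);
    const llama_token fim_mid = llama_token_fim_mid(model);

    if (fim_pre < 0 || fim_suf < 0 || fim_mid < 0) {
        fprintf(stderr, "model does not expose the FIM prefix/suffix/middle tokens\n");
        return false;
    }

    fprintf(stderr, "FIM tokens: pre=%d suf=%d mid=%d\n", fim_pre, fim_suf, fim_mid);
    return true;
}
```

Presumably the deprecated llama_token_prefix/suffix/middle calls keep working during a transition period and resolve to the same underlying tokens.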

Example

```bash
# serve FIM-compatible model
./llama-server \
    -m ./models/qwen2.5-7b-coder/ggml-model-q8_0.gguf \
    --port 8012 --ctx-size 32768 -ngl 99 -np 1 -fa

# sample FIM request
curl \
    --silent --no-buffer --request POST --url http://127.0.0.1:8012/infill \
    --header "Content-Type: application/json" \
    --data '{"input_prefix": "#include <cstdio>\n\nint main() {\n    printf(", "input_suffix": ");\n    return 0;\n}\n", "temperature": 0.1, "n_predict": 64, "stream": false, "top_k": 5, "prompt": ""}' | jq
```

Comment on lines +477 to +479
```cpp
{ LLM_KV_TOKENIZER_FIM_PRE_ID, "tokenizer.ggml.fim_pre_token_id" },
{ LLM_KV_TOKENIZER_FIM_SUF_ID, "tokenizer.ggml.fim_suf_token_id" },
{ LLM_KV_TOKENIZER_FIM_MID_ID, "tokenizer.ggml.fim_mid_token_id" },
```

Contributor commented:

Changing the names of the metadata keys for these does not seem like a great idea: the tokens themselves are still the same, and there are many GGUFs out there with them. Changing this will only cause headaches...

ggerganov (Owner, PR author) commented:

I thought it would be OK because the convert script does not export FIM tokens for any model. Initially, the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened, and there is little hope that it will in the future.

The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.
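
For illustration only, a rough sketch of what text-piece-based detection could look like; the piece strings and data structures below are hypothetical examples, not the actual libllama implementation:

```cpp
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical sketch of FIM token auto-detection at model load time: scan the
// vocabulary for known sentinel spellings. The list below is illustrative only.
enum fim_type { FIM_PRE, FIM_SUF, FIM_MID };

static void detect_fim_tokens(
        const std::unordered_map<std::string, int> & token_to_id, // piece -> token id
        std::unordered_map<int, int>               & fim_tokens)  // fim_type -> token id
{
    static const std::pair<const char *, fim_type> known_pieces[] = {
        { "<|fim_prefix|>", FIM_PRE }, // spelling used by e.g. Qwen2.5-Coder / CodeGemma
        { "<|fim_suffix|>", FIM_SUF },
        { "<|fim_middle|>", FIM_MID },
        { "<fim_prefix>",   FIM_PRE }, // StarCoder-style spelling
        { "<fim_suffix>",   FIM_SUF },
        { "<fim_middle>",   FIM_MID },
    };

    for (const auto & [piece, type] : known_pieces) {
        const auto it = token_to_id.find(piece);
        if (it != token_to_id.end() && fim_tokens.count(type) == 0) {
            fim_tokens[type] = it->second; // first match wins
        }
    }
}
```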

Still, let me know if you think I'm missing something.

Contributor commented:

> I thought it would be OK because the convert script does not export FIM tokens for any model. Initially, the hope was that the tokens would be exported during conversion to the respective GGUF KVs in the metadata, but this never happened, and there is little hope that it will in the future.

That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

> The auto-detection of the FIM (and other special) tokens, although not perfect in theory, should work in all cases, whether or not the old KVs are present. So I think it is a safe change that improves the consistency of the names.

While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them: quite a few models either perform poorly or not at all if you attempt to use them (DeepSeek v1 instruct models, for example, will spew endless garbage). In these cases it's usually much better to just drop the suffix, feed the model the prefix text, and let it generate the rest (this is in fact what llama-cpp-python will do if there are no prefix/suffix tokens).

> Still, let me know if you think I'm missing something.

Fill-in-Middle is a tricky subject. Even if you ignore the whole PSM/SPM aspect, just look at Yi-Coder-9B; unfortunately, I don't think we can just assume it's possible to handle this automatically.
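
For readers unfamiliar with the terms, PSM and SPM refer to the order in which the prefix and suffix are laid out in the FIM prompt. A heavily simplified sketch with placeholder sentinels; real models use their own spellings and some place the sentinels differently, so this is not a drop-in template:

```cpp
#include <string>

// Illustrative only: the two common FIM prompt orderings.
// "<PRE>", "<SUF>", "<MID>" are placeholder sentinels, not real token strings.
std::string fim_prompt_psm(const std::string & prefix, const std::string & suffix) {
    // PSM (prefix-suffix-middle): the model generates the middle after <MID>
    return "<PRE>" + prefix + "<SUF>" + suffix + "<MID>";
}

std::string fim_prompt_spm(const std::string & prefix, const std::string & suffix) {
    // SPM (suffix-prefix-middle): the suffix is presented before the prefix
    return "<SUF>" + suffix + "<PRE>" + prefix + "<MID>";
}
```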

ggerganov (Owner, PR author) commented:

> While the auto-detected tokens will cover 99% of the models currently out there, the fact that a model has these tokens in its vocabulary does not mean it's a good idea to use them

That's fine; this change does not force the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code-completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

> quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)

Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

> Fill-in-Middle is a tricky subject

Yes, I agree. My idea for these changes is the following (in the context of llama-server):

  • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
  • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error (see the sketch after this list).
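
A minimal sketch of what such a check could look like in the /infill handler, reusing the assumed llama_token_fim_* getters from above; this is not the actual llama-server code:

```cpp
#include "llama.h"

#include <string>

// Hypothetical /infill pre-check. Assumes llama_token_fim_pre/suf/mid return a
// negative value when the corresponding token is not known for the loaded model.
static bool infill_supported(const struct llama_model * model, std::string & error) {
    if (llama_token_fim_pre(model) < 0 ||
        llama_token_fim_suf(model) < 0 ||
        llama_token_fim_mid(model) < 0) {
        error = "the loaded model does not have the FIM tokens required by /infill";
        return false;
    }
    return true;
}
```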

IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.

> That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.

I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Contributor commented:

> That's fine; this change does not force the user to use the FIM tokens. We just want to make it more robust to correctly identify which tokens could be used for FIM when the model is loaded. By "should work in all cases" I am not referring to the code-completion use cases, but to loading a model and having the FIM tokens correctly identified. It's entirely up to the application to decide whether to use them or not.

Well, it shifts the burden of deciding which tokens are acceptable to use from the model metadata to the user/application (which may or may not be fine, but application maintainers will need to be made aware of this and adapt).

> > quite a few models either perform poorly or not at all (DeepSeek v1 instruct models, for example, will spew endless garbage)
>
> Hm, that's strange. My experience with DeepSeekCoder-6.7B + FIM has been quite positive.

The instruct model? IIRC I tested this and could only get the 33B one to give somewhat reasonable results. It could, however, just have been other fine-tunes (like OpenCodeInterpreter)...

> > Fill-in-Middle is a tricky subject
>
> Yes, I agree. My idea for these changes is the following (in the context of llama-server):
>
> • If you don't want to use FIM with a model, use the /completion endpoint to pass the prefix and let the model complete it. This path does not require FIM tokens to exist.
> • If you want to use FIM, use the /infill endpoint to pass the prefix and suffix. This triggers checks that the model supports FIM and that the FIM tokens are known; otherwise it returns an error.

This approach does indeed seem reasonable; however, /infill is not in the OpenAI spec, so I don't think any other server implementations do it this way (and you can't get GitHub Copilot to use /infill)?

> IMO, the FIM workflow is essential for efficient LLM-based code completion, so it needs to become more robust, regardless of whether there are models that do not support it.

For sure, I use it daily. It is somewhat of a pet peeve of mine when GGUFs are made without the necessary metadata, so anything that can make this work more smoothly is definitely welcome! :)

> > That's incorrect: the conversion script has been adding these for Refact-, CodeGemma- and CodeLlama-based models for quite a while. Additionally, I (and I believe a few others as well) have been adding them manually where applicable.
>
> I think all these models will continue to work exactly the same way with this branch, thanks to the auto-detection logic. Is that not the case?

Yes, though ironically not DeepSeek, because they use different tokens that you have not added. :)

Comment on lines +480 to +482
```cpp
{ LLM_KV_TOKENIZER_FIM_PAD_ID, "tokenizer.ggml.fim_pad_token_id" },
{ LLM_KV_TOKENIZER_FIM_REP_ID, "tokenizer.ggml.fim_rep_token_id" },
{ LLM_KV_TOKENIZER_FIM_SEP_ID, "tokenizer.ggml.fim_sep_token_id" },
```

Contributor commented:

These need to be added to gguf-py/gguf/constants.py too.
