
fixes infill incorrect tokenization #3508

Merged: 15 commits merged into ggerganov:master on Oct 10, 2023
Conversation

@vvhg1 (Contributor) commented Oct 6, 2023

fixes #3503

@vvhg1 marked this pull request as draft October 6, 2023 17:07
@vvhg1 changed the title from "fixes infill incorrect tokenization #3503" to "fixes infill incorrect tokenization" on Oct 6, 2023
…ing space token from suffix when params.escape
std::vector<llama_token> inp_sfx = ::llama_tokenize(ctx, params.input_suffix, false);
const int space_token = 29871;
if (params.escape && inp_sfx.size() > 1 && inp_sfx[0] == space_token) {
@ggerganov (Owner) commented:
I think we always want to remove the leading space, regardless of whether params.escape is set or not:

Suggested change:
-    if (params.escape && inp_sfx.size() > 1 && inp_sfx[0] == space_token) {
+    if (inp_sfx.size() > 1 && inp_sfx[0] == space_token) {

@vvhg1 (Contributor, Author) replied:

I think we need the params.escape check: only when escaping do we add an extra space to the suffix.
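
For readers following along, here is the check under discussion with explanatory comments (a sketch; 29871 is the lone-space token id in the CodeLlama vocab, as the measurements further down confirm):

    // Tokenize the suffix without prepending BOS.
    std::vector<llama_token> inp_sfx = ::llama_tokenize(ctx, params.input_suffix, false);
    const int space_token = 29871; // id of a single leading space
    // Only when escaping is enabled has an extra space been introduced in
    // front of the suffix, so only then should the resulting lone-space
    // token be dropped; the size() > 1 check avoids erasing the only token.
    if (params.escape && inp_sfx.size() > 1 && inp_sfx[0] == space_token) {
        inp_sfx.erase(inp_sfx.begin()); // remove the escape-added space token
    }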

Comment on lines +236 to +240
bool suff_rm_leading_spc = params.escape;
if (suff_rm_leading_spc && params.input_suffix.find_first_of(" ") == 0 && params.input_suffix.size() > 1) {
params.input_suffix.erase(0, 1);
suff_rm_leading_spc = false;
}
@ggerganov (Owner) commented:

Hm, I could be missing something, but this whole section should not be needed.

If you keep everything else and just replace this block with:

Suggested change:
-    bool suff_rm_leading_spc = params.escape;
-    if (suff_rm_leading_spc && params.input_suffix.find_first_of(" ") == 0 && params.input_suffix.size() > 1) {
-        params.input_suffix.erase(0, 1);
-        suff_rm_leading_spc = false;
-    }
+    const bool suff_rm_leading_spc = true;

Can you find a case that does not match the Python version?

@vvhg1 (Contributor, Author) commented Oct 7, 2023:

Yes, I can find cases where it misbehaves:

With params.escape, if the suffix starts with a space, the tokenizer adds a space and then tokenizes the resulting double space to 259, so we can't simply remove the surplus space anymore; we would have to "translate" the 259 back to 29871 to remove it. The same goes for more than one leading space.

Without params.escape, if the suffix starts with a space, we remove it, so the result no longer reflects the original suffix.

I am not sure what the Python version does, as I can't test it, but the behaviour above seems wrong.
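
To make the failure mode concrete, here is a hypothetical probe (assuming a loaded CodeLlama model in ctx and the llama.cpp common helper llama_tokenize; exact ids depend on the vocab and on how the tokenizer handles prefix spaces):

    // Dump the token ids produced for a suffix with a growing number of
    // leading spaces, to observe where the space tokens merge.
    for (const char * s : { " x", "  x", "   x" }) {
        const std::vector<llama_token> toks = ::llama_tokenize(ctx, s, false);
        printf("'%s' ->", s);
        for (const llama_token t : toks) {
            printf(" %d", (int) t);
        }
        printf("\n");
    }
    // The point made above: a lone space has its own id (29871), but runs
    // of spaces merge into different single ids (e.g. 259 for two), so
    // erasing a 29871 after tokenization cannot recover the original suffix.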

@ggerganov (Owner) replied:

Ok, that makes sense. Let's double-check, just in case:
@kherud Could you please run the Python codellama with a suffix that has two leading spaces and see what tokens are produced for infill?

@kherud (Contributor) commented Oct 8, 2023:

'<PRE> <SUF>'     : 1, 32007, 259, 32008, 32009
'<PRE> <SUF> '    : 1, 32007, 259, 32008, 29871, 32009
'<PRE> <SUF>  '   : 1, 32007, 259, 32008, 259, 32009
'<PRE> <SUF>   '  : 1, 32007, 259, 32008, 1678, 32009
'<PRE> <SUF>    ' : 1, 32007, 259, 32008, 268, 32009
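
For reference, here is roughly how the string-level trim and the token-level erase fit together after this discussion (a sketch of the approach; the exact merged lines may differ):

    // Expect an escape-added leading space only when escaping is enabled.
    bool suff_rm_leading_spc = params.escape;
    // If the raw suffix itself starts with a space, trim it in string form,
    // where one space vs. two is still unambiguous, and note that no
    // token-level removal is needed afterwards.
    if (suff_rm_leading_spc && params.input_suffix.find_first_of(" ") == 0 && params.input_suffix.size() > 1) {
        params.input_suffix.erase(0, 1);
        suff_rm_leading_spc = false;
    }
    std::vector<llama_token> inp_sfx = ::llama_tokenize(ctx, params.input_suffix, false);
    const int space_token = 29871; // lone-space id in the CodeLlama vocab
    // Otherwise drop the single space token that escaping introduced.
    if (suff_rm_leading_spc && inp_sfx.size() > 1 && inp_sfx[0] == space_token) {
        inp_sfx.erase(inp_sfx.begin());
    }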

@vvhg1 marked this pull request as ready for review October 7, 2023 07:59
@vvhg1 marked this pull request as draft October 7, 2023 08:18
@vvhg1 (Contributor, Author) commented Oct 7, 2023

There seems to be another issue too: in interactive mode, -e only acts on the initial input, because the escape processing happens when the params are loaded in common.cpp:617.
Is that a bug? It certainly is one for infill, so I fixed it there, but what about main and the other examples?
params.cfg_negative_prompt does not seem to be escaped anywhere either. Is that another bug or desired behaviour?
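
A sketch of the kind of fix meant here (hypothetical: process_escapes is the helper in common.cpp that the -e flag normally triggers at parameter-load time, and get_interactive_input is a made-up stand-in for the example's input reading):

    // Re-apply escape processing to each interactive input, not only to the
    // initial prompt that was processed when the params were loaded.
    std::string buffer = get_interactive_input(); // hypothetical helper
    if (params.escape) {
        process_escapes(buffer); // turns "\\n" into a newline, "\\t" into a tab, ...
    }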

@vvhg1 marked this pull request as ready for review October 8, 2023 11:21
@ggerganov (Owner) commented:
This is probably an oversight - we should escape the user input and params.cfg_negative_prompt as well

@vvhg1 (Contributor, Author) commented Oct 9, 2023

> This is probably an oversight - we should escape the user input and params.cfg_negative_prompt as well

Should be an easy fix, as I am doing that already in the infill script.
Do you want the fix here or in a separate PR?
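
Assuming the same process_escapes helper, the negative-prompt part could be as small as the following sketch (not the actual patch):

    // Escape the classifier-free-guidance negative prompt as well.
    if (params.escape) {
        process_escapes(params.cfg_negative_prompt);
    }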

@ggerganov (Owner) left a comment:

Any way you prefer.

If the results match the Python version now, I think we can merge.
I can do some testing later, or we can wait for some feedback from other people.

@ggerganov added the "need feedback" label (Testing and feedback with results are needed) on Oct 9, 2023
@vvhg1 (Contributor, Author) commented Oct 9, 2023

> Any way you prefer.

Would keep it separate, as the issues are related but not identical?

> If the results match the Python version now, I think we can merge. I can do some testing later, or we can wait for some feedback from other people.

I think it matches now; your call to merge or wait.

@ggerganov (Owner) commented:
Ok, will merge this first thing tomorrow unless somebody notices any problems

@ggerganov merged commit 11ea5c7 into ggerganov:master Oct 10, 2023
39 checks passed
joelkuiper added a commit to vortext/llama.cpp that referenced this pull request Oct 12, 2023
…example

* 'master' of github.com:ggerganov/llama.cpp: (34 commits)
  examples: support LLaVA v1.5 (multimodal model) (ggerganov#3436)
  docs : fix typo GOMP_CPU_AFFINITY (ggerganov#3597)
  cmake : fix add_compile_options on macOS
  typo : it is `--n-gpu-layers` not `--gpu-layers` (ggerganov#3592)
  ci : check if there is enough VRAM (ggerganov#3596)
  server : add completion mode (no chat) (ggerganov#3582)
  prompts : add mnemonics.txt
  server : fix kv cache management (ggerganov#3588)
  main : fix session loading bug (ggerganov#3400)
  server : add parameter -tb N, --threads-batch N (ggerganov#3584)
  common : fix mirostat state when using multiple sequences (ggerganov#3543)
  batched : add bench tool (ggerganov#3545)
  examples : add batched.swift + improve CI for swift (ggerganov#3562)
  Add MPT model to supported models in README.md (ggerganov#3574)
  Minor improvements in GPT2 tokenizer (ggerganov#3567)
  readme : add bloom (ggerganov#3570)
  llm : add bloom models (ggerganov#3553)
  swift : improvements and fixes (ggerganov#3564)
  llm : add MPT support (ggerganov#3417)
  infill. : fix tokenization (ggerganov#3508)
  ...
Labels: need feedback (Testing and feedback with results are needed)
Linked issue: Infill Incorrect Tokenization (#3503)
3 participants