Adding SqueezeLLM Support #3093

Open
chooper1 wants to merge 5 commits into master
Conversation

chooper1 commented Sep 9, 2023

SqueezeLLM Support

This PR adds support for the SqueezeLLM quantization method, which is described in the following preprint: https://arxiv.org/abs/2306.07629, and which has open-source GPU inference code available at https://github.com/SqueezeAILab/SqueezeLLM. SqueezeLLM is a post-training quantization framework that allows for high-accuracy and runtime-efficient quantization at low bit precision.

This PR contains the inference code to run the 4-bit dense-only non-uniform quantization scheme outlined in the preprint, as well as the code required to convert the Huggingface (PyTorch) checkpoints to the required binary format in order to be compatible with the llama.cpp file loader.
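For readers unfamiliar with lookup-table (non-uniform) quantization, here is a minimal NumPy sketch of the dequantization idea. This is not the PR's actual kernel; the (rows, 16) per-row table layout and the already-unpacked index array are illustrative assumptions.

```python
import numpy as np

def dequantize_lut4(indices: np.ndarray, lut: np.ndarray) -> np.ndarray:
    """Dequantize 4-bit non-uniform weights.

    indices: uint8 array of shape (rows, cols), each value in 0..15
             (one 4-bit code per weight; bit packing/unpacking is omitted).
    lut:     float32 array of shape (rows, 16) holding the 16 learned
             quantization levels ("signposts") for each output row.
    """
    # Gather lut[r, indices[r, c]] for every element of the weight matrix.
    return np.take_along_axis(lut, indices.astype(np.int64), axis=1)

# Toy usage: a 4x8 weight matrix with a per-row table of 16 levels.
rng = np.random.default_rng(0)
idx = rng.integers(0, 16, size=(4, 8), dtype=np.uint8)
lut = np.sort(rng.normal(size=(4, 16)).astype(np.float32), axis=1)
print(dequantize_lut4(idx, lut).shape)  # (4, 8)
```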

SqueezeLLM leverages non-uniform quantization to better represent the underlying distribution by shifting the quantization signposts to the optimal positions. SqueezeLLM shows promising performance both in accuracy and in runtime efficiency relative to the existing integrated quantization methods. The runtime per token was benchmarked on an M1 (128 tokens, without Metal), using checkpoints from https://huggingface.co/TheBloke/LLaMa-7B-GGML/:

[Screenshot: table comparing perplexity, model size, and runtime per token for SqueezeLLM against existing llama.cpp quantization formats]

(Edit: numbers updated to match the comments below, and to also include the model size with 8-bit embedding quantization. Precision estimates are from the link above.)

Quantized checkpoints are publicly available for a range of popular models, including LLaMA 1/2, Vicuna 1.1/1.3/1.5, XGen, and OPT: https://huggingface.co/squeeze-ai-lab

Example usage (for the 7B LLaMA-2 model); a scripted version of these steps is sketched after the list:

  • Build without Metal (run LLAMA_NO_METAL=1 make)
  • Download the model from https://huggingface.co/squeeze-ai-lab/sq-llama-2-7b-w4-s0 (copy the .pt file into the directory models/7B/sq-llama-2-7b-w4-s0)
  • Copy the llama-2 tokenizer.model and config.json files into the sq-llama-2-7b-w4-s0 folder
  • Convert the PyTorch checkpoint to GGUF format:
    python convert-sqllm-to-gguf.py --outtype q4_sq models/7B/sq-llama-2-7b-w4-s0/sq-llama-2-7b-w4-s0.pt --outfile models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -squeezellm
    The --equant flag can also be passed to quantize the input and output embeddings to 8 bits.
  • Run generation using the converted checkpoint:
    ./main -m models/7B/sq-llama-2-7b-w4-s0-fp16.gguf -n 128
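For convenience, the steps above can also be scripted; the following sketch simply shells out to the same commands (paths and flags are taken verbatim from the list, so adjust them if your layout differs):

```python
import os
import subprocess

MODEL_DIR = "models/7B/sq-llama-2-7b-w4-s0"   # holds the downloaded .pt, tokenizer.model and config.json
CKPT = os.path.join(MODEL_DIR, "sq-llama-2-7b-w4-s0.pt")
OUT_GGUF = "models/7B/sq-llama-2-7b-w4-s0-fp16.gguf"

# Build without Metal (equivalent to: LLAMA_NO_METAL=1 make).
subprocess.run(["make"], env={**os.environ, "LLAMA_NO_METAL": "1"}, check=True)

# Convert the PyTorch checkpoint to GGUF; append "--equant" to also quantize
# the input and output embeddings to 8 bits.
subprocess.run(["python", "convert-sqllm-to-gguf.py", "--outtype", "q4_sq",
                CKPT, "--outfile", OUT_GGUF, "-squeezellm"], check=True)

# Run generation with the converted checkpoint.
subprocess.run(["./main", "-m", OUT_GGUF, "-n", "128"], check=True)
```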

chooper1 marked this pull request as ready for review September 9, 2023 05:09
casper-hansen commented

It seems this is an inference-only implementation? How can we quantize new/custom models with SqueezeLLM?

ggerganov (Owner) commented Sep 9, 2023

Interesting work!

Can you try to update the ggml -> gguf convert step - it currently fails:

python3 convert-llama-ggml-to-gguf.py --input models/llama-7b-v2/ggml-model-q4_sq.bin --output models/llama-7b-v2/ggml-model-q4_sq.gguf --squeezellm
* Using config: Namespace(input=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.bin'), output=PosixPath('models/llama-7b-v2/ggml-model-q4_sq.gguf'), name=None, desc=None, gqa=1, eps='5.0e-06', context_length=2048, model_metadata_dir=None, vocab_dir=None, vocabtype='spm', squeezellm=True)

=== WARNING === Be aware that this conversion script is best-effort. Use a native GGUF model if possible. === WARNING ===

- Note: If converting LLaMA2, specifying "--eps 1e-5" is required. 70B models also need "--gqa 8".
* Scanning GGML input file
* File format: GGJTv1 with ftype MOSTLY_Q5_K_S
* GGML model hyperparameters: <Hyperparameters: n_vocab=32000, n_embd=4096, n_mult=256, n_head=32, n_layer=0, n_rot=128, n_ff=11008, ftype=MOSTLY_Q5_K_S>

=== WARNING === Special tokens may not be converted correctly. Use --model-metadata-dir if possible === WARNING ===

* Preparing to save GGUF file
* Adding model parameters and KV items
* Adding 32000 vocab item(s)
* Adding 291 tensor(s)
Traceback (most recent call last):
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 458, in <module>
    main()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 454, in main
    converter.save()
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 252, in save
    self.add_tensors(gguf_writer)
  File "/Users/ggerganov/development/github/llama.cpp/convert-llama-ggml-to-gguf.py", line 360, in add_tensors
    assert mapped_name is not None, f'Bad name {name}'
           ^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Bad name layers.0.attention.wq.weight

Edit:

Also, double check your "Model Size" column. These are the correct numbers in bytes:

ls -l models/llama-7b-v2/
total 197602496
-rw-r--r--  1 ggerganov  staff  13478104576 Aug 30 11:27 ggml-model-f16.gguf
-rw-r--r--  1 ggerganov  staff  26954272000 Aug 26 23:18 ggml-model-f32.gguf
-rw-r--r--  1 ggerganov  staff   2825940544 Aug 30 11:53 ggml-model-q2_k.gguf
-rw-r--r--  1 ggerganov  staff   3298004544 Aug 30 11:53 ggml-model-q3_k.gguf
-rw-r--r--  1 ggerganov  staff   2948304448 Sep  2 10:21 ggml-model-q3_k_s.gguf
-rw-r--r--  1 ggerganov  staff   3825806912 Aug 30 11:52 ggml-model-q4_0.gguf
-rw-r--r--  1 ggerganov  staff   4238749248 Aug 30 11:52 ggml-model-q4_1.gguf
-rw-r--r--  1 ggerganov  staff   4081004096 Aug 30 11:52 ggml-model-q4_k.gguf
-rw-r--r--  1 ggerganov  staff   3856739904 Sep  2 10:21 ggml-model-q4_k_s.gguf
-rw-r--r--  1 ggerganov  staff   3807322752 Sep  9 11:38 ggml-model-q4_sq.bin
-rw-r--r--  1 ggerganov  staff   4651691584 Aug 30 11:52 ggml-model-q5_0.gguf
-rw-r--r--  1 ggerganov  staff   5064633920 Aug 30 11:52 ggml-model-q5_1.gguf
-rw-r--r--  1 ggerganov  staff   4783156800 Aug 30 11:52 ggml-model-q5_k.gguf
-rw-r--r--  1 ggerganov  staff   4651691584 Sep  2 10:20 ggml-model-q5_k_s.gguf
-rw-r--r--  1 ggerganov  staff   5529194048 Aug 30 11:51 ggml-model-q6_k.gguf
-rw-r--r--  1 ggerganov  staff   7161089600 Aug 30 11:51 ggml-model-q8_0.gguf

Divide by 1e9 to get size in GB. For example, ggml-model-q4_sq.bin is 3.81 GB
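A quick way to reproduce these figures from a directory listing (decimal gigabytes, i.e. bytes divided by 1e9):

```python
from pathlib import Path

for f in sorted(Path("models/llama-7b-v2").glob("ggml-model-*")):
    print(f"{f.name:32s} {f.stat().st_size / 1e9:6.2f} GB")  # e.g. ggml-model-q4_sq.bin -> 3.81 GB
```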

KerfuffleV2 (Collaborator) commented Sep 9, 2023

> AssertionError: Bad name layers.0.attention.wq.weight

I think this is because the name doesn't need to be mapped and the name mapping stuff doesn't support an identity operation. We should probably fix that; it would be an easy change.

edit: Fixed it with #3095
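For context, the fix amounts to letting the tensor-name mapping fall back to the original name instead of asserting. Below is a hypothetical sketch of that idea only (the actual change in #3095 differs in detail):

```python
def map_tensor_name(name: str, name_map: dict[str, str]) -> str:
    # If the tensor name already matches the target naming scheme, there is
    # no entry in the map; pass it through unchanged instead of failing.
    return name_map.get(name, name)

# Example: a name with no mapping entry is kept as-is rather than
# tripping the "Bad name" assertion.
print(map_tensor_name("layers.0.attention.wq.weight", {}))
```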

chooper1 (Author) commented Sep 9, 2023

> Can you try to update the ggml -> gguf convert step - it currently fails:
> [...]
> AssertionError: Bad name layers.0.attention.wq.weight

Thank you for trying this out! I fixed the steps listed in the comment above: you also need to copy the config.json file as well as the tokenizer file before running the first conversion step (pt -> ggml). Please let me know if there are any issues with this.

KerfuffleV2 (Collaborator) commented

Is there a reason to include a way to convert to GGML? As far as I know, there isn't really a use case for creating new GGML format files, so you can probably make your life easier by just not having to worry about that.

ikawrakow (Contributor) commented Sep 10, 2023

Adding SqueezeLLM to llama.cpp is really great! People have been asking for it for so long.

It would also be great if the table was updated with the actual perplexity values for the various existing quantization types listed in the table. For instance, the current LLaMA-v1-7B perplexity for Q4_0 is 6.1213, not 6.16. In the same way, we have PPL(Q4_K_S) = 6.0067, not 6.05 as per the table. Q4_K_S perplexity has never been 6.05: in the initial k_quants PR (#1684), Q4_K_S perplexity was 6.0215, it became 6.0067 after the re-tuning in PR #2816 (which also resulted in a slight increase in model size, from 3.83 GB to 3.86 GB as listed in the table). Q4_0 dropped from 6.1563 to 6.1213 when we started quantizing the output.weight tensor with Q6_K. If I remember correctly, the change happened in early June, so quite some time ago.

As the provided SqueezeLLM perplexity is for LLaMA-v1-7B, I was curious to see how this approach would perform for LLaMA-v2-7B. I followed the instructions to create the GGUF model and ran the perplexity tool. The calculation is very slow (6+ hours on my M2 Max), so I stopped after 344 batches. At that point, SqueezeLLM perplexity was higher than Q4_K_S by 0.0844. In my experience, after 300 batches the perplexity difference between two models matches the full-run difference to within +/- 0.002, so the projected SqueezeLLM perplexity for LLaMA-v2-7B is 5.96-5.97. This is to be compared with PPL(Q4_0) = 5.94 and PPL(Q4_K_S) = 5.88.

On the bright side, I notice that the provided model does not quantize the tok_embeddings and output.weight tensors. We know that one can quantize tok_embeddings with Q4_0 and output.weight with Q8_0 with negligible loss in accuracy. This would shave off 212 MB from the model size, so the SqueezeLLM model would become 3.6 GB and hence comparable in size to Q3_K_L, which is 3.55 GB and has a LLaMA-v2-7B perplexity of 5.9811. Model size vs. perplexity for LLaMA-v2-7B is illustrated in the following figure:

[Figure: model size vs. perplexity for LLaMA-v2-7B across quantization types]

Update: I downloaded a quantized model from https://huggingface.co/TheBloke/LLaMa-7B-GGML/ and I see that, indeed, in these models output.weight has not been quantized with Q6_K. Going back through the commit history, I see that Q6_K quantization of output.weight was disabled between commits 7a74dee (June 6) and 74a6d92 (June 12). @TheBloke must have prepared the GGMLV3 quantized models posted on HF in exactly that period.
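The projection and the size estimate above come down to simple arithmetic (numbers are taken from this thread; the +/- 0.002 convergence behaviour is an empirical observation):

```python
ppl_q4_k_s = 5.88      # full-run LLaMA-v2-7B perplexity for Q4_K_S
delta_344 = 0.0844     # SqueezeLLM minus Q4_K_S after 344 batches
print(f"projected SqueezeLLM PPL ~ {ppl_q4_k_s + delta_344:.2f}")  # ~5.96, i.e. 5.96-5.97

size_gb = 3.81         # current SqueezeLLM model size (fp16 embeddings), see listing above
savings_gb = 0.212     # from quantizing tok_embeddings (Q4_0) and output.weight (Q8_0)
print(f"size with quantized embeddings ~ {size_gb - savings_gb:.2f} GB")  # ~3.6 GB
```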

chooper1 (Author) commented

> Is there a reason to include a way to convert to GGML? As far as I know, there isn't really a use case for creating new GGML format files, so you can probably make your life easier by just not having to worry about that.

Thank you for the feedback! I've updated the file conversion code to convert directly to GGUF.

chooper1 (Author) commented

> It would also be great if the table was updated with the actual perplexity values for the various existing quantization types listed in the table. [...]
> On the bright side, I'm noticing that the provided model does not quantize the tok_embeddings and output.weight tensors. We know that one can quantize tok_embeddings with Q4_0 and output.weight with Q8_0 with negligible loss in accuracy.

Thank you for pointing this out! I've updated the table with the current perplexities for the existing quantization methods, and I've also added a row for quantizing the input and output embeddings to Q8_0. I'll test quantizing the input embedding to Q4_0 to see if we can also do this with minimal degradation.

chooper1 (Author) commented

@ggerganov @KerfuffleV2 @ikawrakow Just following up to see if there is any additional feedback or if the updates look good. Please let me know what else we need to do to integrate this!

KerfuffleV2 (Collaborator) commented Sep 19, 2023

I'm hesitant to post stuff in pulls that isn't really contributing, since it spams people's notifications, but I just wanted to say I'm not ignoring the @. I'm just not really a person with the ability/authority to decide whether this gets accepted. Unfortunately, I also haven't had a chance to look too closely at this yet, so I don't have any other feedback to add right now. Edit: I think I can make the checks run for you, though.

ggerganov (Owner) commented

I'm hesitant to integrate this for a few reasons:

  • There is no model quantization implementation provided. Not sure how difficult this would be
  • There are no AVX or GPU kernels yet. Would the table lookup still be efficient in those cases?
  • Adding the above would be a huge amount of extra code to maintain, and I am not sure it is worth it given the current results

For now, we should keep this branch as a PoC
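On the first point above, for anyone who wants to experiment, the core of a basic non-uniform quantizer is small: fit 16 levels per output row with k-means and store the indices plus the table. The sketch below is only a rough, unweighted stand-in for SqueezeLLM's sensitivity-weighted k-means, purely to illustrate the shape of the missing piece:

```python
import numpy as np

def quantize_row_lut4(w_row: np.ndarray, iters: int = 20):
    """Fit 16 non-uniform levels to one weight row with plain k-means (Lloyd's algorithm).

    Returns (indices as uint8 in 0..15, lut of 16 float32 levels).
    NOTE: SqueezeLLM itself uses sensitivity-weighted k-means; this is unweighted.
    """
    w = w_row.astype(np.float32)
    # Initialize the 16 levels from quantiles so they cover the weight distribution.
    lut = np.quantile(w, np.linspace(0.0, 1.0, 16)).astype(np.float32)
    for _ in range(iters):
        idx = np.abs(w[:, None] - lut[None, :]).argmin(axis=1)  # assignment step
        for k in range(16):                                     # update step
            if np.any(idx == k):
                lut[k] = w[idx == k].mean()
    return idx.astype(np.uint8), lut

# Toy usage: quantize one 4096-wide row and check the reconstruction error.
rng = np.random.default_rng(0)
row = rng.normal(scale=0.02, size=4096)
idx, lut = quantize_row_lut4(row)
print(f"mean abs error: {np.abs(lut[idx] - row).mean():.6f}")
```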

ggerganov added the "demo" label (Demonstrate some concept or idea, not intended to be merged) on Sep 27, 2023