
Quantization of transformer state for matrix-vector products potentially causes numerical accuracy issues #4755

Closed
cafaxo opened this issue Jan 3, 2024 · 17 comments

@cafaxo commented Jan 3, 2024

I noticed that sometimes a very odd token is generated at the beginning of generation when using the CPU backend.
Example (CPU, zero temperature):

./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 0 --mlock
[...]
 The Julia programming language. surely, the Julia programming language is a

Example (GPU, zero temperature):

./main -m llama-2-7b.Q4_K_M.gguf --samplers "temp" --temp 0 -p "The Julia programming language." --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0
[...]
 The Julia programming language.
Julia is a high-level, [...]

The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product A*x where A is quantized (e.g. q4) and x is not quantized (e.g. float32).
The CPU backend quantizes x to q8 and then computes the product using an optimized vecdot(q4, q8) routine.
The GPU backend dequantizes A to float32 and then computes the product using float32.
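
To make the two paths concrete, here is a rough C++ sketch of what each backend effectively computes for a single block (a simplified illustration, not the actual ggml/Metal kernels; the block size of 32, the struct layouts, the names, and the nibble unpacking are assumptions for the example):

// Simplified sketch of the two matrix-vector paths; not the actual ggml/Metal code.
// A is stored as 4-bit blocks with one float scale per block; x is float32.
#include <algorithm>
#include <cmath>
#include <cstdint>

struct BlockQ4 { float d; uint8_t qs[16]; };  // 32 signed 4-bit weights packed in 16 bytes (illustrative layout)
struct BlockQ8 { float d; int8_t  qs[32]; };  // 32 8-bit activations plus one scale (illustrative layout)

int unpack_q4(const BlockQ4 &a, int i) {
    // low nibble holds weights 0..15, high nibble holds weights 16..31 (illustrative)
    const uint8_t byte = a.qs[i % 16];
    return ((i < 16) ? (byte & 0x0F) : (byte >> 4)) - 8;
}

// CPU path: quantize x to 8 bits per block (absmax), then do an integer dot product.
BlockQ8 quantize_x_q8(const float *x) {
    BlockQ8 b;
    float amax = 0.0f;
    for (int i = 0; i < 32; ++i) amax = std::max(amax, std::fabs(x[i]));
    b.d = amax / 127.0f;
    for (int i = 0; i < 32; ++i) b.qs[i] = (int8_t) (b.d > 0.0f ? std::lround(x[i] / b.d) : 0);
    return b;
}

float vecdot_q4_q8(const BlockQ4 &a, const BlockQ8 &b) {
    int32_t acc = 0;
    for (int i = 0; i < 32; ++i) acc += unpack_q4(a, i) * b.qs[i];
    return a.d * b.d * (float) acc;  // the two block scales are applied once at the end
}

// GPU (Metal) path: dequantize A to float32 and accumulate in float32; x stays float32.
float vecdot_dequant_f32(const BlockQ4 &a, const float *x) {
    float acc = 0.0f;
    for (int i = 0; i < 32; ++i) acc += a.d * (float) unpack_q4(a, i) * x[i];
    return acc;
}

The point of the sketch is that on the CPU path the activation x goes through an 8-bit rounding step before the dot product, while on the dequantizing path it never does.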

I tried to find out why these kinds of nonsense tokens are generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse for the initial tokens. Example state in the third layer resulting from the first token:
[figure: hidden state in the third layer for the first token, with a few isolated spikes]
These spikes amplify the quantization errors in the corresponding blocks:
[figure: quantization errors in the blocks containing those spikes]

Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but it seems that the q8 quantization in the CPU backend definitely hurts numerical accuracy of matrix-vector products.
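
To illustrate the amplification, here is a small self-contained check with synthetic values (not actual model activations; the spike magnitude is arbitrary). With absmax 8-bit rounding, the per-block scale is set by the largest element, so a single spike increases the worst-case rounding error of every other element in its block:

// Illustrative check: one spike inflates the absmax scale and thus the rounding
// error of every element in its block. Synthetic data, not model activations.
#include <algorithm>
#include <cmath>
#include <cstdio>

static float max_abs_error_q8(const float *x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float d = amax / 127.0f;  // absmax scale for the whole block
    float err = 0.0f;
    for (int i = 0; i < n; ++i) {
        const float xq = d * (float) std::lround(x[i] / d);  // quantize + dequantize
        err = std::max(err, std::fabs(x[i] - xq));
    }
    return err;
}

int main() {
    float block[256];
    for (int i = 0; i < 256; ++i) block[i] = std::sin(0.1f * i);  // smooth values in [-1, 1]
    std::printf("no spike  : max abs error %.4f\n", max_abs_error_q8(block, 256));
    block[42] = 40.0f;  // a single large outlier (magnitude chosen arbitrarily)
    std::printf("with spike: max abs error %.4f\n", max_abs_error_q8(block, 256));
    // The scale jumps from ~1/127 to ~40/127, so the worst-case rounding error of the
    // small values grows from roughly 0.004 to roughly 0.16.
    return 0;
}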

@JohannesGaessler (Collaborator)

Consider a matrix-vector product A*x where A is quantized (e.g. q4) and x is not quantized (e.g. float32).
The CPU backend quantizes x to q8 and then computes the product using an optimized vecdot(q4, q8) routine.
The GPU backend dequantizes A to float32 and then computes the product using float32.

This is incorrect. It is true that the CUDA backend can use cuBLAS to do matrix-matrix multiplications in FP16/FP32, but cuBLAS is only used on Volta or newer and only if the batch size is > 32. In all other cases mul_mat_q, a matrix-matrix multiplication kernel in which the hidden state is quantized to q8_1, is used. This implementation should be equivalent to the CPU implementation within rounding error. In particular, the example you provided does not use cuBLAS GEMM.

It's still possible that there is something wrong with the CPU implementation, but the quantization of the hidden state is not the cause.

@cafaxo (Author) commented Jan 3, 2024

I am using the Metal GPU backend. If I am reading this correctly, it dequantizes here:

dequantize_func(x, il, temp_a);

@JohannesGaessler (Collaborator)

Okay, I don't know what the Metal code does. In any case, this is the output I get with the CUDA code that quantizes the hidden state to q8_1:

The Julia programming language.
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and

You only posted the first few tokens from your result, but I am not observing numerical issues for this test case.

@ggerganov (Owner)

This seems very similar to my observations in #2421

I still don't have a good understanding of why this happens, but it seems LLaMA v2 is much more susceptible than LLaMA v1 for some reason. Given that CUDA quantizes the hidden state but does not reproduce the behaviour, the root cause might be somewhere else. Anyway, looking forward to further analysis.

@JohannesGaessler (Collaborator) commented Jan 3, 2024

I think there is something going wrong in the first layer, but I don't think it's the matrix multiplication. I edited ggml_cuda_can_mul_mat to always do matrix multiplications on the GPU regardless of batch size, but I still get garbage outputs until all repeating layers are on the GPU:

> ./main -m models/nvme/llama_2-7b-q4_k_m.gguf --samplers "temp" --temp 0 -p "The Julia programming" --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 31 --mlock --n-predict 50
[...]
The Julia programming language▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅
> ./main -m models/nvme/llama_2-7b-q4_k_m.gguf --samplers "temp" --temp 0 -p "The Julia programming" --no-penalize-nl --top-k 0 --top-p 1.0 --min-p 0.0 --repeat-penalty 1.0 -ngl 32 --mlock --n-predict 50
[...]
The Julia programming language is a high-level, high-performance dynamic programming language. It is designed to be easy to use, easy to learn, and easy to implement. It is also designed to be fast, efficient, and easy to maintain.
Jul

@JohannesGaessler (Collaborator) commented Jan 3, 2024

I retract my previous statement. I did another test where I compiled with LLAMA_OPENBLAS and edited ggml_compute_forward_use_blas to use OpenBLAS for batch sizes >= 2, and this fixes the issue:

The Julia programming language.
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and

So the issue seems to be the CPU matrix multiplication after all. Specifically, it seems the issue only occurs if the non-OpenBLAS matrix multiplication is used for q4_K tensors. So there may be something subtly wrong with the matrix multiplication for either that format or the specific matrices where it is used. But I get good outputs with q3_K_L, so I think the problem is the q4_K matrix multiplication.

@JohannesGaessler (Collaborator)

In another test, q3_K_M also works as expected despite containing q4_K tensors; this issue is difficult to pin down.

@cafaxo (Author) commented Jan 3, 2024

I still think that the quantization is the culprit:
The CPU backend quantizes to q8_K, which has block size 256.
CUDA quantizes to q8_1, which has block size 32.

This gives a hint that the spikes might actually be the problem: block size 32 is less sensitive to spikes.

I ran some tests and can confirm that just changing the quantization to q8_1 instead of q8_K gets rid of the "surely" token.
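
A minimal numerical sketch of the block-size argument (synthetic data; only the absmax rounding is modeled, and the real q8_K and q8_1 formats differ in further details): the same 256-value state with one spike is quantized once with a single 256-wide block and once with eight 32-wide blocks.

// Illustrative comparison of hidden-state quantization error for block size 256
// (q8_K-like) vs block size 32 (q8_1-like). Only absmax rounding is modeled;
// the real q8_K/q8_1 formats differ in detail. Synthetic data.
#include <algorithm>
#include <cmath>
#include <cstdio>

static double rms_error(const float *x, int n, int block_size) {
    double sum = 0.0;
    for (int b = 0; b < n; b += block_size) {
        float amax = 0.0f;
        for (int i = b; i < b + block_size; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float d = amax / 127.0f;  // one scale per block
        for (int i = b; i < b + block_size; ++i) {
            const float xq = d > 0.0f ? d * (float) std::lround(x[i] / d) : 0.0f;
            sum += (double) (x[i] - xq) * (x[i] - xq);
        }
    }
    return std::sqrt(sum / n);
}

int main() {
    float state[256];
    for (int i = 0; i < 256; ++i) state[i] = std::sin(0.1f * i);  // smooth synthetic state
    state[17] = 40.0f;  // one spike (arbitrary magnitude)
    std::printf("block size 256 (q8_K-like): rms error %.5f\n", rms_error(state, 256, 256));
    std::printf("block size  32 (q8_1-like): rms error %.5f\n", rms_error(state, 256, 32));
    // With block size 32, only the block containing the spike gets the large scale;
    // the other seven blocks keep a small scale, so the RMS error ends up several times smaller.
    return 0;
}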

@JohannesGaessler (Collaborator)

I am currently working on a CUDA implementation for matrix multiplication that utilizes int8 tensor cores. A major issue is that loading the results from tensor cores has terrible performance. So I will soon try an implementation where the inputs are quantized as int8 but with a single scale per row/column. If the issue is in fact that the CPU quantization block size is too large, the quality of this implementation should be bad; I'll report back when I have a working prototype.
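
For reference, "a single scale per row/column" is just the block-wise absmax scheme taken to the extreme of one block spanning the entire hidden-state column; a hypothetical sketch (not the actual tensor-core prototype) could look like this:

// Hypothetical sketch: int8 quantization of a whole hidden-state column with a
// single absmax scale, i.e. block size = n. Not the actual tensor-core prototype.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<int8_t> quantize_column_int8(const float *x, int n, float &scale) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
    scale = amax / 127.0f;
    std::vector<int8_t> q(n);
    for (int i = 0; i < n; ++i) q[i] = (int8_t) (scale > 0.0f ? std::lround(x[i] / scale) : 0);
    return q;  // dot products can then run entirely in int8/int32 and be rescaled by `scale` at the end
}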

@JohannesGaessler (Collaborator)

I implemented a prototype for 8-bit quantization of the hidden state with only a single scale per column. The resulting implementation has very similar issues to the CPU implementation. This is the output that I get:

The Julia programming language.Љаулa: 1.1.1.
The Julia programming language.
The Julia programming language is a free, open-source, and fast-gungu, which is a free, open-source, and fast

In this case the prompt was "The Julia programming language.". But more generally, after any period there is a high likelihood that the next token is garbage, even if the prompt itself does not end with a period. So I think this really is an issue related to numerics and the large block size used by the CPU backend.

@JohannesGaessler (Collaborator)

I've been thinking: if a block size of 256 for quantizing the hidden state really causes garbage tokens upon punctuation, how do we know that a block size of 32 isn't still causing some form of damage? Does llama.cpp have a built-in way of looking at token probabilities?

@JohannesGaessler (Collaborator)

I generated some samples for the prompt "The Julia programming language." using either mul_mat_q (quantizes hidden state to q8_1) or dequantize_mul_mat_vec (does not quantize the hidden state) for prompt processing. Subjectively I do not feel like mul_mat_q produced more garbage tokens.

@JohannesGaessler (Collaborator)

I did some more testing: when using a single scale per hidden-state column, I need 10 bits per weight to get a continuation of "The Julia programming language." with 7b q8_0 that isn't garbage:

The Julia programming language. Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive

For 11 or more bits per weight I always get:

The Julia programming language.
Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and

@slaren (Collaborator) commented Jan 3, 2024

Does llama.cpp have a built-in way of looking at token probabilities?

Not sure if it is exactly what you need, but the default web UI of the server example has an option to show token probabilities.

@Sixzero commented Jan 4, 2024

How come the GPU can do what the 10-bit solution does on the CPU? 🤔

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity.

github-actions bot (Contributor) commented Apr 2, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
