
imatrix : use GGUF to store importance matrices #9400

Draft · compilade wants to merge 13 commits into master

Conversation

compilade (Collaborator)

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

There are multiple problems with imatrix which this PR addresses:

  • Ad-hoc format which isn't really readable by other projects (and which can only be extended backward-compatibly by appending more data at the end)
  • Non-deterministic tensor order which depends on unordered_map iteration order (making sha256sum useless for comparing imatrix files made on the same dataset)
  • Broken behavior at small -ub (intermediate saves happen far too often)
  • Can't use a bigger batch size than the chunk size

Summary of changes

  • Use GGUF to store imatrix data.
    • general.type is imatrix
    • no general.architecture
      • can't really know the architecture from old imatrix files.
    • store *.sums and *.counts for each tensor with imatrix data.
      • *.sums are the sums of activations
        • Stored in F32, like before.
      • *.counts are the number of activations (also the number of tokens), useful to calculate the mean
        • Why not simply store the mean? To allow merging imatrix files together with --in-file (see the first sketch after this list).
        • It's stored in F32 even though the values are integers, because calculating the mean would convert them to F32 anyway to perform the division.
  • Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf
  • Like llama-perplexity since perplexity : support using multiple sequences to allow larger batch sizes #5946, allow computing multiple chunks per batch with llama-imatrix
    • This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
  • Use fused multiply-add (with std::fma) when accumulating the sums of activations
    • Shouldn't hurt, and should somewhat reduce rounding errors
      • (obviously f64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing only one line in IMatrixStats: vector<float> values to vector<double> values.)
  • Sort the tensor names before serializing
    • This makes the tensor order deterministic; otherwise it depended on the iteration order of unordered_map (see the second sketch after this list).
      • Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
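
For illustration, here's a minimal sketch of how another tool could read the new format, assuming the gguf-py package from this repo and the *.sums / *.counts naming described above (the file name and variable names are just examples, not part of this PR):

```python
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("imatrix.gguf")  # example file name
data = {t.name: np.asarray(t.data, dtype=np.float32) for t in reader.tensors}

# The mean activation values (what quantization actually consumes) are
# recovered by dividing the stored sums by the stored counts.
means = {}
for name, sums in data.items():
    if name.endswith(".sums"):
        counts = data[name.removesuffix(".sums") + ".counts"]
        means[name.removesuffix(".sums")] = sums / counts
```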
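
And a sketch of what --in-file merging amounts to, and why sorting tensor names gives deterministic output (again a rough illustration with gguf-py; the file names and exact GGUFWriter call sequence are assumptions, not the actual implementation):

```python
import numpy as np
from gguf import GGUFReader, GGUFWriter

merged: dict[str, np.ndarray] = {}
for path in ("imatrix-a.gguf", "imatrix-b.gguf"):  # example inputs
    for t in GGUFReader(path).tensors:
        arr = np.asarray(t.data, dtype=np.float32)
        # Sums and counts from separate runs simply add up, which is why
        # the mean is not stored directly in the file.
        merged[t.name] = merged[t.name] + arr if t.name in merged else arr

writer = GGUFWriter("imatrix-merged.gguf", arch="")  # no general.architecture
writer.add_type("imatrix")                           # general.type
for name in sorted(merged):                          # deterministic tensor order
    writer.add_tensor(name, merged[name])
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```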

TODO

  • Compare old llama-quantize using the old imatrix.dat with new llama-quantize using a converted imatrix.gguf
    • Seemed to work, but need to re-test. The resulting quantized model(s) should have the same sha256sum.
  • Test new llama-imatrix at different batch sizes
    • Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
  • Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
  • Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
  • Test --in-file with llama-imatrix
  • (maybe) Implement cleaner general.architecture exclusion.
    • Currently, this uses a subclass to make self.add_architecture() a no-op (see the sketch after this list), but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* methods (in GGUFWriter) which expect self.arch to be something.
    • Or maybe the architecture should be included?
      • What about conversions from older imatrix.dat files?
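
For reference, the subclass approach mentioned above looks roughly like this (the class name is illustrative):

```python
from gguf import GGUFWriter

class IMatrixWriter(GGUFWriter):  # illustrative name
    def add_architecture(self) -> None:
        # Intentionally a no-op: old imatrix.dat files don't record the model
        # architecture, so general.architecture is simply never written.
        pass
```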

@compilade added the enhancement, breaking change, refactoring, examples, python, and Review Complexity : Medium labels on Sep 10, 2024
compilade and others added 2 commits September 10, 2024 11:51
Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
@compilade compilade marked this pull request as draft September 13, 2024 03:11
@compilade (Collaborator, Author)

I'm setting this to "draft" because of concerns raised by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.
