
imatrix : use GGUF to store importance matrices #9400

Draft · compilade wants to merge 13 commits into master

Conversation

compilade (Collaborator)

Follow-up from ikawrakow/ik_llama.cpp#15 (reply in thread).

Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) #6715, some kind of gguf-diff, etc.).

There are multiple problems with imatrix which this PR addresses:

  • Ad-hoc format which isn't really readable by other projects (and which can only be extended backward-compatibly by appending more data at the end)
  • Non-deterministic tensor order which depends on unordered_map iteration order (making sha256sum useless for comparing imatrix files made on the same dataset)
  • Broken behavior at small -ub (intermediate saves happen far too often)
  • Can't use a bigger batch size than the chunk size

Summary of changes

  • Use GGUF to store imatrix data.
    • general.type is imatrix
    • no general.architecture
      • can't really know the architecture from old imatrix files.
    • store *.sums and *.counts for each tensor with imatrix data.
      • *.sums are the sums of activations
        • Stored in F32, like before.
      • *.counts are the number of activations (also the number of tokens), useful to calculate the mean
        • Why not simply store the mean? To allow merging imatrix files together with --in-file (see the first sketch after this list).
        • It's stored in F32 even though the values are integers, because calculating the mean would convert them to F32 anyway to perform the division.
  • Add convert_legacy_imatrix_to_gguf.py to convert old imatrix.dat files to imatrix.gguf
  • Like llama-perplexity since perplexity : support using multiple sequences to allow larger batch sizes #5946, allow computing multiple chunks per batch with llama-imatrix
    • This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
  • Use fused multiply-add (with std::fma) when accumulating the sums of activations
    • Shouldn't hurt, and should somewhat reduce rounding errors
      • (obviously f64 would be even better, but I'm not sure it's worth it yet. For the curious, using double for the intermediate accumulations can be tried by changing only one line in IMatrixStats: vector<float> values to vector<double> values.)
  • Sort the tensor names before serializing
    • This makes the tensor order deterministic; otherwise it depended on the iteration order of unordered_map (see the second sketch after this list).
      • Determinism between runs means sha256sum can be meaningfully used to compare imatrix files generated in very similar conditions.
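
For illustration, here's a minimal sketch of how another tool could read the new format, assuming the gguf-py package from this repo and the *.sums / *.counts naming described above (the file name and variable names are just examples, not part of this PR):

```python
import numpy as np
from gguf import GGUFReader

reader = GGUFReader("imatrix.gguf")  # example file name
data = {t.name: np.asarray(t.data, dtype=np.float32) for t in reader.tensors}

# The mean activation values (what quantization actually consumes) are
# recovered by dividing the stored sums by the stored counts.
means = {}
for name, sums in data.items():
    if name.endswith(".sums"):
        counts = data[name.removesuffix(".sums") + ".counts"]
        means[name.removesuffix(".sums")] = sums / counts
```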
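
And a sketch of what --in-file merging amounts to, and why sorting tensor names gives deterministic output (again a rough illustration with gguf-py; the file names and exact GGUFWriter call sequence are assumptions, not the actual implementation):

```python
import numpy as np
from gguf import GGUFReader, GGUFWriter

merged: dict[str, np.ndarray] = {}
for path in ("imatrix-a.gguf", "imatrix-b.gguf"):  # example inputs
    for t in GGUFReader(path).tensors:
        arr = np.asarray(t.data, dtype=np.float32)
        # Sums and counts from separate runs simply add up, which is why
        # the mean is not stored directly in the file.
        merged[t.name] = merged[t.name] + arr if t.name in merged else arr

writer = GGUFWriter("imatrix-merged.gguf", arch="")  # no general.architecture
writer.add_type("imatrix")                           # general.type
for name in sorted(merged):                          # deterministic tensor order
    writer.add_tensor(name, merged[name])
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.write_tensors_to_file()
writer.close()
```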

TODO

  • Compare old llama-quantize using the old imatrix.dat with new llama-quantize using a converted imatrix.gguf
    • Seemed to work, but need to re-test. The resulting quantized model(s) should have the same sha256sum.
  • Test new llama-imatrix at different batch sizes
    • Same checksums with -ub 64 -b 512 and -ub 512 -b 2048 for a chunk size of 512 (-c 512)
  • Perplexity test(s) with i-quants with old llama-imatrix vs new llama-imatrix
  • Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
  • Test --in-file with llama-imatrix
  • (maybe) Implement cleaner general.architecture exclusion.
    • Currently, this uses a subclass to make self.add_architecture() a no-op (see the sketch after this list), but maybe general.architecture should simply be excluded when self.arch == "". Not sure how to prevent using the other self.add_* methods (in GGUFWriter) which expect self.arch to be something.
    • Or maybe the architecture should be included?
      • What about conversions from older imatrix.dat files?
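
For reference, the subclass approach mentioned above looks roughly like this (the class name is illustrative):

```python
from gguf import GGUFWriter

class IMatrixWriter(GGUFWriter):  # illustrative name
    def add_architecture(self) -> None:
        # Intentionally a no-op: old imatrix.dat files don't record the model
        # architecture, so general.architecture is simply never written.
        pass
```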

@compilade added the enhancement, breaking change, refactoring, examples, python, and Review Complexity : Medium labels on Sep 10, 2024
compilade and others added 2 commits September 10, 2024 11:51
Sums and counts tensors no longer need to be consecutive.

* imatrix : more sanity checks when loading multiple imatrix files

* imatrix : use ggml_format_name instead of std::string concatenation

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
@compilade compilade marked this pull request as draft September 13, 2024 03:11
@compilade (Collaborator, Author)

I'm setting this to "draft" because of concerns raised by @ikawrakow in ikawrakow/ik_llama.cpp#15 (comment) and ikawrakow/ik_llama.cpp#15 (comment) (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).

More details near the end of ikawrakow/ik_llama.cpp#15 (reply in thread).

I'll need some days to think about how to go further with this.
