From 43e8e459c2acfcf538b19e495fbe2edea8abbce9 Mon Sep 17 00:00:00 2001
From: brian khuu
Date: Tue, 14 May 2024 23:38:51 +1000
Subject: [PATCH] gguf.md: add BF16

---
 docs/gguf.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/gguf.md b/docs/gguf.md
index 54f1e9b0b..5bb56c681 100644
--- a/docs/gguf.md
+++ b/docs/gguf.md
@@ -33,9 +33,10 @@ The components are:
    - `M`: Million parameters.
    - `K`: Thousand parameters.
 5. **Quantization**: This part specifies how the model parameters are quantized or compressed.
-   - Uncompressed formats:
-     - `F16`: 16-bit floats per weight
-     - `F32`: 32-bit floats per weight
+   - Floating Representation:
+     - `BF16`: [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) 16-bit [Google Brain](https://en.wikipedia.org/wiki/Google_Brain) truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fractional bits)
+     - `F32`: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fractional bits)
+     - `F16`: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fractional bits)
    - Quantization (Compression) formats:
      - `Q`: X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
      - Variants provide further details on how the quantized weights are interpreted:
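The bit layouts the patch documents imply a simple relationship: bfloat16 is float32 with the low 16 fraction bits dropped, so it keeps float32's full 8-bit exponent range but only 7 fraction bits. As a minimal illustration (not part of the patch, and not how ggml converts tensors — just a sketch of the truncation relationship):

```python
import math
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate an IEEE 754 float32 to bfloat16 by keeping only the top
    16 bits (1 sign bit, 8 exponent bits, 7 fraction bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

# 3.140625 = 1.1001001b x 2^1 fits in 7 fraction bits, so it round-trips exactly.
x = 3.140625
assert bf16_bits_to_f32(f32_to_bf16_bits(x)) == x

# math.pi does not fit: truncation keeps only the leading 7 fraction bits,
# so the nearest value at bfloat16 precision (toward zero) comes back.
assert bf16_bits_to_f32(f32_to_bf16_bits(math.pi)) == 3.140625
```

Because the exponent field is unchanged, a BF16 weight never overflows where its F32 source did not; only fraction precision is lost. F16, by contrast, rearranges the budget (5 exponent bits, 10 fraction bits), trading range for precision.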