From 43e8e459c2acfcf538b19e495fbe2edea8abbce9 Mon Sep 17 00:00:00 2001
From: brian khuu
Date: Tue, 14 May 2024 23:38:51 +1000
Subject: [PATCH] gguf.md: add BF16

---
 docs/gguf.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/docs/gguf.md b/docs/gguf.md
index 54f1e9b0b..5bb56c681 100644
--- a/docs/gguf.md
+++ b/docs/gguf.md
@@ -33,9 +33,10 @@ The components are:
    - `M`: Million parameters.
    - `K`: Thousand parameters.
 5. **Quantization**: This part specifies how the model parameters are quantized or compressed.
-   - Uncompressed formats:
-     - `F16`: 16-bit floats per weight
-     - `F32`: 32-bit floats per weight
+   - Floating Representation:
+     - `BF16`: [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) 16-bit [Google Brain](https://en.wikipedia.org/wiki/Google_Brain) truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fractional bits)
+     - `F32`: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fractional bits)
+     - `F16`: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fractional bits)
    - Quantization (Compression) formats:
      - `Q`: X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
      - Variants provide further details on how the quantized weights are interpreted:
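The bit layouts the patch documents imply a simple relationship: bfloat16 is float32 with the low 16 fraction bits dropped, so it keeps float32's full 8-bit exponent range but only 7 fraction bits. As a minimal illustration (not part of the patch, and not how ggml converts tensors — just a sketch of the truncation relationship):

```python
import math
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate an IEEE 754 float32 to bfloat16 by keeping only the top
    16 bits (1 sign bit, 8 exponent bits, 7 fraction bits)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bf16_bits_to_f32(bits: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]

# 3.140625 = 1.1001001b x 2^1 fits in 7 fraction bits, so it round-trips exactly.
x = 3.140625
assert bf16_bits_to_f32(f32_to_bf16_bits(x)) == x

# math.pi does not fit: truncation keeps only the leading 7 fraction bits,
# so the nearest value at bfloat16 precision (toward zero) comes back.
assert bf16_bits_to_f32(f32_to_bf16_bits(math.pi)) == 3.140625
```

Because the exponent field is unchanged, a BF16 weight never overflows where its F32 source did not; only fraction precision is lost. F16, by contrast, rearranges the budget (5 exponent bits, 10 fraction bits), trading range for precision.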