gguf.md: add BF16
mofosyne committed May 14, 2024
1 parent bda6204 commit 43e8e45
Showing 1 changed file with 4 additions and 3 deletions: docs/gguf.md
@@ -33,9 +33,10 @@ The components are:
   - `M`: Million parameters.
   - `K`: Thousand parameters.
 5. **Quantization**: This part specifies how the model parameters are quantized or compressed.
-  - Uncompressed formats:
-    - `F16`: 16-bit floats per weight
-    - `F32`: 32-bit floats per weight
+  - Floating Representation:
+    - `BF16`: [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format) 16-bit [Google Brain](https://en.wikipedia.org/wiki/Google_Brain) truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fractional bits)
+    - `F32`: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fractional bits)
+    - `F16`: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fractional bits)
   - Quantization (Compression) formats:
     - `Q<X>`: X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits) etc...
     - Variants provide further details on how the quantized weights are interpreted:
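For context on the `BF16` entry added above: bfloat16 is the top half of an IEEE 754 float32 (same sign bit and 8 exponent bits, mantissa truncated from 23 to 7 bits), so conversion can be sketched in a few lines of C. This is a minimal illustration, not code from GGUF or llama.cpp; `f32_to_bf16` is a hypothetical name, and the sketch truncates rather than rounding to nearest even as production converters typically do.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical helper: convert an IEEE 754 float32 to bfloat16 by
 * truncation, keeping the sign bit, all 8 exponent bits, and the top
 * 7 of the 23 mantissa bits. Real converters usually round to
 * nearest even instead of truncating. */
static uint16_t f32_to_bf16(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the float's bit pattern */
    return (uint16_t)(bits >> 16);  /* drop the low 16 mantissa bits */
}

int main(void) {
    /* 3.14159f is 0x40490fd0 as a float32; its bfloat16 truncation is 0x4049. */
    printf("bf16(3.14159f) = 0x%04x\n", f32_to_bf16(3.14159f));
    return 0;
}
```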
