
FP8 KVCache Quantization #575

Closed
ZiyueHuang opened this issue Nov 3, 2023 · 3 comments

@ZiyueHuang

Hi,

Thanks for the amazing project.

I noticed the vector-wise KVCache quantization technique used in this project, which can reduce memory usage and also increase throughput. I have tried FP8 KVCache quantization for Qwen on vLLM, which may cost fewer cycles, and the results look promising; they are posted here.
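
For illustration, here is a minimal PyTorch sketch of per-vector FP8 quantization of a KV cache tensor. This is a reconstruction of the general technique under stated assumptions, not the fork's actual kernel (which is implemented in CUDA with CUTLASS); the e4m3 format and the per-vector scaling scheme here are illustrative choices.

```python
import torch

# Maximum representable magnitude of the float8 e4m3 format.
E4M3_MAX = 448.0

def quantize_kv(kv: torch.Tensor):
    """Symmetric per-vector quantization over the head dimension.

    kv: [num_tokens, num_heads, head_dim] in fp16/bf16.
    Returns an fp8 payload plus one fp16 scale per (token, head) vector.
    Sketch only; torch.float8_e4m3fn requires PyTorch >= 2.1.
    """
    amax = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6)
    scale = amax / E4M3_MAX                    # one scale per vector
    q = (kv / scale).to(torch.float8_e4m3fn)   # values fit in [-448, 448] after scaling
    return q, scale.to(torch.float16)

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back up and re-apply the per-vector scale before attention.
    return q.to(torch.float16) * scale
```

Storing the cache in 8 bits halves KV memory relative to fp16, which is where the throughput gain comes from: more concurrent sequences fit in GPU memory.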

ZiyueHuang changed the title from "KVCache Quantization" to "FP8 KVCache Quantization" on Nov 6, 2023
@JustinLin610 (Member)

I think once it is merged into vLLM, we can use it directly. What would you like us to do about this at the moment?

@leocnj commented Jan 4, 2024

Hi @ZiyueHuang, are we able to use your cool work (FP8 KV cache) inside vLLM now? Thanks

@ZiyueHuang (Author)

@leocnj Thanks for your interest. This feature has not been merged into vLLM. You can use it by installing from source (`pip install -e .` on https://github.com/ZiyueHuang/vllm/tree/kv-cache-quant-v0.2.0); please see ZiyueHuang/vllm@e9dd28e#r131994375 for the CUTLASS setup, which has been tested on V100 with CUDA 11.7. Regarding performance: in some cases the latency of a single request might increase a bit, but the overall throughput will likely increase (i.e., serving more users at the same time).
