
FP8 KVCache Quantization #575

Closed
ZiyueHuang opened this issue Nov 3, 2023 · 3 comments

@ZiyueHuang

Hi,

Thanks for the amazing project.

I noticed the vector-wise KVCache quantization technique used in this project, which can reduce memory usage and also increase throughput. I have tried FP8 KVCache quantization for Qwen on vLLM, which may cost fewer cycles, and the results look promising; they are posted here.
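
For illustration, here is a minimal PyTorch sketch of per-vector FP8 quantization of a KV cache tensor. This is a reconstruction of the general technique under stated assumptions, not the fork's actual kernel (which is implemented in CUDA with CUTLASS); the e4m3 format and the per-vector scaling scheme here are illustrative choices.

```python
import torch

# Maximum representable magnitude of the float8 e4m3 format.
E4M3_MAX = 448.0

def quantize_kv(kv: torch.Tensor):
    """Symmetric per-vector quantization over the head dimension.

    kv: [num_tokens, num_heads, head_dim] in fp16/bf16.
    Returns an fp8 payload plus one fp16 scale per (token, head) vector.
    Sketch only; torch.float8_e4m3fn requires PyTorch >= 2.1.
    """
    amax = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-6)
    scale = amax / E4M3_MAX                    # one scale per vector
    q = (kv / scale).to(torch.float8_e4m3fn)   # values fit in [-448, 448] after scaling
    return q, scale.to(torch.float16)

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back up and re-apply the per-vector scale before attention.
    return q.to(torch.float16) * scale
```

Storing the cache in 8 bits halves KV memory relative to fp16, which is where the throughput gain comes from: more concurrent sequences fit in GPU memory.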

ZiyueHuang changed the title from "KVCache Quantization" to "FP8 KVCache Quantization" on Nov 6, 2023
@JustinLin610 (Member)

I think once it is merged into vLLM, we can use it directly. What would you like us to do about this at the moment?

@leocnj commented Jan 4, 2024

Hi @ZiyueHuang, are we able to use your cool work (FP8 KV cache) inside vLLM now? Thanks

@ZiyueHuang (Author)

@leocnj Thanks for your interest. This feature has not been merged into vLLM. You can use it by installing from source (`pip install -e .` on https://github.com/ZiyueHuang/vllm/tree/kv-cache-quant-v0.2.0); please see ZiyueHuang/vllm@e9dd28e#r131994375 for the CUTLASS setup, which has been tested on V100 with CUDA 11.7. Regarding performance: in some cases the latency of a single request might increase a bit, but the overall throughput will likely increase (i.e., serving more users at the same time).
