SnapKV support #1881
Feature request
https://github.com/FasterDecoding/SnapKV
Motivation
SnapKV: Cache compression technique for faster LLM generation with less compute and memory
In a recent paper, the authors introduced SnapKV, a technique that efficiently compresses the key-value (KV) cache in large language models (LLMs), resulting in faster generation with lower compute overhead and a smaller memory footprint. It compresses the cache by selecting clustered important KV positions for each attention head.
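For concreteness, here is a minimal PyTorch sketch of the selection idea as I understand it from the paper: score prefix positions by the attention they receive from an observation window of recent queries, pool the scores so neighbouring positions are kept together, then keep the top-scoring positions plus the window. The function name `snapkv_compress` and the `window`/`capacity`/`kernel` parameters are illustrative assumptions on my part, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, queries, window=32, capacity=1024, kernel=5):
    """Sketch of SnapKV-style KV cache compression for one layer.

    keys/values/queries: [batch, heads, seq, dim].
    Keeps `capacity` positions per head: the last `window` positions plus the
    prefix positions most attended to by the observation-window queries.
    """
    seq = keys.size(2)
    if seq <= capacity:
        return keys, values  # nothing to compress yet

    # Attention from the last `window` queries onto the prefix.
    obs_q = queries[:, :, -window:]                              # [B, H, W, D]
    attn = obs_q @ keys[:, :, :-window].transpose(-1, -2)        # [B, H, W, L]
    attn = (attn / keys.size(-1) ** 0.5).softmax(dim=-1).sum(dim=2)  # [B, H, L]

    # Pool scores so clustered neighbouring positions are selected together.
    attn = F.avg_pool1d(attn, kernel_size=kernel, padding=kernel // 2, stride=1)

    # Top positions per head, kept in original order.
    topk = attn.topk(capacity - window, dim=-1).indices.sort(dim=-1).values
    idx = topk.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))

    k_sel = torch.cat([keys[:, :, :-window].gather(2, idx),
                       keys[:, :, -window:]], dim=2)
    v_sel = torch.cat([values[:, :, :-window].gather(2, idx),
                       values[:, :, -window:]], dim=2)
    return k_sel, v_sel                                          # [B, H, capacity, D]
```

Note this selects a different set of positions per attention head, which is part of why sharding and adapter support (mentioned below) may complicate an integration.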
Your contribution
I'm not sure how much work this would involve, or whether it is feasible at all (notably enabling sharding with adapters). I'd gladly read any insights on the complexity and relevance of adding this feature.