SnapKV support #1881
Feature request
https://github.com/FasterDecoding/SnapKV
Motivation
SnapKV: Cache compression technique for faster LLM generation with less compute and memory
In a recent paper, the authors introduced SnapKV, a technique that efficiently compresses the key-value (KV) cache in large language models (LLMs), resulting in faster generation with lower compute overhead and a smaller memory footprint. It compresses the cache by selecting clustered important KV positions for each attention head.
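For concreteness, here is a minimal PyTorch sketch of the selection idea as I understand it from the paper: score prefix positions by the attention they receive from an observation window of recent queries, pool the scores so neighbouring positions are kept together, then keep the top-scoring positions plus the window. The function name `snapkv_compress` and the `window`/`capacity`/`kernel` parameters are illustrative assumptions on my part, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def snapkv_compress(keys, values, queries, window=32, capacity=1024, kernel=5):
    """Sketch of SnapKV-style KV cache compression for one layer.

    keys/values/queries: [batch, heads, seq, dim].
    Keeps `capacity` positions per head: the last `window` positions plus the
    prefix positions most attended to by the observation-window queries.
    """
    seq = keys.size(2)
    if seq <= capacity:
        return keys, values  # nothing to compress yet

    # Attention from the last `window` queries onto the prefix.
    obs_q = queries[:, :, -window:]                              # [B, H, W, D]
    attn = obs_q @ keys[:, :, :-window].transpose(-1, -2)        # [B, H, W, L]
    attn = (attn / keys.size(-1) ** 0.5).softmax(dim=-1).sum(dim=2)  # [B, H, L]

    # Pool scores so clustered neighbouring positions are selected together.
    attn = F.avg_pool1d(attn, kernel_size=kernel, padding=kernel // 2, stride=1)

    # Top positions per head, kept in original order.
    topk = attn.topk(capacity - window, dim=-1).indices.sort(dim=-1).values
    idx = topk.unsqueeze(-1).expand(-1, -1, -1, keys.size(-1))

    k_sel = torch.cat([keys[:, :, :-window].gather(2, idx),
                       keys[:, :, -window:]], dim=2)
    v_sel = torch.cat([values[:, :, :-window].gather(2, idx),
                       values[:, :, -window:]], dim=2)
    return k_sel, v_sel                                          # [B, H, capacity, D]
```

Note this selects a different set of positions per attention head, which is part of why sharding and adapter support (mentioned below) may complicate an integration.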
Your contribution
I'm not sure how much work this would involve, or whether it is feasible at all (notably enabling sharding with adapters). I'd gladly read any insights on the complexity and relevance of adding this feature.