Efficient KV-Cache Compression for Long-Context and Reasoning Models
ML+X Forum
Event Details
As researchers across disciplines build new LLM-based tools and assistants, understanding how these models manage long inputs and large memory footprints is becoming essential. Join us for our next ML+X forum with Zefan Cai to explore methods that make today's large language models more efficient to deploy—whether you're running RAG pipelines, research copilots, or chat-based analysis tools. Even if you're not focused on model internals, this session offers insight into techniques that directly impact cost, speed, and scalability for real-world applications at UW–Madison.
Please fill out the registration form if you plan to attend (light refreshments provided).
When: Tuesday, Nov 4, 1–2 pm CT.
Where: Orchard View, Discovery (and Zoom — passcode 111195)
Abstract: Large language models (LLMs) increasingly handle very long input contexts, and their inference relies on storing key-value (KV) caches for past tokens to avoid redundant computation. However, as context length grows, the memory footprint of the full KV cache becomes a major bottleneck. In this talk, I will present two complementary approaches to compressing the KV cache, Pyramid KV and R-KV (Redundancy-aware KV Cache Compression), highlighting their underlying principles, trade-offs, and practical benefits for inference efficiency. Pyramid KV is motivated by the observation that in transformer-based LLMs, attention flows from broad scopes in lower layers to narrow, focused contexts in higher layers (“pyramidal information funneling”). By allocating more cache budget to lower layers and gradually reducing it in higher layers, Pyramid KV achieves near-full performance while retaining only ~12% of the full KV cache on long-context benchmarks. Building on this, R-KV targets reasoning-heavy tasks (e.g., chain-of-thought) in which long outputs produce very large KV caches. R-KV identifies and prunes redundant tokens in the cache, enabling roughly 90% memory savings and a ~6.6× throughput improvement while preserving, or even slightly improving, accuracy compared to the full cache.
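To make the two ideas in the abstract concrete, here is a minimal sketch (not the papers' actual algorithms) of what "pyramidal" per-layer budget allocation and redundancy-aware token selection might look like. The function names, the linear budget schedule, the attention-sum importance score, and the `alpha` trade-off are illustrative assumptions, not details from Pyramid KV or R-KV.

```python
import torch
import torch.nn.functional as F

def pyramid_budgets(total_budget: int, num_layers: int, min_keep: int = 32) -> list[int]:
    """Split a total KV-cache token budget across layers, giving lower layers
    a larger share and upper layers a smaller one.
    The linear 2:1 schedule here is a placeholder, not Pyramid KV's rule."""
    weights = torch.linspace(2.0, 1.0, num_layers)            # more budget for layer 0
    budgets = (weights / weights.sum() * total_budget).long()
    return budgets.clamp(min=min_keep).tolist()

def keep_important_tokens(attn_scores: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep the `budget` past tokens that receive the most attention,
    summed over heads and recent query positions.
    attn_scores: [num_heads, num_queries, num_keys]."""
    importance = attn_scores.sum(dim=(0, 1))                  # importance per key token
    keep = torch.topk(importance, k=min(budget, importance.numel())).indices
    return torch.sort(keep).values                            # preserve token order

def redundancy_aware_keep(key_states: torch.Tensor,
                          importance: torch.Tensor,
                          budget: int,
                          alpha: float = 0.5) -> torch.Tensor:
    """R-KV-style selection (sketch): trade importance off against a
    redundancy penalty so near-duplicate tokens are dropped first.
    key_states: [num_tokens, head_dim]; importance: [num_tokens]."""
    k = F.normalize(key_states, dim=-1)
    sim = k @ k.T                                             # cosine similarity between tokens
    redundancy = (sim - torch.eye(k.size(0))).max(dim=-1).values
    score = alpha * importance - (1.0 - alpha) * redundancy
    keep = torch.topk(score, k=min(budget, score.numel())).indices
    return torch.sort(keep).values
```

The intuition to take away: the first function spends a fixed memory budget unevenly across layers, and the last one keeps tokens that are both attended to and not near-duplicates of other cached tokens. The talk will cover how the actual methods make these choices and what trade-offs they entail.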
Finding Orchard View: The Orchard View room is located on the 3rd floor of Discovery Building—room 3280. To get to the third floor, take the elevator located next to Aldo’s Cafe kitchen (see photo).
This talk is part of a monthly forum hosted by the ML+X community at UW-Madison. Join our Google group to be notified of future events!