A Memory Perspective of Transformer LLMs: Observations and Efficient Inference
Professor Junjie Hu (Computer Sciences and BMI)
Event Details
Large Language Models (LLMs) rely heavily on the Key-Value (KV) cache to store their attention context, which serves as the model's short-term memory. While crucial for performance, the cache grows linearly with context length, quickly becoming a critical memory and latency bottleneck, especially for modern applications involving long prompts or long chain-of-thought (CoT) reasoning. This talk introduces a memory perspective for analyzing Transformer-based LLMs, presenting observations on how contextual significance is distributed and how redundancy accumulates within the cache. Building on these findings, we present two methods that improve inference efficiency: PyramidKV (COLM 2025), which performs dynamic KV cache compression based on pyramidal information funneling to preserve essential, structured information and accelerate long-context understanding, and R-KV (NeurIPS 2025), which applies redundancy-aware KV cache compression to prune accumulating redundant entries, specifically targeting the memory demands of long CoT reasoning. Collectively, this research provides a novel framework for analyzing and optimizing LLM memory, enabling substantial acceleration and memory savings in practical deployment.
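(To make the linear-growth bottleneck concrete, here is a minimal back-of-the-envelope Python sketch. The model configuration below, a hypothetical 32-layer model with 32 KV heads of dimension 128 stored in fp16, is assumed for illustration only and is not taken from the talk.)

    # Back-of-the-envelope KV cache size; all model dimensions are illustrative assumptions.
    def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                       batch_size=1, bytes_per_elem=2):
        # Factor of 2 covers both keys and values; fp16/bf16 take 2 bytes per element.
        return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

    # Hypothetical 7B-class configuration at a 128k-token context:
    gib = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=128_000) / 2**30
    print(f"{gib:.1f} GiB")  # ~62.5 GiB for a single sequence

(Because the footprint scales directly with sequence length, long prompts and long CoT traces quickly exhaust GPU memory, which is the setting the compression methods above target.)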
(This talk is part of the weekly Machine Learning Lunch Meetings (MLLM), held every Tuesday from 12:15 to 1:15 p.m. Professors from Computer Sciences, Statistics, ECE, the iSchool, and other departments will discuss their latest research in machine learning, covering both theory and applications. This is a great opportunity to network with faculty and fellow researchers, learn about cutting-edge research at our university, and foster new collaborations. For the talk schedule, please visit https://sites.google.com/view/wiscmllm/home. To receive future weekly talk announcements, please subscribe to our UW Google Group at https://groups.google.com/u/1/a/g-groups.wisc.edu/g/mllm.)
We value inclusion and access for all participants and are pleased to provide reasonable accommodations for this event. Please email jerryzhu@cs.wisc.edu to make a disability-related accommodation request. Reasonable effort will be made to support your request.