Trellis: Compressing KV Memory in Transformers

Published: Dec 29, 2025 20:32
1 min read
ArXiv

Analysis

This paper addresses the quadratic attention cost and the growing Key-Value (KV) cache memory that constrain Transformers in long-context applications. By introducing Trellis, a novel architecture that dynamically compresses the KV cache into a fixed-size memory, the authors propose a practical route to better efficiency and scalability. The key innovation is a two-pass recurrent compression mechanism whose memory is updated by online gradient descent with a forget gate. The reported performance gains, which grow with sequence length, suggest significant potential for long-context tasks.
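
The summary does not spell out the exact update rule, but "online gradient descent with a forget gate" suggests a fast-weight-style memory that takes one gradient step per token on a write objective while decaying old content. Below is a minimal sketch under that assumption; the function name `update_memory`, the squared-error write loss, and the `forget`/`lr` parameters are illustrative, not the paper's actual interface.

```python
import torch

def update_memory(M, k, v, forget, lr=0.1):
    """One online-gradient-descent write into a fixed-size memory.

    Hypothetical sketch: takes a single gradient step on the write loss
    0.5 * ||M @ k - v||^2 for the new (k, v) pair, while a forget gate
    decays previously stored content.
    M: (d_v, d_k) memory matrix; k: (d_k,); v: (d_v,); forget in (0, 1).
    """
    grad = torch.outer(M @ k - v, k)  # dL/dM for the squared write loss
    return forget * M - lr * grad

# Usage: fold a stream of key/value pairs into constant-size state.
d_k, d_v, T = 64, 64, 1024
M = torch.zeros(d_v, d_k)
for k, v in zip(torch.randn(T, d_k), torch.randn(T, d_v)):
    M = update_memory(M, k, v, forget=0.99)
read = M @ torch.randn(d_k)  # query the memory instead of a growing cache
```

Whatever the paper's precise parameterization, the appeal is visible in the sketch: the state stays O(d_k · d_v) no matter how long the sequence grows, which is exactly what a standard KV cache cannot offer.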

Reference

Trellis replaces the standard KV cache with a fixed-size memory and trains a two-pass recurrent compression mechanism to store new keys and values into memory.
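
The "two-pass" design is not detailed in this summary; as one plausible reading, a first pass could score per-token gates before a second pass folds keys and values into the memory. The following sketch is a hypothetical illustration of that reading, reusing the forget-gated gradient step from above, with `gate_w` as an assumed gate projection:

```python
import torch

def two_pass_compress(keys, values, gate_w, lr=0.1):
    """Two-pass compression sketch (hypothetical parameterization).

    Pass 1 scores a per-token forget gate from the keys; pass 2 writes
    each (k, v) pair into the fixed-size memory with one gradient step
    on the write loss 0.5 * ||M @ k - v||^2.
    keys: (T, d_k); values: (T, d_v); gate_w: (d_k,) gate projection.
    """
    gates = torch.sigmoid(keys @ gate_w)               # pass 1: (T,) gates
    M = torch.zeros(values.shape[-1], keys.shape[-1])  # fixed-size memory
    for k, v, g in zip(keys, values, gates):           # pass 2: write loop
        M = g * M - lr * torch.outer(M @ k - v, k)     # forget-gated GD step
    return M

T, d = 512, 64
M = two_pass_compress(torch.randn(T, d), torch.randn(T, d), torch.randn(d))
```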