Search:
Match:
1 results

Analysis

This paper proposes a novel approach to long-context language modeling by framing it as a continual learning problem. The core idea is to use a standard Transformer architecture with sliding-window attention and enable the model to learn at test time through next-token prediction. This End-to-End Test-Time Training (TTT-E2E) approach, combined with meta-learning for improved initialization, demonstrates impressive scaling properties, matching full attention performance while maintaining constant inference latency. This is a significant advancement as it addresses the limitations of existing long-context models, such as Mamba and Gated DeltaNet, which struggle to scale effectively. The constant inference latency is a key advantage, making it faster than full attention for long contexts.
Reference

TTT-E2E scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context.