Research Paper · Language Modeling, Transformers, Continual Learning, Test-Time Training · 🔬 Research · Analyzed: Jan 3, 2026 16:01
End-to-End Test-Time Training for Long Context Language Modeling
Published: Dec 29, 2025 18:30 · 2 min read · ArXiv
Analysis
This paper proposes a novel approach to long-context language modeling by framing it as a continual learning problem. The core idea is to use a standard Transformer architecture with sliding-window attention and let the model keep learning at test time through next-token prediction on the incoming context. This End-to-End Test-Time Training (TTT-E2E) approach, combined with meta-learning for a better initialization, scales with context length the same way full attention does while keeping inference latency constant. This is a significant advance over existing long-context alternatives such as Mamba 2 and Gated DeltaNet, which do not scale with context length in the same way. The constant inference latency is a key practical advantage: at 128K context, TTT-E2E is reported to be 2.7 times faster than full attention.
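The mechanism described above, continuing next-token-prediction updates on the observed context at inference time, can be sketched roughly as follows. This is a minimal illustration assuming a PyTorch-style causal LM; the function name, the chunking scheme, and the plain optimizer step are assumptions for illustration, not the paper's exact recipe (which additionally relies on a meta-learned initialization).

```python
# Minimal sketch of test-time training via next-token prediction, assuming a
# PyTorch-style causal LM that maps token ids (B, T) to logits (B, T, V).
# Function name, chunk size, and optimizer choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def ttt_next_token_logits(model, optimizer, context_ids, chunk_size=512):
    """Adapt `model` on the observed context via next-token prediction,
    then return logits for the token that follows the context."""
    model.train()
    # Continual learning at test time: gradient steps on successive context chunks.
    for start in range(0, context_ids.size(1) - 1, chunk_size):
        chunk = context_ids[:, start:start + chunk_size + 1]
        inputs, targets = chunk[:, :-1], chunk[:, 1:]
        logits = model(inputs)                                  # (B, T, V)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()      # long-range information is absorbed into the weights
    model.eval()
    with torch.no_grad():
        return model(context_ids)[:, -1, :]                     # next-token logits
```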
Key Takeaways
- Proposes End-to-End Test-Time Training (TTT-E2E), which frames long-context language modeling as continual learning at inference time.
- Uses a standard Transformer architecture with sliding-window attention (see the mask sketch after this list).
- Matches the scaling of full attention with context length while keeping inference latency constant.
- Scales better with context length than Mamba 2 and Gated DeltaNet.
- Is 2.7× faster than full attention at 128K context, thanks to constant inference latency.
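The constant inference latency comes from the sliding-window attention: each token attends only to a fixed-size local window, so per-token attention cost does not grow with context length, while longer-range information is handled by the test-time updates. A minimal sketch of such a mask, with the window size as an assumed hyperparameter:

```python
# Hedged sketch of a sliding-window causal attention mask. Each query attends
# only to itself and the previous (window - 1) tokens, so per-token attention
# cost stays constant as the context grows.
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True means "may attend"."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions (column vector)
    j = torch.arange(seq_len).unsqueeze(0)   # key positions (row vector)
    causal = j <= i                          # never attend to future tokens
    local = (i - j) < window                 # only the most recent `window` tokens
    return causal & local
```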
Reference
“TTT-E2E scales with context length in the same way as Transformer with full attention, while others, such as Mamba 2 and Gated DeltaNet, do not. However, similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7 times faster than full attention for 128K context.”