Learning Dynamic Global Attention in LLMs
Published: Dec 27, 2025 11:21
• 1 min read
• ArXiv
Analysis
This paper introduces All-or-Here Attention (AHA), a method that lets Large Language Models (LLMs) dynamically decide when to attend to global context. This matters because full attention is a major computational bottleneck in LLM inference. AHA uses a binary router to switch between local sliding-window attention and full attention, so global context is accessed only when it is actually needed. The findings suggest that full attention is often redundant and that efficient inference can be achieved with on-demand global context access, with implications for the efficiency and scalability of LLMs.
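To make the routing idea concrete, below is a minimal PyTorch sketch assuming a per-query binary gate. The module name, gate design, window size, and 0.5 threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AHARoutedAttention(nn.Module):
    """Minimal sketch of binary-routed attention: each query is routed to
    either full causal attention or local sliding-window attention.
    Hypothetical names and hyperparameters, not the paper's implementation."""

    def __init__(self, d_model: int, window: int = 256):
        super().__init__()
        self.window = window
        self.gate = nn.Linear(d_model, 1)  # scores "does this query need global context?"

    def forward(self, x, q, k, v):
        # x: (B, T, d_model) hidden states fed to the router
        # q, k, v: (B, H, T, head_dim) projected attention inputs
        B, H, T, D = q.shape
        scores = q @ k.transpose(-2, -1) / D ** 0.5            # (B, H, T, T)

        # Causal mask shared by both branches
        idx = torch.arange(T, device=q.device)
        causal = idx[None, :] <= idx[:, None]                  # (T, T)
        # Local branch: additionally restrict keys to a trailing window
        local = causal & (idx[:, None] - idx[None, :] < self.window)

        # Hard binary routing decision per query token (inference-time view;
        # training would need a differentiable relaxation, e.g. straight-through)
        use_global = torch.sigmoid(self.gate(x)) > 0.5         # (B, T, 1) bool

        # Per query row, pick the full causal mask or the sliding-window mask
        mask = torch.where(use_global.unsqueeze(1), causal, local)  # (B, 1, T, T)

        scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v                   # (B, H, T, head_dim)
```

Note that this sketch only illustrates the routing decision through masks; it still materializes the full score matrix. The actual inference savings would come from dispatching locally routed queries to a windowed kernel that never reads the full KV cache.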
Key Takeaways
- Proposes All-or-Here Attention (AHA) to dynamically control global attention in LLMs.
- AHA uses a binary router to switch between full and local sliding-window attention.
- Demonstrates a significant reduction in full attention operations without performance degradation.
- Highlights the redundancy of full attention and the importance of on-demand global context access for efficient inference.
Reference
“Up to 93% of full attention operations can be replaced by sliding window attention without performance loss.”