Gradient Dynamics of Attention in Transformers

Research Paper 🔬 | Tags: Transformer Attention, Gradient Descent, Bayesian Inference | Analyzed: Jan 3, 2026 16:27
Published: Dec 27, 2025 05:31
1 min read
ArXiv

Analysis

This paper provides a first-order analysis of how cross-entropy training shapes the attention scores and value vectors of a transformer attention head. Its core results are an 'advantage-based routing law' for scores, under which a value gains attention in proportion to how much better it serves the loss than the attention-weighted average, and a 'responsibility-weighted update' for values, under which each value is updated in proportion to the attention it receives. Together these induce a positive feedback loop that drives queries and values to specialize: useful values attract more attention, and more attention concentrates their updates. The work connects optimization (gradient flow) to geometry (Bayesian manifolds) and function (probabilistic reasoning), offering insight into how transformers learn.
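To make the first-order picture concrete, here are the standard gradients of a scalar loss through a single softmax-attention head, for one query attending over values v_1, …, v_n. This is a sketch in my own notation; it mirrors the paper's terminology ('advantage', 'responsibility') but not necessarily its exact formulation.

```latex
% Setup: scores s_j, attention weights a, head output o, upstream gradient g.
a_j = \frac{e^{s_j}}{\sum_k e^{s_k}}, \qquad
o = \sum_j a_j v_j, \qquad
g = \frac{\partial L}{\partial o}

% Advantage-based routing law: a score's gradient is its attention weight
% times the value's advantage over the attention-weighted average.
\frac{\partial L}{\partial s_j} = a_j \Big( g^\top v_j - \sum_k a_k\, g^\top v_k \Big)

% Responsibility-weighted update: each value's gradient is the upstream
% gradient scaled by the attention (responsibility) it received.
\frac{\partial L}{\partial v_j} = a_j\, g
```

Since gradient descent steps against these gradients, the two identities reinforce each other: values that help a query gain score, and higher-scoring values receive larger updates, which is the positive feedback loop the paper identifies.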
Reference / Citation
"The core result is an 'advantage-based routing law' for attention scores and a 'responsibility-weighted update' for values, which together induce a positive feedback loop."
— ArXiv, Dec 27, 2025 05:31
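
As a quick numerical sanity check of those identities (not code from the paper), the NumPy sketch below compares the advantage-based score gradient against finite differences; the surrogate loss L = g·o and all variable names are illustrative assumptions.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 5, 4                     # keys/values attended to, value dimension
s = rng.normal(size=n)          # attention scores for a single query
V = rng.normal(size=(n, d))     # value vectors
g = rng.normal(size=d)          # upstream gradient dL/do at the head output

a = softmax(s)                  # attention weights ("responsibilities")
o = a @ V                       # head output; surrogate loss L = g @ o

# Advantage-based routing law: dL/ds_j = a_j * (g.v_j - sum_k a_k g.v_k),
# i.e. each score moves by its value's advantage over the attended average.
util = V @ g
grad_s = a * (util - a @ util)

# Responsibility-weighted update: dL/dv_j = a_j * g (exact, since L is
# linear in V); each value is pulled along g in proportion to its attention.
grad_V = np.outer(a, g)

# Finite-difference check of the routing law.
eps = 1e-6
fd_s = np.array([(g @ (softmax(s + eps * np.eye(n)[j]) @ V) - g @ o) / eps
                 for j in range(n)])
assert np.allclose(grad_s, fd_s, atol=1e-4)
print("advantage-based routing gradient matches finite differences")
```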