Analysis

This paper gives a first-order analysis of how cross-entropy training shapes attention scores and value vectors in transformer attention heads. It derives an 'advantage-based routing law' for the scores and a 'responsibility-weighted update' for the values; together these induce a positive feedback loop that drives the specialization of queries and values. The work connects optimization (gradient flow) to geometry (Bayesian manifolds) and function (probabilistic reasoning), offering insight into how transformers learn.
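
To first order, both laws follow from the standard softmax gradient identity, so they can be checked numerically. The sketch below is a minimal JAX example assuming a toy single-query head o = sum_j softmax(s)_j v_j with a linear scalar loss (the setup and variable names are illustrative assumptions, not the paper's notation): the score gradient comes out as a_j (g_j - g_bar), where g_j = <dL/do, v_j> and g_bar = sum_k a_k g_k (the advantage-based routing law), and the value gradient as a_j dL/do (the responsibility-weighted update).

```python
import jax
import jax.numpy as jnp

# Toy single-query softmax attention head (illustrative setup, not the
# paper's notation): scores s over n key positions, values v_j in R^d,
# output o = sum_j softmax(s)_j * v_j, and a linear scalar loss
# L = <w, o>, so that dL/do = w.

def attn_output(s, v):
    a = jax.nn.softmax(s)  # attention weights a_j
    return a @ v           # o = sum_j a_j v_j

def loss(s, v, w):
    return jnp.dot(w, attn_output(s, v))

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
n, d = 5, 3
s = jax.random.normal(k1, (n,))
v = jax.random.normal(k2, (n, d))
w = jax.random.normal(k3, (d,))

# Closed-form predictions of the two laws.
a = jax.nn.softmax(s)
g = v @ w                         # g_j = <dL/do, v_j>: position j's contribution
advantage = a * (g - a @ g)       # routing law: a_j * (g_j - sum_k a_k g_k)
responsibility = jnp.outer(a, w)  # value update: a_j * dL/do

# Autodiff gradients of the actual loss.
ds = jax.grad(loss, argnums=0)(s, v, w)
dv = jax.grad(loss, argnums=1)(s, v, w)

print(jnp.allclose(ds, advantage, atol=1e-6))       # True
print(jnp.allclose(dv, responsibility, atol=1e-6))  # True
```

The feedback loop is visible in these two expressions: the descent update raises s_j exactly when v_j aligns better than the attention-weighted average with -dL/do, and the resulting larger weight a_j in turn scales up position j's value update, reinforcing its specialization.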
Reference

The core result is an 'advantage-based routing law' for attention scores and a 'responsibility-weighted update' for values, which together induce a positive feedback loop.