Gradient Descent as Implicit EM in Distance-Based Neural Models
Analysis
Key Takeaways
- Gradient descent on distance/energy-based objectives implicitly performs expectation-maximization (EM); see the sketch after this list.
- This unifies unsupervised learning, attention, and classification under a single mechanism.
- Bayesian behavior in transformers is a consequence of objective geometry, not an emergent property.
- Optimization and inference are the same process in these models.
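To make the first takeaway concrete, here is a minimal numpy sketch (an illustration under assumptions, not code from the source: the single data point `x`, the three centroids `mu`, and the isotropic-Gaussian objective are all invented for the demonstration). The gradient of a distance-based negative log-likelihood factors into EM's posterior responsibilities times the usual centroid pull, so a gradient step is a responsibility-weighted partial M-step.

```python
# Minimal sketch: gradient descent on a log-sum-exp distance objective
# implicitly computes EM's E-step responsibilities (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=2)            # one data point (hypothetical)
mu = rng.normal(size=(3, 2))      # three component centroids (hypothetical)

def neg_log_lik(mu):
    # d_j = ||x - mu_j||^2; NLL = -log sum_j exp(-d_j), an isotropic
    # Gaussian mixture likelihood up to additive constants.
    d = ((x - mu) ** 2).sum(axis=1)
    return -np.log(np.exp(-d).sum())

# E-step quantity: posterior responsibilities r_j = softmax(-d)_j.
d = ((x - mu) ** 2).sum(axis=1)
r = np.exp(-d) / np.exp(-d).sum()

# Chain rule: dNLL/dmu_j = (dNLL/dd_j) * (dd_j/dmu_j) = r_j * 2 (mu_j - x),
# so each centroid is pulled toward x in proportion to its responsibility.
grad_analytic = 2.0 * r[:, None] * (mu - x)

# Central-difference check that the analytic gradient is right.
eps = 1e-6
grad_fd = np.zeros_like(mu)
for j in range(mu.shape[0]):
    for k in range(mu.shape[1]):
        mu_p, mu_m = mu.copy(), mu.copy()
        mu_p[j, k] += eps
        mu_m[j, k] -= eps
        grad_fd[j, k] = (neg_log_lik(mu_p) - neg_log_lik(mu_m)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```

A descent step `mu -= lr * grad_analytic` therefore moves each centroid toward the data in proportion to `r_j`, which is the soft-assignment update EM would make: the E-step is never computed explicitly, it appears inside the gradient.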
“For any objective with log-sum-exp structure over distances or energies, the gradient with respect to each distance is exactly the negative posterior responsibility of the corresponding component: $\partial L / \partial d_j = -r_j$.”
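The quoted identity is easy to check numerically. The sketch below assumes the log-likelihood convention $L(d) = \log \sum_k e^{-d_k}$ with responsibilities $r_j = e^{-d_j} / \sum_k e^{-d_k}$ (a softmax over negative distances); under the negated objective that gradient descent actually minimizes, the sign flips to $+r_j$.

```python
# Numerical check of dL/dd_j = -r_j for L(d) = log sum_k exp(-d_k)
# (assumed convention; the distances here are arbitrary test values).
import numpy as np

rng = np.random.default_rng(1)
d = rng.uniform(0.0, 5.0, size=4)         # arbitrary distances/energies

r = np.exp(-d) / np.exp(-d).sum()         # posterior responsibilities

L = lambda v: np.log(np.exp(-v).sum())
eps = 1e-6
grad = np.array([
    (L(d + eps * np.eye(4)[j]) - L(d - eps * np.eye(4)[j])) / (2 * eps)
    for j in range(4)
])

assert np.allclose(grad, -r, atol=1e-6)   # dL/dd_j == -r_j, as quoted
```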