DeepSeek's mHC: Improving Residual Connections
Analysis
Key Takeaways
- DeepSeek's mHC makes residual connections more flexible by mixing signals through learnable matrices, while keeping training stable.
- The core innovation is a doubly stochastic constraint on those learnable matrices, which prevents gradient explosion.
- mHC is reported to improve both stability and performance over standard baselines.
“DeepSeek solved the instability by constraining the learnable matrices to be "Double Stochastic" (all elements ≧ 0, rows/cols sum to 1). Mathematically, this forces the operation to act as a weighted average (convex combination). It guarantees that signals are never amplified beyond control, regardless of network depth.”
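The bounded-signal claim in the quote can be checked with a small toy experiment. The sketch below is illustrative only and is not DeepSeek's implementation: it assumes scalar "streams", random mixing weights, and Sinkhorn-Knopp normalization as one common way to obtain a (near) doubly stochastic matrix. The function names `sinkhorn`, `mix`, and `peak_after` are hypothetical.

```python
import math
import random


def sinkhorn(logits, n_iters=50):
    """Push a square matrix toward a doubly stochastic one (Sinkhorn-Knopp)."""
    n = len(logits)
    # Exponentiate so every entry is strictly positive.
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(n_iters):
        # Scale each column to sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
        # Scale each row to sum to 1 (done last, so rows are exact up to rounding).
        m = [[v / sum(row) for v in row] for row in m]
    return m


def mix(matrix, streams):
    """Mix the streams: out[i] = sum_j matrix[i][j] * streams[j]."""
    return [sum(w * x for w, x in zip(row, streams)) for row in matrix]


def peak_after(depth, constrained, n=4, seed=0):
    """Return (initial peak, final peak) magnitude after `depth` mixing layers."""
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = max(abs(v) for v in x)
    for _ in range(depth):
        # Assumption: random weights stand in for learned ones.
        w = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
        if constrained:
            w = sinkhorn(w)
        x = mix(w, x)
    return start, max(abs(v) for v in x)


start, constrained_peak = peak_after(100, constrained=True)
_, free_peak = peak_after(100, constrained=False)
print(f"initial peak:        {start:.3e}")
print(f"constrained, 100 L:  {constrained_peak:.3e}")
print(f"unconstrained, 100 L: {free_peak:.3e}")
```

Because every row of the constrained matrix is non-negative and sums to 1, each output is a weighted average of the inputs, so its magnitude cannot exceed the largest input magnitude at any depth. The unconstrained run has no such bound, which is the instability the constraint is meant to remove.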