Improving Mixture-of-Experts with Expert-Router Coupling
Published: Dec 29, 2025 13:03 · 1 min read · ArXiv
Analysis
This paper addresses a key limitation of Mixture-of-Experts (MoE) models: the misalignment between the router's decisions and the experts' capabilities. The proposed Expert-Router Coupling (ERC) loss tightly couples the router and experts, improving performance and yielding insights into expert specialization. Because the loss is computed over one proxy token per expert rather than over the input batch, its cost depends only on the number of experts; this fixed computational cost, independent of batch size, is a significant advantage over previous methods.
Key Takeaways
- Proposes a novel Expert-Router Coupling (ERC) loss to improve MoE models.
- The ERC loss tightly couples the router's decisions with expert capabilities.
- Computationally efficient, with a fixed cost independent of batch size.
- Demonstrates improved performance on MoE-LLMs ranging from 3B to 15B parameters.
- Provides flexible control and tracking of expert specialization levels.
Reference
“The ERC loss enforces two constraints: (1) Each expert must exhibit higher activation for its own proxy token than for the proxy tokens of any other expert. (2) Each proxy token must elicit stronger activation from its corresponding expert than from any other expert.”