Hyperparameter Transfer for Efficient Model Scaling
Analysis
This paper addresses the challenge of hyperparameter tuning for large-scale models. It extends prior work on hyperparameter transfer by unifying scaling rules across width, depth, batch size, and training duration under a single parameterisation. The key contribution is the study of per-module hyperparameter optimization and transfer, demonstrating that optimal hyperparameters found on smaller proxy models carry over effectively to larger models and yield significant training speedups, particularly for Large Language Models. The result is a practical recipe for reducing the cost of tuning large models.
Key Takeaways
- Proposes a Complete^{(d)} Parameterisation to unify scaling across width, depth, batch size, and training duration.
- Investigates per-module hyperparameter optimization and transfer (see the sketch below).
- Demonstrates significant training speed improvements in Large Language Models with transferred per-module hyperparameters.
- Provides practical guidelines for navigating the high-dimensional hyperparameter landscape.
“The paper demonstrates that, with the right parameterisation, hyperparameter transfer holds even in the per-module hyperparameter regime.”
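To illustrate the per-module transfer idea, here is a minimal sketch of how per-module learning rates tuned on a narrow proxy model might be rescaled when widening the model. It assumes standard muP-style width-scaling rules for Adam (hidden and output matrices with width-sized fan-in get their learning rate scaled by base_width / width); the paper's Complete^{(d)} parameterisation additionally covers depth, batch size, and training duration, whose exact rules are not reproduced here. The toy model, the `per_module_param_groups` helper, and the tuned values are hypothetical.

```python
# Sketch: per-module hyperparameter transfer under an assumed muP-style
# width-scaling rule for Adam. Not the paper's exact Complete^(d) recipe.
import torch
import torch.nn as nn


def build_mlp(width: int, depth: int, d_in: int = 32, d_out: int = 8) -> nn.Sequential:
    """Toy MLP whose hidden width is scaled up after tuning on a small proxy."""
    layers = [nn.Linear(d_in, width), nn.ReLU()]
    for _ in range(depth - 1):
        layers += [nn.Linear(width, width), nn.ReLU()]
    layers += [nn.Linear(width, d_out)]
    return nn.Sequential(*layers)


def per_module_param_groups(model: nn.Module, base_lrs: dict, base_width: int, width: int):
    """Turn per-module base learning rates (tuned at base_width) into optimizer
    parameter groups for a wider model. Layers whose fan-in grows with width get
    their learning rate rescaled by base_width / width (the usual muP rule for
    Adam on matrix-like weights); other layers keep their tuned value."""
    groups = []
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        base_lr = base_lrs.get(name, 1e-3)          # hypothetical tuned value per module
        width_scaled = module.in_features == width  # fan-in grows with width
        scale = base_width / width if width_scaled else 1.0
        groups.append({"params": module.parameters(), "lr": base_lr * scale})
    return groups


# Hypothetical per-module learning rates found by sweeping on a small proxy (width 128).
base_width, big_width = 128, 1024
tuned_lrs = {"0": 3e-3, "2": 1e-3, "4": 2e-3}  # keyed by module name in the Sequential

big_model = build_mlp(width=big_width, depth=2)
optimizer = torch.optim.Adam(per_module_param_groups(big_model, tuned_lrs, base_width, big_width))
```

The point of the sketch is that the per-module base learning rates are swept only once, at the proxy width; widening the model changes nothing except the deterministic rescaling inside `per_module_param_groups`.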