Hyperparameter Transfer for Efficient Model Scaling

Published: Dec 26, 2025 20:56
1 min read
ArXiv

Analysis

This paper addresses the critical challenge of hyperparameter tuning in large-scale models. It extends existing work on hyperparameter transfer by unifying scaling across width, depth, batch size, and training duration. The key contribution is an investigation of per-module hyperparameter optimization and transfer, demonstrating that optimal hyperparameters found on smaller models can be applied effectively to larger ones, yielding significant training speedups, particularly for Large Language Models. This is a practical contribution to the efficiency of large-model training.
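To make the transfer mechanism concrete, here is a minimal sketch of per-module learning-rate scaling in the spirit of μP-style parameterisations, assuming an Adam-style rule where hidden-weight learning rates shrink as 1/width. The function name, the module grouping heuristic, and the base widths are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch: per-module learning rates rescaled by width so that
# hyperparameters tuned on a small proxy transfer to a wider model.
# (Assumed muP-style Adam rule; grouping heuristic is illustrative.)
import torch
import torch.nn as nn

def per_module_param_groups(model: nn.Module, base_width: int,
                            width: int, base_lr: float):
    """Build optimizer groups whose learning rates are rescaled by width."""
    mult = base_width / width  # hidden-weight LR ~ 1/width under the assumed rule
    groups = []
    for name, param in model.named_parameters():
        if param.ndim >= 2 and "embed" not in name:
            lr = base_lr * mult   # matrix-like (hidden) weights get scaled LR
        else:
            lr = base_lr          # biases, norms, embeddings keep the base LR
        groups.append({"params": [param], "lr": lr})
    return groups

# Usage: tune base_lr once on a narrow proxy (width=256),
# then reuse it unchanged when instantiating the wide model.
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
opt = torch.optim.Adam(
    per_module_param_groups(model, base_width=256, width=4096, base_lr=3e-3))
```

In this setup, only the small proxy is ever swept; the wide model inherits its tuned values, which is the practical payoff the paper argues for.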

Reference

The paper demonstrates that, with the right parameterisation, hyperparameter transfer holds even in the per-module regime.