Joint Data Selection for LLM Pre-training
Analysis
This paper addresses the challenge of efficiently selecting high-quality and diverse data for pre-training large language models (LLMs) at a massive scale. The authors propose DATAMASK, a policy gradient-based framework that jointly optimizes quality and diversity metrics, overcoming the computational limitations of existing methods. The significance lies in its ability to improve both training efficiency and model performance by selecting a more effective subset of data from extremely large datasets. The 98.9% reduction in selection time compared to greedy algorithms is a key contribution, enabling the application of joint learning to trillion-token datasets.
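This summary does not spell out DATAMASK's exact objective or policy parameterization, but the general idea of policy gradient-based joint selection can be illustrated with a minimal REINFORCE-style sketch: a Bernoulli selection mask over documents whose logits are updated toward a reward that combines a quality score with a diversity proxy. Names such as `quality`, `embeddings`, `budget`, and `lam` below are illustrative assumptions, not quantities from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): REINFORCE-style
# selection of a data subset via a Bernoulli mask over documents.
# quality[i]  -- hypothetical per-document quality score in [0, 1]
# embeddings  -- hypothetical document embeddings used as a diversity proxy
rng = np.random.default_rng(0)
n_docs, dim = 1000, 16
quality = rng.random(n_docs)
embeddings = rng.normal(size=(n_docs, dim))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

budget = 200        # target subset size (assumed)
lam = 0.5           # assumed trade-off weight between quality and diversity
lr = 0.5            # learning rate for the selection policy
logits = np.zeros(n_docs)   # one selection logit per document

def reward(mask):
    """Mean quality plus a diversity proxy (low mean pairwise similarity),
    minus a penalty for deviating from the target subset size."""
    idx = np.flatnonzero(mask)
    if idx.size < 2:
        return 0.0
    q = quality[idx].mean()
    sims = embeddings[idx] @ embeddings[idx].T
    off_diag = (sims.sum() - idx.size) / (idx.size * (idx.size - 1))
    size_penalty = abs(idx.size - budget) / budget
    return q + lam * (1.0 - off_diag) - size_penalty

for step in range(200):
    probs = 1.0 / (1.0 + np.exp(-logits))            # Bernoulli selection probabilities
    mask = (rng.random(n_docs) < probs).astype(float)
    r = reward(mask)
    # REINFORCE estimator of the gradient of E[reward] w.r.t. the logits:
    # r * (mask - probs); ascend to increase the expected reward.
    logits += lr * r * (mask - probs)

selected = np.argsort(-logits)[:budget]              # final subset from learned logits
print(f"mean quality of selected subset: {quality[selected].mean():.3f}")
```

The sketch omits variance-reduction baselines and any batching over trillion-token corpora; it is only meant to show how a learned selection mask can trade off quality and diversity without the pairwise comparisons that make greedy selection expensive.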
Key Takeaways
- DATAMASK is a novel framework for joint data selection in LLM pre-training.
- It uses policy gradient-based optimization to efficiently select data based on quality and diversity metrics.
- It reduces selection time by 98.9% compared to greedy algorithms.
- It improves performance on both dense and Mixture-of-Experts (MoE) model architectures.
“DATAMASK achieves significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.”