Synthetic Bootstrapped Pretraining
Published: Dec 16, 2025 · 1 min read · Apple ML
Analysis
This article introduces Synthetic Bootstrapped Pretraining (SBP), a language model pretraining method from Apple ML. SBP improves on standard pretraining by modeling inter-document correlations, which conventional approaches largely ignore. The method proceeds in two steps: first learn a model of the relationships between documents in the pretraining corpus, then use that model to synthesize a larger corpus for joint training. By capturing these richer cross-document relationships, SBP can potentially yield more effective language models than pretraining on the original corpus alone.
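One way to formalize the contrast (an interpretive sketch with assumed notation, not the paper's own): standard pretraining minimizes a next-token loss within a single document, while SBP additionally fits a document-level conditional over related pairs and samples from it to grow the training corpus.

```latex
% Interpretive formalization; symbols are assumptions, not the paper's notation.
% Standard pretraining: next-token prediction within one document d = (x_1, ..., x_T).
% SBP's extra step: a synthesizer p_phi(d' | d) over related document pairs,
% sampled to produce synthetic documents for joint training.
\begin{aligned}
\mathcal{L}_{\text{std}}(\theta)   &= -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}), \\
\mathcal{L}_{\text{synth}}(\phi)   &= -\log p_\phi(d' \mid d), \qquad (d, d') \text{ a related document pair.}
\end{aligned}
```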
Key Takeaways
- SBP is a new language model pretraining method developed by Apple ML.
- SBP focuses on modeling inter-document correlations to improve performance.
- SBP uses a two-step process: learning a model of document relationships, then synthesizing a larger corpus for joint training (a toy sketch of this pipeline follows the list).
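The sketch below is a toy, self-contained rendering of that two-step recipe, not Apple's implementation: it mines related document pairs with a bag-of-words similarity, substitutes a trivial lookup for the learned synthesizer of p(doc2 | doc1), and mixes the generated documents back into the training set. Every function name, the similarity measure, and the 0.3 threshold are illustrative assumptions.

```python
# Hypothetical SBP-style data pipeline sketch; names and thresholds are illustrative.
import math
from collections import Counter
from typing import Callable, Iterable


def bag_of_words(doc: str) -> Counter:
    """Toy document representation; a real pipeline would use dense embeddings."""
    return Counter(doc.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def find_related_pairs(corpus: list[str], threshold: float = 0.3) -> list[tuple[str, str]]:
    """Step 1: mine related document pairs (brute force here; ANN search at scale)."""
    reps = [bag_of_words(d) for d in corpus]
    return [
        (corpus[i], corpus[j])
        for i in range(len(corpus))
        for j in range(len(corpus))
        if i != j and cosine(reps[i], reps[j]) >= threshold
    ]


def train_synthesizer(pairs: list[tuple[str, str]]) -> Callable[[str], str]:
    """Step 2: model p(doc2 | doc1). A lookup stands in for a conditional LM."""
    continuations = {src: tgt for src, tgt in pairs}
    return lambda seed: continuations.get(seed, seed)


def synthesize_corpus(synthesizer: Callable[[str], str], seeds: Iterable[str]) -> list[str]:
    """Step 3: generate new documents conditioned on seed documents."""
    return [synthesizer(seed) for seed in seeds]


if __name__ == "__main__":
    corpus = [
        "gradient descent updates model weights using gradients",
        "stochastic gradient descent updates weights with sampled gradients",
        "transformers use attention over token sequences",
    ]
    pairs = find_related_pairs(corpus)
    synthesizer = train_synthesizer(pairs)
    synthetic = synthesize_corpus(synthesizer, corpus)
    joint_training_set = corpus + synthetic  # joint pretraining on real + synthetic data
    print(f"{len(pairs)} related pairs, {len(joint_training_set)} joint-training documents")
```

In the actual method the synthesizer would be a language model trained on the mined pairs and sampled at scale; the sketch only illustrates the data flow from real corpus, to related pairs, to synthetic documents, to the joint training set.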
Reference
“While the standard pretraining teaches LMs to learn causal correlations among tokens within a single document, it is not designed to efficiently model the rich, learnable inter-document correlations that can potentially lead to better performance.”