KL Regularization in RL Training of LLMs: A Deep Dive
Analysis
Key Takeaways
- The choice of KL divergence estimator used in RL training of LLMs can significantly impact performance (see the sketch below).
- Estimator configurations with biased gradients can lead to training instabilities.
- Unbiased gradient estimators generally lead to better performance.
- KL regularization can stabilize off-policy RL training.
“Using estimator configurations resulting in unbiased gradients leads to better performance on in-domain as well as out-of-domain tasks.”
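To make the first two takeaways concrete, here is a minimal PyTorch sketch of the three per-token Monte Carlo KL estimators commonly used in LLM RL training (often referred to as k1, k2, and k3). The function name, tensor shapes, and toy check are illustrative assumptions, not taken from the paper; the bias comments reflect the standard analysis of these estimators when tokens are sampled from the current policy.

```python
import torch

def kl_estimators(logp_policy: torch.Tensor, logp_ref: torch.Tensor):
    """Per-token Monte Carlo estimators of KL(pi_theta || pi_ref).

    Assumes the tokens were sampled from the current policy pi_theta, so
    logp_policy and logp_ref are log-probabilities of the same sampled tokens
    under the policy and the frozen reference model, respectively.
    """
    log_ratio = logp_policy - logp_ref  # log(pi_theta / pi_ref)

    # k1: unbiased estimate of the KL *value*, but differentiating it directly
    # yields a gradient that is zero in expectation, not the true KL gradient.
    k1 = log_ratio

    # k2: biased estimate of the KL value, yet its direct gradient matches the
    # true KL gradient in expectation.
    k2 = 0.5 * log_ratio ** 2

    # k3: unbiased, non-negative, typically lower-variance estimate of the KL
    # value, but its direct gradient is biased relative to the true KL gradient.
    k3 = torch.exp(-log_ratio) - 1.0 + log_ratio

    return k1, k2, k3


if __name__ == "__main__":
    # Toy check with illustrative shapes: a batch of 4 sequences, 16 tokens each.
    logp_policy = (torch.randn(4, 16) - 2.0).requires_grad_()
    logp_ref = torch.randn(4, 16) - 2.0
    k1, k2, k3 = kl_estimators(logp_policy, logp_ref)
    print(k1.mean().item(), k2.mean().item(), k3.mean().item())
```

The value-bias versus gradient-bias distinction is what makes the configuration choice consequential: whether an estimator is added to the reward or differentiated directly as a loss term determines whether the resulting policy gradient is unbiased, which the takeaways link to stability and downstream performance.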