Stable LLM RL via Dynamic Vocabulary Pruning
Published: Dec 28, 2025 21:44 • 1 min read • ArXiv
Analysis
This paper addresses instability in Reinforcement Learning (RL) for Large Language Models (LLMs) caused by the mismatch between training-time and inference-time token probability distributions. The authors identify the low-probability tokens in the distribution's tail as a major source of this mismatch and of unstable gradient estimates. Their proposed solution, dynamic vocabulary pruning, mitigates the issue by excluding the extreme tail of the vocabulary from the RL objective, leading to more stable training.
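To make the idea concrete, here is a minimal sketch of how such pruning could be applied in a policy-gradient update, assuming a PyTorch-style policy with per-step logits. The thresholding rule and names such as `tail_prob_threshold` and `pruned_log_probs` are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of dynamic vocabulary pruning for a policy-gradient update.
# The thresholding rule and the hyperparameter `tail_prob_threshold` are
# illustrative assumptions, not the paper's exact procedure.
import torch
import torch.nn.functional as F


def pruned_log_probs(logits: torch.Tensor,
                     actions: torch.Tensor,
                     tail_prob_threshold: float = 1e-5) -> torch.Tensor:
    """Log-probs of sampled tokens after dropping the extreme tail of the
    vocabulary and renormalizing over the remaining 'safe' tokens.

    logits:  [batch, vocab]  per-step token logits from the policy
    actions: [batch]         sampled token ids
    """
    probs = F.softmax(logits, dim=-1)
    safe_mask = probs >= tail_prob_threshold  # exclude the extreme tail
    # Always keep the sampled token so its renormalized log-prob is defined.
    safe_mask = safe_mask | F.one_hot(actions, num_classes=logits.size(-1)).bool()
    masked_logits = logits.masked_fill(~safe_mask, float("-inf"))
    log_probs = F.log_softmax(masked_logits, dim=-1)  # renormalized over safe set
    return log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)


def policy_gradient_loss(logits: torch.Tensor,
                         actions: torch.Tensor,
                         advantages: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss computed on the pruned ('safe') vocabulary."""
    logp = pruned_log_probs(logits, actions)
    return -(advantages * logp).mean()
```

The key design point the sketch illustrates is that the pruning is dynamic: the safe set is recomputed from the current per-step distribution rather than fixed in advance, so the excluded tail changes from token to token.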
Key Takeaways
- Addresses the training-inference mismatch problem in LLM RL.
- Identifies the tail of the token probability distribution as a key source of instability.
- Proposes dynamic vocabulary pruning to stabilize training.
- Provides a theoretical bound on the optimization bias introduced by pruning.
Reference
“The authors propose constraining the RL objective to a dynamically-pruned ‘safe’ vocabulary that excludes the extreme tail.”
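One plausible way to write the constrained objective described in the quote, with all notation (policy, safe vocabulary, advantage estimate) assumed for illustration rather than taken from the paper:

```latex
% Illustrative notation (assumed, not quoted from the paper):
% \pi_\theta        -- the policy being trained
% V_t^{\mathrm{safe}} -- the dynamically-pruned "safe" vocabulary at step t
% \hat{A}_t         -- an advantage (or reward) estimate for step t
\begin{align}
  \tilde{\pi}_\theta(y \mid y_{<t})
    &= \frac{\pi_\theta(y \mid y_{<t})}
            {\sum_{y' \in V_t^{\mathrm{safe}}} \pi_\theta(y' \mid y_{<t})},
       \qquad y \in V_t^{\mathrm{safe}}, \\
  \mathcal{J}(\theta)
    &= \mathbb{E}\Big[\sum_t \hat{A}_t \,
       \log \tilde{\pi}_\theta(y_t \mid y_{<t})\Big].
\end{align}
```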