Stable LLM RL via Dynamic Vocabulary Pruning
Analysis
Key Takeaways
- Addresses the training-inference mismatch problem in LLM RL.
- Identifies the tail of the token probability distribution as a key source of instability.
- Proposes dynamic vocabulary pruning as a solution to stabilize training.
- Offers a theoretical bound on the optimization bias introduced by pruning.
“The authors propose constraining the RL objective to a dynamically pruned ‘safe’ vocabulary that excludes the extreme tail.”
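To make the idea concrete, here is a minimal sketch of one plausible form of dynamic vocabulary pruning: nucleus-style masking that keeps the smallest set of tokens covering a target probability mass and assigns the extreme tail a logit of negative infinity. The function name, the `keep_mass` threshold, and the masking scheme are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def prune_vocab(logits, keep_mass=0.999):
    """Mask the extreme tail of a token distribution (illustrative sketch).

    Keeps the smallest set of tokens whose cumulative probability
    reaches `keep_mass`; all other tokens get -inf logits, so they
    contribute nothing to a downstream RL objective.
    """
    # Softmax with a max-shift for numerical stability.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Sort tokens by descending probability and find the cutoff.
    order = np.argsort(probs)[::-1]
    cum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(cum, keep_mass) + 1]
    # Tail tokens are excluded via -inf logits; kept logits pass through.
    masked = np.full_like(logits, -np.inf)
    masked[keep] = logits[keep]
    return masked

logits = np.array([5.0, 3.0, 1.0, -4.0, -9.0])
pruned = prune_vocab(logits)  # the two lowest-probability tokens are masked
```

Restricting the objective to this pruned set trades a small, bounded optimization bias (the excluded mass is at most `1 - keep_mass`) for gradients that never depend on near-zero-probability tail tokens, which is the stability mechanism the summary describes.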