Paper · #llm · 🔬 Research · Analyzed: Jan 3, 2026 19:14

Stable LLM RL via Dynamic Vocabulary Pruning

Published: Dec 28, 2025 21:44
1 min read
arXiv

Analysis

This paper addresses the instability in Reinforcement Learning (RL) for Large Language Models (LLMs) caused by the mismatch between the training and inference probability distributions, which is concentrated in the low-probability tail of the token distribution. The authors identify that these tail tokens contribute disproportionately to the mismatch and destabilize gradient estimation. Their proposed solution, dynamic vocabulary pruning, mitigates the issue by excluding the extreme tail of the vocabulary from the RL objective, leading to more stable training.
Reference

The authors propose constraining the RL objective to a dynamically-pruned "safe" vocabulary that excludes the extreme tail.
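
A minimal PyTorch sketch of this idea, assuming a simple per-position probability threshold p_min as the pruning rule (the summary above does not specify the paper's actual pruning criterion or hyperparameters, so treat this as an illustration rather than the authors' method):

```python
import torch

def prune_logits(logits: torch.Tensor, p_min: float = 1e-5) -> torch.Tensor:
    """Mask tokens whose probability falls below p_min (the 'extreme tail'),
    leaving a dynamically pruned 'safe' vocabulary at each position.

    p_min is an assumed hyperparameter, not taken from the paper.
    """
    probs = torch.softmax(logits, dim=-1)
    keep = probs >= p_min  # per-position "safe" vocabulary mask
    # Guard: never prune everything; always keep the argmax token.
    keep.scatter_(-1, logits.argmax(dim=-1, keepdim=True), True)
    return logits.masked_fill(~keep, float("-inf"))

def pruned_log_prob(logits: torch.Tensor, token_ids: torch.Tensor,
                    p_min: float = 1e-5) -> torch.Tensor:
    """Log-probability of sampled tokens under the renormalized pruned
    distribution, used in place of full-vocabulary log-probs when
    computing the RL (e.g., policy-gradient) objective.

    Assumes token_ids were sampled from the same pruned distribution,
    so no kept token maps to the masked (-inf) region.
    """
    pruned = prune_logits(logits, p_min)
    logp = torch.log_softmax(pruned, dim=-1)  # renormalizes over kept tokens
    return logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)
```

Because both sampling and the loss would then use the same renormalized "safe" distribution, the training and inference distributions agree by construction over the kept tokens, which is the source of the claimed stability.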
