Search:
Match:
1 results

Analysis

This paper addresses the challenge of off-policy mismatch in long-horizon LLM reinforcement learning, a critical issue due to implementation divergence and other factors. It derives tighter trust region bounds and introduces Trust Region Masking (TRM) to provide monotonic improvement guarantees, a significant advancement for long-horizon tasks.
Reference

The paper proposes Trust Region Masking (TRM), which excludes entire sequences from gradient computation if any token violates the trust region, providing the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL.