
Andrej Karpathy on Reinforcement Learning from Verifiable Rewards (RLVR)

Published: Dec 19, 2025 23:07
2 min read
Simon Willison

Analysis

This article quotes Andrej Karpathy on the emergence of Reinforcement Learning from Verifiable Rewards (RLVR) as a major new stage in LLM training. Karpathy observes that training LLMs against automatically verifiable rewards, particularly in environments such as math and code puzzles, leads to the spontaneous development of reasoning-like strategies: the models learn to break problems into intermediate calculations and to go back and forth between problem-solving techniques, with the DeepSeek R1 paper cited as an example. Because the reward comes from a mechanically checkable outcome (a correct answer, a passing test) rather than a learned preference model, the approach makes model behavior easier to audit, partially addressing the "black box" criticism of LLM decision-making, and it points toward more robust and reliable AI systems.
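
To make the idea concrete, here is a minimal sketch (not from the article or the DeepSeek R1 paper) of what "automatically verifiable rewards" can look like in the two environments Karpathy names. The reward is a deterministic check (exact-match for a math answer, test execution for code) rather than a learned preference model. The `Answer:` output format, the `solve` function name, and the test-case layout are illustrative assumptions, not conventions specified by any of the cited sources.

```python
import re


def math_reward(completion: str, expected_answer: str) -> float:
    """Reward 1.0 if the completion's final stated answer matches, else 0.0.

    Assumes a hypothetical convention: the model ends its reasoning with
    a line of the form 'Answer: <value>'.
    """
    match = re.search(r"Answer:\s*(\S+)", completion)
    if match is None:
        return 0.0  # no parsable answer means zero reward
    return 1.0 if match.group(1) == expected_answer else 0.0


def code_reward(candidate_src: str, test_cases: list[tuple[tuple, object]]) -> float:
    """Reward = fraction of test cases a generated function passes.

    Assumes the candidate source defines a function named `solve`.
    WARNING: exec() of model output is unsafe outside a sandbox.
    """
    if not test_cases:
        return 0.0
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # in practice, run sandboxed
        solve = namespace["solve"]
        passed = sum(1 for args, want in test_cases if solve(*args) == want)
        return passed / len(test_cases)
    except Exception:
        return 0.0  # crashes, missing function, wrong output: no reward


# Example checks:
print(math_reward("12 * 12 = 144. Answer: 144", "144"))            # 1.0
print(code_reward("def solve(x):\n    return x * x",
                  [((3,), 9), ((4,), 16)]))                         # 1.0
```

In a full RLVR loop, scores like these would be computed over many sampled completions and fed into a policy-gradient update (DeepSeek R1, for instance, uses GRPO); the key property is that the reward requires no human labeling and cannot be satisfied by merely plausible-sounding text.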

Reference

In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples).