Andrej Karpathy on Reinforcement Learning from Verifiable Rewards (RLVR)

Research #llm 📝 Blog | Analyzed: December 25, 2025 13:22

Published: December 19, 2025 23:07

2 min read

Analysis

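This article cites Andrej Karpathy's view that Reinforcement Learning from Verifiable Rewards (RLVR) is a major development in the LLM field. Karpathy argues that training LLMs against automatically verifiable rewards, especially in environments such as math and code puzzles, leads them to spontaneously develop strategies that resemble reasoning. These strategies include breaking problems down into intermediate calculations and applying a range of problem-solving techniques. The DeepSeek R1 paper is cited as a source of examples. The approach represents a shift toward more verifiable and interpretable AI, and it could ease the problem of "black box" decision-making in LLMs. A focus on verifiable rewards may lead to more robust and reliable AI systems.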
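To make "automatically verifiable reward" concrete, here is a minimal sketch in Python of what such a reward could look like for a math puzzle. It is an illustration only, not the method from the DeepSeek R1 paper or from Karpathy's post. The function names (`extract_final_answer`, `verifiable_reward`) and the rule "the last number in the completion counts as the final answer" are assumptions made for this example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last integer or decimal found in the completion, if any.

    Assumption for this sketch: the final number the model writes is its answer.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, expected: str) -> float:
    """Hypothetical binary reward: 1.0 if the extracted answer equals the
    known-correct answer numerically, otherwise 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(expected) else 0.0


if __name__ == "__main__":
    # Intermediate steps are not scored directly; only the final answer is checked.
    sample = "17 * 3 = 51, then 51 + 4 = 55. Answer: 55"
    print(verifiable_reward(sample, "55"))
```

Because the check is a plain program rather than a human judgment, it can be run on very large numbers of sampled completions. In an RL loop, a score like this would be the signal used to update the model. The reward only inspects the final answer, so any intermediate calculations are something the model is free to discover on its own.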

Quote / Source

"In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like "reasoning" to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples)."

Simon Willison, December 19, 2025 23:07

* Quoted lawfully under Article 32 of the Copyright Act.
