GRPO and DPO for Faithful Chain-of-Thought Reasoning in LLMs

Published: Dec 27, 2025 16:07
ArXiv

Analysis

This paper investigates the faithfulness of Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs), that is, whether the stated reasoning actually reflects how the model arrives at its answer. It highlights that models can produce plausible but misleading justifications, which undermines the reliability of CoT-based methods. The study evaluates Group Relative Policy Optimization (GRPO) and Direct Preference Optimization (DPO) as training approaches for improving CoT faithfulness, and finds GRPO to be the more effective of the two, especially in larger models. This matters because transparent, trustworthy reasoning is a prerequisite for using CoT in safety and alignment work.
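
To make the comparison concrete, below is a minimal sketch (not the paper's code) of the two objectives being contrasted, assuming per-sequence log-probabilities from the policy and a frozen reference model, and a scalar reward per sampled response; the faithfulness-specific reward and preference data construction are the paper's contribution and are not reproduced here.

```python
# Sketch of the GRPO and DPO training losses (standard formulations, hypothetical tensors).
import torch
import torch.nn.functional as F


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: normalize each sampled response's reward
    against the mean/std of its group (GRPO's critic-free baseline)."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def grpo_policy_loss(logp_new: torch.Tensor,
                     logp_old: torch.Tensor,
                     rewards: torch.Tensor,
                     clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate loss driven by group-relative advantages."""
    adv = grpo_advantages(rewards)
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()


def dpo_loss(logp_chosen: torch.Tensor,
             logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: increase the policy's implicit reward margin (relative to the
    frozen reference model) for the chosen over the rejected response."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

The key structural difference the paper exploits is visible here: GRPO optimizes against a graded, group-normalized reward signal over sampled CoTs, while DPO only sees binary chosen/rejected pairs against a reference model.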

Reference

GRPO achieves higher performance than DPO in larger models, with the Qwen2.5-14B-Instruct model attaining the best results across all evaluation metrics.