ResponseRank: Learning Preference Strength for RLHF

Published: Dec 31, 2025 18:21
1 min read
ArXiv

Analysis

This paper introduces ResponseRank, a method for improving the efficiency and robustness of Reinforcement Learning from Human Feedback (RLHF). It addresses the limitations of binary preference feedback by inferring preference strength from noisy auxiliary signals such as response times and annotator agreement. The core contribution is to exploit relative differences in these signals, which are meaningful within local comparison groups even if they are not comparable globally, to rank responses by preference strength, yielding more effective reward modeling and improved performance across tasks. The focus on data efficiency and robustness is particularly relevant to training large language models.
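
To make the idea concrete, the sketch below shows one way such signals could feed a reward-model objective: a heuristic strength score derived from locally normalized response time and annotator agreement, used to weight a standard Bradley-Terry preference loss. The function names, the specific time/agreement heuristic, and the weighting scheme are illustrative assumptions for this summary, not the paper's actual estimator or loss.

```python
import torch
import torch.nn.functional as F

def preference_strength(resp_time_s, agreement_frac, group_median_time):
    """Heuristic strength score for one preference pair (illustrative only).

    Faster-than-typical decisions and higher annotator agreement are treated
    as evidence of a stronger preference. Response time is normalized against
    the median of its local group (e.g., the same annotator or prompt batch),
    so only the *relative* signal is used, not the raw value.
    """
    rel_time = resp_time_s / max(group_median_time, 1e-6)
    time_signal = 1.0 / (1.0 + rel_time)  # in (0, 1); higher = faster than typical
    return 0.5 * time_signal + 0.5 * agreement_frac

def strength_weighted_bt_loss(r_chosen, r_rejected, strength):
    """Bradley-Terry reward-model loss, with each pair scaled by its inferred strength.

    r_chosen, r_rejected: reward-model scores, shape (batch,)
    strength: inferred preference strength in [0, 1], shape (batch,)
    """
    margin = r_chosen - r_rejected
    per_pair = -F.logsigmoid(margin)  # standard pairwise preference term
    return (strength * per_pair).mean()

# Toy usage with made-up numbers.
scores_chosen = torch.tensor([1.2, 0.3, 0.9])
scores_rejected = torch.tensor([0.4, 0.2, 1.1])
strengths = torch.tensor([
    preference_strength(3.0, 0.9, group_median_time=6.0),    # fast, high agreement
    preference_strength(12.0, 0.55, group_median_time=6.0),  # slow, low agreement
    preference_strength(6.0, 0.7, group_median_time=6.0),
])
print(strength_weighted_bt_loss(scores_chosen, scores_rejected, strengths).item())
```

Weighting (rather than hard filtering) keeps ambiguous pairs in the training signal while letting confidently judged pairs dominate, which is one plausible way to realize the paper's emphasis on robustness to noisy strength signals.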

Reference

ResponseRank robustly learns preference strength by leveraging locally valid relative strength signals.