ResponseRank: Learning Preference Strength for RLHF
Analysis
Key Takeaways
- Proposes ResponseRank, a method for learning preference strength from noisy signals in RLHF.
- Uses relative differences in proxy signals (response times, annotator agreement) to rank responses.
- Demonstrates improved sample efficiency and robustness across synthetic, language modeling, and RL control tasks.
- Introduces the Pearson Distance Correlation (PDC) metric for evaluating utility learning.
“ResponseRank robustly learns preference strength by leveraging locally valid relative strength signals.”
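A plausible reading of the PDC metric is the Pearson correlation between pairwise distances of learned utilities and pairwise distances of ground-truth utilities: a learned utility function is good if responses that are far apart in true utility are also far apart in predicted utility. The sketch below implements that reading; the function name and the exact definition are assumptions, not taken from the paper.

```python
import numpy as np

def pearson_distance_correlation(u_pred, u_true):
    """Hypothetical PDC: Pearson correlation between the pairwise
    absolute utility differences of predictions and ground truth.

    This is an assumed definition for illustration, not the paper's exact metric.
    """
    u_pred = np.asarray(u_pred, dtype=float)
    u_true = np.asarray(u_true, dtype=float)
    # All unordered pairs (i, j) with i < j.
    i, j = np.triu_indices(len(u_pred), k=1)
    d_pred = np.abs(u_pred[i] - u_pred[j])  # predicted pairwise distances
    d_true = np.abs(u_true[i] - u_true[j])  # ground-truth pairwise distances
    # Pearson correlation of the two distance vectors.
    return np.corrcoef(d_pred, d_true)[0, 1]
```

Under this definition, any positive affine transform of the true utilities scores a perfect 1.0, which matches the intuition that utility functions are only identified up to scale and shift in preference learning.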