Information-Theoretic Debiasing for Reward Models
Analysis
This paper addresses a critical problem in Reinforcement Learning from Human Feedback (RLHF): the presence of inductive biases in reward models. These biases, which stem from low-quality training data, can lead to overfitting and reward hacking. The proposed method, DIR (Debiasing via Information optimization for RM), offers a novel information-theoretic approach that mitigates these biases while capturing non-linear correlations between reward scores and bias attributes. Its significance lies in improving the reliability, generalization, and downstream RLHF performance of reward models.
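A minimal way to write down the stated objective, assuming illustrative notation (not the paper's): let r_θ be the reward model, (y⁺, y⁻) a human preference pair for prompt x, b(y) the targeted bias attribute (e.g., response length), and λ a weighting coefficient:

$$
\max_{\theta}\; I\!\big(r_\theta(x, y);\, y^{+} \succ y^{-}\big)\;-\;\lambda\, I\!\big(r_\theta(x, y);\, b(y)\big)
$$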
Key Takeaways
- Addresses the problem of inductive biases in reward models, which can lead to overfitting and reward hacking.
- Proposes a novel information-theoretic debiasing method called DIR (Debiasing via Information optimization for RM).
- DIR maximizes the mutual information (MI) between RM scores and human preference pairs while minimizing the MI between RM outputs and biased attributes (see the sketch after this list).
- Demonstrates effectiveness in mitigating biases related to response length, sycophancy, and format.
- Shows improved RLHF performance and better generalization abilities across diverse benchmarks.
- Provides code and training recipes for reproducibility.
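A minimal sketch of how such an objective might look in practice is given below, assuming a PyTorch reward model trained with a Bradley-Terry preference loss plus a CLUB-style variational upper bound on the MI between reward scores and a bias attribute such as response length. The names (`BiasMIEstimator`, `debiased_rm_loss`) and the coefficient `beta` are illustrative assumptions, not the paper's released code.

```python
# Hypothetical sketch (not the authors' implementation): a reward-model loss that
# keeps preference information via a Bradley-Terry term and suppresses dependence
# on a bias attribute via a CLUB-style variational MI upper bound.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiasMIEstimator(nn.Module):
    """Variational network q(attribute | score) used to estimate I(score; attribute)."""

    def __init__(self, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, score: torch.Tensor):
        # Predict a Gaussian q(attr | score): mean and log-variance.
        mu, log_var = self.net(score.unsqueeze(-1)).chunk(2, dim=-1)
        return mu.squeeze(-1), log_var.squeeze(-1)

    def mi_upper_bound(self, score: torch.Tensor, attr: torch.Tensor) -> torch.Tensor:
        # CLUB-style estimate: log q(a|s) on paired samples minus on shuffled samples
        # upper-bounds I(s; a) when q is well trained.
        mu, log_var = self(score)
        var = log_var.exp()
        pos = -((attr - mu) ** 2) / (2 * var)                                # paired samples
        neg = -((attr[torch.randperm(attr.size(0))] - mu) ** 2) / (2 * var)  # shuffled samples
        return (pos - neg).mean()


def debiased_rm_loss(r_chosen, r_rejected, attr_chosen, attr_rejected,
                     mi_estimator: BiasMIEstimator, beta: float = 0.1):
    """Bradley-Terry preference loss plus a weighted MI penalty on a bias attribute."""
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()   # keep preference information
    scores = torch.cat([r_chosen, r_rejected])
    attrs = torch.cat([attr_chosen, attr_rejected])           # e.g., normalized response lengths
    mi_penalty = mi_estimator.mi_upper_bound(scores, attrs)   # suppress bias information
    return pref_loss + beta * mi_penalty
```

In practice the MI estimator would be trained jointly (or in an inner loop) to keep the bound tight, and `beta` trades off bias suppression against preference accuracy; both details are assumptions here rather than the paper's training recipe.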
“DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities.”