Information-Theoretic Debiasing for Reward Models
Analysis
Key Takeaways
- Addresses the problem of inductive biases in reward models (RMs), which can lead to overfitting and reward hacking.
- Proposes a novel information-theoretic debiasing method called DIR (Debiasing via Information optimization for RM).
- DIR maximizes the mutual information (MI) between RM scores and human preference pairs while minimizing the MI between RM outputs and biased attributes.
- Demonstrates effectiveness in mitigating biases related to response length, sycophancy, and format.
- Shows improved RLHF performance and better generalization across diverse benchmarks.
- Provides code and training recipes for reproducibility.
“DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities.”
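The two-term objective described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function names are invented here, the preference term is an ordinary Bradley-Terry loss (a common proxy for aligning RM scores with preference labels), and the MI-minimization term is approximated with a squared Pearson correlation between scores and a bias attribute such as response length; DIR's actual MI estimator and `beta` weighting may differ.

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry negative log-likelihood: pushes the RM to score
    # chosen responses above rejected ones, serving as a proxy for
    # maximizing MI between RM scores and preference labels.
    margin = r_chosen - r_rejected
    return float(-np.mean(np.log(1.0 / (1.0 + np.exp(-margin)))))

def bias_penalty(scores, bias_attr):
    # Squared Pearson correlation between RM scores and a biased
    # attribute (e.g. response length) -- a crude stand-in for the
    # MI term that DIR minimizes.
    s = scores - scores.mean()
    b = bias_attr - bias_attr.mean()
    denom = np.sqrt((s ** 2).sum() * (b ** 2).sum()) + 1e-8
    return float(((s * b).sum() / denom) ** 2)

def debiased_rm_loss(r_chosen, r_rejected, bias_attr, beta=0.1):
    # Combined objective: fit human preferences while penalizing
    # statistical dependence on the bias attribute. `beta` (assumed
    # here, not from the paper) trades off the two terms.
    scores = np.concatenate([r_chosen, r_rejected])
    return preference_loss(r_chosen, r_rejected) + beta * bias_penalty(scores, bias_attr)
```

For example, if chosen and rejected responses receive equal scores, the preference term reduces to log 2, and a perfectly length-correlated scorer incurs the maximum penalty of (nearly) 1.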