Why are we still training Reward Models when LLM-as-a-Judge is at its peak?
Analysis
The article examines whether training separate Reward Models (RMs) for Reinforcement Learning from Human Feedback (RLHF) is still worthwhile now that LLM-as-a-Judge techniques, built on models like Gemini Pro and GPT-4, offer strong evaluation capabilities out of the box. Its conclusion is that for practical RL training, separately trained Reward Models remain important despite the cost of data cleaning and tuning.
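As background for the RM-versus-judge contrast, the standard way a separate Reward Model is trained is on human preference pairs with a Bradley-Terry style loss: the model is penalized unless it scores the chosen response above the rejected one. A minimal sketch of that objective (function name is illustrative, not from the article):

```python
import math

def pairwise_rm_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss for reward model training:
    -log sigmoid(r_chosen - r_rejected).
    The loss is small when the chosen response is scored well above
    the rejected one, and large when the ranking is wrong."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wide correct margin yields a lower loss than a narrow one.
strong = pairwise_rm_loss(2.0, -1.0)
weak = pairwise_rm_loss(0.1, 0.0)
```

An LLM-as-a-Judge setup skips this training step and prompts a general model to emit the score directly, which is exactly the trade-off the article questions.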
Key Takeaways
"Given the high evaluation capabilities of Gemini Pro, is it necessary to train individual Reward Models (RMs) even with tedious data cleaning and parameter adjustments? Wouldn't it be better to have the LLM directly determine the reward?"