DeliberationBench: Multi-LLM Deliberation Underperforms Baseline, Raising Questions on Complexity
Analysis
Key Takeaways
- Multi-LLM deliberation protocols were benchmarked against a single-output baseline.
- The baseline outperformed every deliberation protocol by a wide margin in win rate.
- Deliberation protocols incurred higher computational costs than the baseline.
“the best-single baseline achieves an 82.5% ± 3.3% win rate, dramatically outperforming the best deliberation protocol (13.8% ± 2.6%)”
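To make the "win rate ± uncertainty" figures in the quote easier to read, here is a minimal sketch of one common way such a number can be computed from pairwise judgments. The source does not say how its ± values were derived, so the binomial standard error used here is an assumption, and the function name and the counts are hypothetical, chosen only for illustration.

```python
import math


def win_rate_with_stderr(wins: int, total: int) -> tuple[float, float]:
    """Return (win rate, binomial standard error), both as fractions.

    Assumption: each comparison is an independent win/loss trial.
    Ties are not modeled in this sketch.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = wins / total
    stderr = math.sqrt(p * (1 - p) / total)
    return p, stderr


# Hypothetical counts for illustration only; not taken from the benchmark.
rate, err = win_rate_with_stderr(wins=30, total=40)
print(f"{rate:.1%} ± {err:.1%}")
```

Under this assumption the uncertainty shrinks with the square root of the number of comparisons, so a gap as large as the one quoted would remain large even with a modest sample.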