Composite Score for LLM Reliability
Analysis
This paper addresses a critical issue in the deployment of Large Language Models (LLMs): their reliability. Moving beyond accuracy alone, it tackles calibration, robustness, and uncertainty quantification. The proposed Composite Reliability Score (CRS) unifies these dimensions in a single framework, offering a more comprehensive and interpretable metric than today's fragmented evaluations. This matters all the more as LLMs are increasingly used in high-stakes domains.
Key Takeaways
- Introduces the Composite Reliability Score (CRS) as a unified metric for LLM reliability.
- Integrates calibration, robustness, and uncertainty quantification into a single score.
- Evaluates ten open-source LLMs across five QA datasets.
- CRS provides stable model rankings and reveals hidden failure modes.
- Highlights the importance of balancing accuracy, robustness, and calibrated uncertainty for dependable LLMs.
“The Composite Reliability Score (CRS) delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.”
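To make the aggregation concrete, here is a minimal sketch of how a composite reliability score of this kind could be assembled from per-example predictions. The component definitions (clean accuracy, accuracy retained under perturbation, 1 − ECE for calibration, confidence AUROC for uncertainty quality) and the equal weighting are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-weighted gap between mean confidence and accuracy."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return ece

def confidence_auroc(confidences, correct):
    """AUROC of confidence as a correctness discriminator (Mann-Whitney U)."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=bool)
    pos, neg = conf[corr], conf[~corr]
    if pos.size == 0 or neg.size == 0:
        return 0.5  # degenerate case: all answers correct or all incorrect
    wins = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return wins + 0.5 * ties

def composite_reliability_score(clean_acc, perturbed_acc, confidences, correct,
                                weights=(0.25, 0.25, 0.25, 0.25)):
    """Illustrative composite: weighted mean of four components in [0, 1].

    accuracy    - task accuracy on clean inputs
    robustness  - fraction of clean accuracy retained under perturbation
    calibration - 1 - ECE (higher means better-calibrated confidence)
    uncertainty - AUROC of confidence for separating right from wrong answers
    """
    robustness = min(perturbed_acc / clean_acc, 1.0) if clean_acc > 0 else 0.0
    components = np.array([
        clean_acc,
        robustness,
        1.0 - expected_calibration_error(confidences, correct),
        confidence_auroc(confidences, correct),
    ])
    return float(np.dot(weights, components))

# Example: a model with 78% clean accuracy, 71% perturbed accuracy,
# and per-answer confidences paired with correctness labels.
crs = composite_reliability_score(
    clean_acc=0.78, perturbed_acc=0.71,
    confidences=[0.92, 0.61, 0.85, 0.40, 0.73],
    correct=[1, 1, 0, 0, 1],
)
print(f"CRS = {crs:.3f}")
```

Equal weights treat the four dimensions as equally important; in a real deployment the weights would be tuned to the application's risk profile, and the paper's actual aggregation may differ.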