MediEval: A Unified Medical Benchmark for Patient-Contextual and Knowledge-Grounded Reasoning in LLMs
Published: Dec 25, 2025 05:00 · 1 min read · ArXiv NLP
Analysis
This paper introduces MediEval, a benchmark designed to evaluate the reliability and safety of Large Language Models (LLMs) in medical applications. It addresses a gap in existing evaluations by linking electronic health records (EHRs) to a unified knowledge base, enabling systematic assessment of both knowledge grounding and contextual consistency. The benchmark surfaces distinct failure modes, notably hallucinated support (citing evidence that does not exist in the knowledge base) and truth inversion (flipping the correct answer). The proposed Counterfactual Risk-Aware Fine-tuning (CoRFu) method improves both accuracy and safety on these failure modes, suggesting a pathway toward more reliable LLMs in healthcare. Together, the benchmark and the fine-tuning method are valuable contributions toward safer and more trustworthy AI applications in medicine.
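To make the failure-mode taxonomy concrete, here is a minimal sketch of what a MediEval-style evaluation item and scoring rule might look like. All field names, the example data, and the `classify_failure` helper are illustrative assumptions, not taken from the paper; the paper's actual data schema and metrics may differ.

```python
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    """Hypothetical benchmark item: an EHR-derived context paired with
    knowledge-base facts (e.g. UMLS-linked) and a ground-truth label."""
    patient_context: str              # patient summary, e.g. from MIMIC-IV
    claim: str                        # statement the model must judge
    kb_support: list = field(default_factory=list)  # KB facts for this claim
    label: bool = True                # is the claim true for this patient?

def classify_failure(pred: bool, cited_support: list, item: EvalItem) -> str:
    """Map a model's prediction and cited evidence to a coarse failure mode:
    - 'hallucinated support': a citation is absent from the knowledge base
    - 'truth inversion': the model flips the ground-truth label
    - 'correct': prediction and citations both check out
    """
    if any(s not in item.kb_support for s in cited_support):
        return "hallucinated support"
    if pred != item.label:
        return "truth inversion"
    return "correct"

item = EvalItem(
    patient_context="72F, CKD stage 4, on lisinopril",
    claim="Adding spironolactone carries hyperkalemia risk for this patient",
    kb_support=[
        "spironolactone + ACE inhibitor -> hyperkalemia risk",
        "CKD reduces potassium clearance",
    ],
)

print(classify_failure(True, item.kb_support[:1], item))   # -> correct
print(classify_failure(False, item.kb_support, item))      # -> truth inversion
print(classify_failure(True, ["made-up citation"], item))  # -> hallucinated support
```

The point of separating the two checks is that hallucinated support can occur even when the final answer is right, which is why a benchmark grounded in an explicit knowledge base can catch errors that answer-only accuracy misses.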
Key Takeaways
- MediEval provides a standardized benchmark for evaluating LLMs in medical contexts.
- The study identifies critical failure modes in current LLMs, such as hallucinated support and truth inversion.
- CoRFu fine-tuning significantly improves LLM safety and accuracy in medical reasoning.
Reference
“We introduce MediEval, a benchmark that links MIMIC-IV electronic health records (EHRs) to a unified knowledge base built from UMLS and other biomedical vocabularies.”