AXIOM: Benchmarking LLM-as-a-Judge for Code via Rule-Based Perturbation and Multisource Quality Calibration
Published: Dec 23, 2025 08:39 · 1 min read · ArXiv
Analysis
This article introduces AXIOM, a benchmark for evaluating large language models (LLMs) that act as judges of code. AXIOM applies rule-based perturbations to code to construct test cases with known flaws, and uses multisource quality calibration to make the resulting evaluation more reliable. The work targets LLM-based code evaluation, a critical area for software development and AI-assisted coding.
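To make the idea of rule-based perturbation concrete, here is a minimal sketch, assuming simple regex-based rules: small, deterministic edits are applied to a correct snippet so that a judge can be tested on whether it detects the injected flaw. The rule names, patterns, and example code below are illustrative assumptions, not AXIOM's actual rule set.

```python
# Hedged sketch of rule-based perturbation: apply small, deterministic
# edits to correct code to produce flawed variants for judge test cases.
# The rules below are illustrative, not the paper's actual rules.
import re

# Each rule: (name, pattern to find, faulty replacement).
PERTURBATION_RULES = [
    ("off_by_one", r"range\(len\((\w+)\)\)", r"range(len(\1) - 1)"),
    ("flip_comparison", r"<=", r"<"),
    ("drop_return_value", r"return (\w+)\n", r"return None\n"),
]

def perturb(code: str):
    """Yield (rule_name, perturbed_code) for every rule that applies."""
    for name, pattern, replacement in PERTURBATION_RULES:
        perturbed, count = re.subn(pattern, replacement, code, count=1)
        if count:  # only emit variants where the rule actually changed the code
            yield name, perturbed

original = """def total(xs):
    s = 0
    for i in range(len(xs)):
        s += xs[i]
    return s
"""

for rule, variant in perturb(original):
    print(f"--- {rule} ---")
    print(variant)
```

Each emitted variant pairs a known defect label with the perturbed code, which is what lets the benchmark check whether an LLM judge penalizes the right snippets for the right reasons.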
Key Takeaways
- AXIOM is a new benchmarking method for evaluating LLMs as code judges.
- It uses rule-based perturbation to generate test cases.
- It employs multisource quality calibration to improve evaluation reliability; a sketch of this idea follows the list.
- The research focuses on LLMs for code evaluation, a key area for AI-assisted coding.
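As a rough illustration of multisource quality calibration, the sketch below fuses several noisy quality signals into a single calibrated reference score and then measures how closely an LLM judge tracks it. The sources, weights, and agreement metric are assumptions for illustration, not the paper's actual procedure.

```python
# Hedged sketch of multisource quality calibration: combine per-source
# quality scores (all assumed in [0, 1]) into a reference score, then
# compare the LLM judge's score against that reference.
from statistics import mean

def calibrated_score(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted fusion of per-source quality scores."""
    total_weight = sum(weights[name] for name in signals)
    return sum(weights[name] * value for name, value in signals.items()) / total_weight

# Hypothetical sources: unit tests, a static analyzer, and a human rating.
weights = {"unit_tests": 0.5, "static_analysis": 0.2, "human_rating": 0.3}

samples = [
    {"signals": {"unit_tests": 1.0, "static_analysis": 0.8, "human_rating": 0.9}, "judge": 0.85},
    {"signals": {"unit_tests": 0.2, "static_analysis": 0.6, "human_rating": 0.3}, "judge": 0.70},
]

# Mean absolute gap between the judge's score and the calibrated reference.
errors = [abs(calibrated_score(s["signals"], weights) - s["judge"]) for s in samples]
print(f"mean |judge - reference| = {mean(errors):.3f}")
```

The design choice here is that no single source (tests, static analysis, or humans) is trusted alone; fusing them yields a more stable reference against which judge reliability can be measured.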