Analysis

This article introduces AXIOM, a method for evaluating Large Language Models (LLMs) used as judges of code. It generates test cases through rule-based perturbation and applies multi-source quality calibration to improve the reliability of the evaluation. The work targets LLM-based code evaluation, a critical capability for software development and AI-assisted coding.
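
The article does not reproduce AXIOM's actual perturbation rules, so the sketch below only illustrates the general idea: each rule injects a known, labeled defect into a correct snippet, yielding (original, perturbed) pairs on which a reliable LLM judge should prefer the original. The rule names and regex patterns are illustrative assumptions, not AXIOM's rule set.

# A minimal sketch of rule-based perturbation for building LLM-judge
# test cases. Rules and patterns here are assumptions for illustration.
import re

# Each rule: (name, pattern, replacement). Applying one produces a
# variant that is subtly wrong in a known, labeled way.
PERTURBATION_RULES = [
    ("off_by_one", r"range\(len\((\w+)\)\)", r"range(len(\1) - 1)"),
    ("flip_comparison", r"<=", r"<"),
    ("drop_return_value", r"return (\w+)", r"return None  # was: return \1"),
]

def perturb(code: str) -> list[dict]:
    """Apply each matching rule once, producing one labeled variant per rule."""
    cases = []
    for name, pattern, repl in PERTURBATION_RULES:
        mutated, n = re.subn(pattern, repl, code, count=1)
        if n:  # rule matched: record the (good, bad) pair with its defect label
            cases.append({"rule": name, "original": code, "perturbed": mutated})
    return cases

if __name__ == "__main__":
    snippet = (
        "def max_value(xs):\n"
        "    best = xs[0]\n"
        "    for i in range(len(xs)):\n"
        "        if best <= xs[i]:\n"
        "            best = xs[i]\n"
        "    return best\n"
    )
    for case in perturb(snippet):
        print(case["rule"])
        print(case["perturbed"])

Because every perturbation carries a ground-truth label (which variant is worse and why), a judge's preference accuracy over such pairs gives a direct reliability measure, which is the role perturbation-generated test cases play in the evaluation described above.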