M2G-Eval: A Multi-Granularity Benchmark for Code Generation Evaluation

Published: Dec 27, 2025 16:00
1 min read
ArXiv

Analysis

This paper introduces M2G-Eval, a benchmark designed to evaluate the code generation capabilities of LLMs across four granularities (Class, Function, Block, and Line) and 18 programming languages. It addresses a significant gap in existing benchmarks, which typically focus on a single granularity and a limited set of languages. The multi-granularity design allows a more nuanced view of model strengths and weaknesses, while human-annotated test instances and contamination control improve the reliability of the evaluation. The paper's findings highlight performance differences across granularities, language-specific variations, and cross-language correlations, offering useful guidance for future research and model development.
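To make the granularity idea concrete, below is a minimal sketch (not the paper's actual data format or pipeline) of how a single source file could yield completion tasks at the four levels M2G-Eval evaluates. All names, spans, and the `Task` structure are illustrative assumptions.

```python
# Illustrative sketch only: carving one toy source file into completion tasks
# at the four granularities (Line, Block, Function, Class). The real benchmark
# construction is more involved; this just shows what the levels mean.
from dataclasses import dataclass


@dataclass
class Task:
    granularity: str   # "Class", "Function", "Block", or "Line"
    context: str       # code shown to the model, with the target span removed
    target: str        # the reference span the model must generate


SOURCE = """\
class Stack:
    def __init__(self):
        self.items = []

    def push(self, item):
        self.items.append(item)

    def pop(self):
        if not self.items:
            raise IndexError("pop from empty stack")
        return self.items.pop()
"""


def mask_span(source: str, granularity: str, target: str) -> Task:
    """Cut the target span out of the source to form the model's context."""
    assert target in source, "target must be a verbatim span of the source"
    return Task(granularity, source.replace(target, "<COMPLETE HERE>", 1), target)


# Progressively larger spans: one statement, a block, a whole method, the class.
tasks = [
    mask_span(SOURCE, "Line", "        self.items.append(item)\n"),
    mask_span(
        SOURCE,
        "Block",
        "        if not self.items:\n"
        '            raise IndexError("pop from empty stack")\n',
    ),
    mask_span(
        SOURCE,
        "Function",
        "    def pop(self):\n"
        "        if not self.items:\n"
        '            raise IndexError("pop from empty stack")\n'
        "        return self.items.pop()\n",
    ),
    # Class level: the whole class is the target; the context reduces to the
    # placeholder, standing in for a prompt given mostly in natural language.
    mask_span(SOURCE, "Class", SOURCE),
]

for t in tasks:
    print(f"{t.granularity:<8} target spans {len(t.target.splitlines())} line(s)")
```

Running the sketch prints how many lines each target spans, which mirrors the intuition behind the granularity hierarchy: larger spans require the model to plan and generate more structure at once.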

Reference

The paper reveals an apparent difficulty hierarchy, with Line-level tasks being the easiest and Class-level tasks the most challenging.