M2G-Eval: A Multi-Granularity Benchmark for Code Generation Evaluation
Analysis
This paper introduces M2G-Eval, a benchmark designed to evaluate the code generation capabilities of LLMs at four granularities (Class, Function, Block, and Line) across 18 programming languages. It addresses a significant gap left by existing benchmarks, which typically target a single granularity and a limited set of languages. Evaluating at multiple granularities yields a more nuanced picture of model strengths and weaknesses, while human-annotated test instances and contamination control strengthen the reliability of the evaluation. The paper's findings on performance differences across granularities, language-specific variation, and cross-language correlations offer useful guidance for future research and model development.
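To make the multi-granularity setup concrete, the sketch below shows one way such an evaluation harness could be organized: each task carries a target language, a granularity level, and human-annotated reference tests, and a model completion passes if the filled-in program runs those tests successfully. The `Granularity` and `Task` types, the `<FILL_ME>` placeholder, and the `evaluate` function are illustrative assumptions, not M2G-Eval's actual schema or code.

```python
# Minimal sketch of a multi-granularity evaluation harness.
# All names and the task format below are assumptions for illustration,
# not the benchmark's real data schema.
import subprocess
import tempfile
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class Granularity(Enum):
    CLASS = "class"
    FUNCTION = "function"
    BLOCK = "block"
    LINE = "line"


@dataclass
class Task:
    language: str            # e.g. "python" or "java"; one of the benchmark's 18 languages
    granularity: Granularity # scope of the span the model must generate
    prompt: str              # surrounding code with the target span replaced by <FILL_ME>
    reference_tests: str     # human-annotated tests exercising the completed code


def evaluate(task: Task, completion: str, runner_cmd: list[str]) -> bool:
    """Insert the model completion into the prompt and run the reference tests.

    `runner_cmd` is a per-language command (e.g. ["python"] or ["node"]); a real
    harness would also sandbox execution and isolate per-language toolchains.
    """
    program = task.prompt.replace("<FILL_ME>", completion) + "\n" + task.reference_tests
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "candidate_source"
        src.write_text(program)
        result = subprocess.run(runner_cmd + [str(src)], capture_output=True, timeout=30)
    return result.returncode == 0
```

Keeping the granularity as an explicit field lets the same pass/fail harness report results per scope (Line, Block, Function, Class), which is how the difficulty hierarchy described below can be surfaced.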
Key Takeaways
- M2G-Eval is a new benchmark for evaluating code generation in LLMs across multiple granularities and languages.
- The benchmark reveals performance differences across code granularities.
- The study highlights the challenges of generating complex, long-form code.
- The findings suggest that models learn transferable programming concepts.
“The paper reveals an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging.”