Training Data Optimization for LLM Code Generation: An Empirical Study
Published: Dec 31, 2025 02:30
•ArXiv
Analysis
This paper systematically evaluates training data optimization techniques for improving LLM-based code generation. It is significant because it provides empirical evidence on how effective the individual techniques and their combinations are, offering practical guidance for researchers and practitioners. The large-scale evaluation across multiple benchmarks and LLMs strengthens the credibility of its findings.
Key Takeaways
- Data synthesis is the most effective technique for improving functional correctness and reducing code smells.
- Data synthesis combined with data refactoring achieves the strongest overall performance.
- Most combinations of techniques do not further improve functional correctness but can enhance code quality (fewer code smells and better maintainability).
Reference
“Data synthesis is the most effective technique for improving functional correctness and reducing code smells.”