NL2Repo-Bench: Evaluating Long-Horizon Code Generation Agents
Research · Agent
Analyzed: Jan 10, 2026 11:23
Published: Dec 14, 2025 15:12
1 min read · ArXiv Analysis
This ArXiv paper introduces NL2Repo-Bench, a benchmark for evaluating coding agents on long-horizon tasks: generating complete, complex software repositories rather than isolated code snippets.
Key Takeaways
- NL2Repo-Bench is designed to evaluate long-horizon code generation.
- The benchmark targets full repository generation, which demands broader capabilities than producing short code snippets.
- The paper is published on ArXiv, suggesting early-stage research.
Reference / Citation
"NL2Repo-Bench aims to evaluate coding agents."