Rethinking Fine-Tuned Language Models for Vulnerability Repair
Published: Dec 27, 2025 16:12 • 1 min read • ArXiv
Analysis
This paper investigates the limitations of fine-tuned language models for automated vulnerability repair (AVR). It identifies three recurring problems: models overfitting to their training data, train/validation/test splits that are not mutually exclusive, and match-based evaluation metrics that fail to show whether a generated patch actually fixes the vulnerability. The study's significance lies in its critical assessment of current AVR techniques and in its proposal of a new benchmark, L-AVRBench, intended to improve both evaluation and our understanding of what these models can do.
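To make the split-overlap problem concrete, below is a minimal sketch (not from the paper) of how one might measure leakage between a train and test set by fingerprinting whitespace-normalized code samples. The function names, the normalization scheme, and the toy data are all illustrative assumptions, not the paper's method.

```python
import hashlib

def fingerprint(code: str) -> str:
    """Hash of the sample with all whitespace removed, so trivially
    reformatted duplicates still collide."""
    canonical = "".join(code.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def split_overlap(train: list[str], test: list[str]) -> float:
    """Fraction of test samples whose fingerprint also appears in train."""
    train_fps = {fingerprint(c) for c in train}
    leaked = sum(fingerprint(c) in train_fps for c in test)
    return leaked / len(test) if test else 0.0

# Toy example: the first test sample is a whitespace variant of a
# training sample, so this "unseen" test set is 50% leaked.
train = ["int add(int a, int b) { return a + b; }"]
test = [
    "int add(int a,int b){ return a+b; }",
    "int sub(int a, int b) { return a - b; }",
]
print(split_overlap(train, test))  # 0.5
```

A nonzero overlap here means reported test accuracy partly measures memorization rather than repair ability; real deduplication pipelines typically go further (AST or token-level canonicalization), which this sketch deliberately omits.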
Key Takeaways
- Current AVR models may overfit to their training data.
- Existing evaluation results may be inflated by overlap between training and test sets (a minimal overlap check is sketched above).
- Match-based metrics may not accurately reflect true repair capability (see the sketch after this list).
- The paper introduces a new benchmark, L-AVRBench, for improved evaluation.
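As a worked illustration of the metrics point, the sketch below shows how an exact-match metric scores a semantically equivalent patch as a failure. The code snippets and the `exact_match` helper are hypothetical examples, not taken from the paper or from L-AVRBench.

```python
def exact_match(prediction: str, reference: str) -> bool:
    """Match-based metric: a patch gets credit only if it is
    textually identical to the reference fix."""
    return prediction.strip() == reference.strip()

# Reference fix for an out-of-bounds write, and a model patch that
# expresses the same bounds guard with the opposite comparison.
reference = "if (idx >= buf_len) return -1;"
prediction = "if (!(idx < buf_len)) return -1;"  # logically equivalent guard

print(exact_match(prediction, reference))  # False: scored as a failed repair
```

Both guards reject exactly the same indices, yet exact match reports a failed repair; a behavioral check (compiling the patched program and running regression and exploit tests) would credit it, which is the gap the paper's critique of match-based metrics points at.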
Reference
“State-of-the-art models often overfit to the training set and are evaluated using training, validation, and test sets that are not mutually exclusive.”