Enhancing Visual Perception in Vision-Language Models with TWIN Dataset
Published: Dec 29, 2025 16:43
• 1 min read
• ArXiv
Analysis
This paper introduces TWIN, a new training dataset and task designed to improve the fine-grained visual perception of Vision-Language Models (VLMs). The core idea is to train VLMs to distinguish between visually similar images of the same object, forcing them to attend to subtle visual details. The authors report significant improvements on fine-grained recognition tasks and introduce FGVQA, a new benchmark for quantifying these gains. The work addresses a key limitation of current VLMs and makes a practical contribution in the form of a new dataset and training methodology.
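To make the task concrete, the sketch below shows what a TWIN-style discrimination sample could look like when packed into a generic multimodal chat format for fine-tuning. The `TwinExample` schema, field names, prompt wording, and file names are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of a TWIN-style training sample: two visually similar
# images of the same object plus a question that can only be answered by
# attending to a subtle visual difference. Schema and wording are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class TwinExample:
    """One discrimination sample over a pair of near-identical images."""
    image_paths: List[str]  # e.g. two photos of the same landmark
    question: str           # fine-grained question about the pair
    answer: str             # ground-truth answer


def to_chat_messages(example: TwinExample) -> list:
    """Convert a TwinExample into a generic multimodal chat layout that many
    VLM fine-tuning pipelines accept (assumed structure, not a specific API)."""
    content = [{"type": "image", "path": p} for p in example.image_paths]
    content.append({"type": "text", "text": example.question})
    return [
        {"role": "user", "content": content},
        {"role": "assistant", "content": [{"type": "text", "text": example.answer}]},
    ]


if __name__ == "__main__":
    sample = TwinExample(
        image_paths=["landmark_view_a.jpg", "landmark_view_b.jpg"],  # hypothetical files
        question="Which image shows the tower with a maintenance crane on the upper platform?",
        answer="The first image.",
    )
    print(to_chat_messages(sample))
```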
Key Takeaways
- Introduces TWIN, a new dataset and task for improving fine-grained visual perception in VLMs.
- TWIN focuses on distinguishing between visually similar images of the same object.
- Demonstrates significant performance gains on fine-grained recognition tasks.
- Introduces FGVQA, a new benchmark for evaluating fine-grained visual understanding.
- TWIN is designed to be a drop-in addition to existing VLM training corpora (see the sketch after this list).
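Since TWIN is described as a drop-in addition to existing training corpora, a minimal mixing routine like the one below illustrates how such samples might be interleaved with an existing instruction-tuning set. The mixing fraction and shuffling strategy are assumptions chosen for illustration, not the paper's training recipe.

```python
# Minimal sketch of mixing TWIN-style samples into an existing instruction-tuning
# corpus. The target fraction and random interleaving are illustrative assumptions.
import random
from typing import List, Sequence


def mix_corpora(base: Sequence[dict], twin: Sequence[dict],
                twin_fraction: float = 0.1, seed: int = 0) -> List[dict]:
    """Return a shuffled corpus where roughly `twin_fraction` of the samples
    come from the TWIN set (hypothetical ratio, chosen only for illustration)."""
    rng = random.Random(seed)
    # Solve T / (B + T) = twin_fraction for T, the number of TWIN samples to add.
    n_twin = int(len(base) * twin_fraction / max(1e-9, 1.0 - twin_fraction))
    n_twin = min(n_twin, len(twin))
    mixed = list(base) + rng.sample(list(twin), n_twin)
    rng.shuffle(mixed)
    return mixed
```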
Reference
“Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks.”