Enhancing Visual Perception in Vision-Language Models with the TWIN Dataset
Analysis
Key Takeaways
- Introduces TWIN, a new dataset and task for improving fine-grained visual perception in VLMs.
- TWIN focuses on distinguishing between visually similar images of the same object (a schematic example of such a sample follows below).
- Demonstrates significant performance gains on fine-grained recognition tasks.
- Introduces FGVQA, a new benchmark for evaluating fine-grained visual understanding.
- TWIN is designed to be a drop-in addition to existing VLM training corpora.
“Fine-tuning VLMs on TWIN yields notable gains in fine-grained recognition, even on unseen domains such as art, animals, plants, and landmarks.”
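To make the task concrete, here is a minimal sketch of what a TWIN-style training sample might look like. The schema, field names (`TwinSample`, `to_chat_messages`), and the example values are illustrative assumptions, not the paper's actual format; the point is simply that the same question must yield different answers across two near-identical photos of one object.

```python
from dataclasses import dataclass


@dataclass
class TwinSample:
    """Hypothetical TWIN-style example: two near-identical photos of the
    same object plus a question whose answer differs between them.
    Field names are illustrative, not the paper's schema."""
    image_a: str   # path/URL to the first photo of the object
    image_b: str   # path/URL to a second, visually similar photo
    question: str  # probes a fine-grained difference between the photos
    answer_a: str  # correct answer when the model sees image_a
    answer_b: str  # correct answer when the model sees image_b


def to_chat_messages(sample: TwinSample, use_first: bool) -> list[dict]:
    """Format one side of the pair as a generic image+text chat turn,
    the shape most VLM fine-tuning pipelines expect."""
    image = sample.image_a if use_first else sample.image_b
    answer = sample.answer_a if use_first else sample.answer_b
    return [
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": sample.question},
        ]},
        {"role": "assistant", "content": answer},
    ]


# The same question gets a different gold answer for each twin image,
# so the model cannot succeed without fine-grained visual evidence.
sample = TwinSample(
    image_a="statue_front.jpg",
    image_b="statue_side.jpg",
    question="From which side is the statue photographed?",
    answer_a="From the front.",
    answer_b="From the side.",
)
print(to_chat_messages(sample, use_first=True))
print(to_chat_messages(sample, use_first=False))
```

Pairing the two sides of each twin in the same training corpus is what distinguishes this setup from ordinary VQA data: a model that shortcuts on object identity alone gets one of the two answers wrong by construction.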