VL4Gaze: Unleashing Vision-Language Models for Gaze Following
Analysis
This paper introduces VL4Gaze, a new large-scale benchmark for evaluating and training vision-language models (VLMs) on gaze understanding. The absence of such benchmarks has hindered exploration of gaze interpretation capabilities in VLMs. VL4Gaze addresses this gap with a comprehensive dataset of question-answer pairs that probe several aspects of gaze understanding: object description, direction description, point location, and ambiguous question recognition. The study shows that existing VLMs struggle with gaze understanding without specific training, but that performance improves substantially after fine-tuning on VL4Gaze. This finding underscores the necessity of targeted supervision for developing gaze understanding in VLMs, and the benchmark's multi-task design makes it a valuable resource for future research in this area.
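The summary does not describe the benchmark's data schema, but a multi-task question-answer benchmark of this kind is typically stored as image-grounded records tagged by task type. The sketch below is purely illustrative; the field names, task labels, and coordinate convention are assumptions, not the authors' actual format.

```python
from dataclasses import dataclass

# Illustrative task labels mirroring the four abilities VL4Gaze is described
# as probing; the exact labels used in the benchmark are assumptions.
GAZE_TASKS = (
    "object_description",      # what object is the person looking at?
    "direction_description",   # which way is the person looking?
    "point_location",          # where (x, y) does the gaze land?
    "ambiguous_recognition",   # is the question answerable from the image?
)

@dataclass
class GazeQA:
    """One hypothetical VL4Gaze-style record: an image, a gaze question,
    its task type, and the reference answer."""
    image_path: str
    task: str        # one of GAZE_TASKS
    question: str
    answer: str

# Example record for the point-location task (values are made up).
sample = GazeQA(
    image_path="images/000123.jpg",
    task="point_location",
    question="At which point in the image is the person on the left looking?",
    answer="(0.62, 0.41)",  # normalized (x, y); an assumed convention
)
```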
Key Takeaways
- VL4Gaze is a new benchmark for gaze understanding in VLMs.
- Existing VLMs struggle with gaze understanding without specific training.
- Fine-tuning on VL4Gaze significantly improves performance.
“...training on VL4Gaze brings substantial and consistent improvements across all tasks, highlighting the importance of targeted multi-task supervision for developing gaze understanding capabilities”
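The summary does not state which metrics back the quoted improvements, but gaze-following work commonly scores the point-location task by the distance between the predicted and ground-truth gaze points in normalized image coordinates. A minimal sketch of such a metric, under that assumed convention:

```python
import math

def gaze_point_error(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Euclidean distance between predicted and ground-truth gaze points,
    both in normalized [0, 1] image coordinates (assumed convention)."""
    return math.dist(pred, gt)

# Example: a prediction landing close to the target point.
print(gaze_point_error((0.60, 0.40), (0.62, 0.44)))  # ~0.045
```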