Search:
Match:
1 results

Analysis

This paper introduces OmniAgent, a novel approach to audio-visual understanding that moves beyond passive response generation to active multimodal inquiry. It addresses limitations in existing omnimodal models by employing dynamic planning and a coarse-to-fine audio-guided perception paradigm. The agent strategically uses specialized tools, focusing on task-relevant cues, leading to significant performance improvements on benchmark datasets.
Reference

OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy.