OmniAgent: Audio-Guided Active Perception for Audio-Video Understanding

Analysis

This paper introduces OmniAgent, a novel approach to audio-visual understanding that moves beyond passive response generation to active multimodal inquiry. It addresses limitations in existing omnimodal models by employing dynamic planning and a coarse-to-fine audio-guided perception paradigm. The agent strategically uses specialized tools, focusing on task-relevant cues, leading to significant performance improvements on benchmark datasets.
Reference / Citation
View Original
"OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10% - 20% accuracy."
A
ArXivDec 29, 2025 17:59
* Cited for critical analysis under Article 32.