JavisGPT: Unified MLLM for Audio-Video Understanding and Generation

Published: Dec 28, 2025 12:25
ArXiv

Analysis

This paper introduces JavisGPT, a multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. Its significance lies in a unified architecture, a SyncFusion module for spatio-temporal fusion, and learnable queries that connect the model to a pretrained generator. The accompanying large-scale instruction dataset, JavisInst-Omni, with over 200K dialogues, is central to training and evaluating the model. The paper advances the state of the art in understanding and generating content from combined audio and video inputs, especially in complex, temporally synchronized scenarios.
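The summary does not describe the implementation, but the "learnable queries" idea generally means a fixed set of trainable vectors that cross-attend over variable-length fused audio-video features and emit a fixed-length summary for a downstream generator. A minimal NumPy sketch of that mechanism (all names, dimensions, and weight shapes are hypothetical, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 64           # hidden size (hypothetical)
num_queries = 8  # fixed-length interface to the generator (hypothetical)

# Learnable queries: in training these would be optimized jointly with the
# projection weights; here they are just randomly initialized.
queries = rng.normal(size=(num_queries, d))
W_q = rng.normal(size=(d, d)) * d ** -0.5
W_k = rng.normal(size=(d, d)) * d ** -0.5
W_v = rng.normal(size=(d, d)) * d ** -0.5

def query_pool(features):
    """Cross-attend the learnable queries over fused audio-video tokens.

    Returns a (num_queries, d) summary regardless of the input length,
    which is what lets a pretrained generator consume a fixed interface.
    """
    q = queries @ W_q
    k = features @ W_k
    v = features @ W_v
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v

# Fused audio-video tokens of arbitrary length (e.g. per-frame features).
fused = rng.normal(size=(120, d))
summary = query_pool(fused)
print(summary.shape)  # (8, 64)
```

The key property is that `query_pool` maps any number of input tokens to the same fixed-size output, decoupling the MLLM's variable-length context from the generator's expected input.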

Reference

JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.