
RSAgent: Agentic MLLM for Text-Guided Segmentation

Published: Dec 30, 2025 06:50
1 min read
ArXiv

Analysis

This paper introduces RSAgent, an agentic MLLM designed to improve text-guided object segmentation. Its key innovation is a multi-turn approach that iteratively refines segmentation masks through tool invocations and feedback, enabling verification, refocusing, and refinement where one-shot methods fall short. The paper's significance lies in its agent-based treatment of a challenging computer vision task, with state-of-the-art performance demonstrated on multiple benchmarks.
Reference

RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance.
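The multi-turn verify-and-refine loop described above can be sketched as follows. This is an illustrative toy, not RSAgent's implementation: `propose_mask` and `verify_mask` are hypothetical stand-ins for the model's actual tool invocations, and the "pixels" are just integer sets.

```python
# Toy sketch of an agentic segment -> verify -> refine loop.
# Not RSAgent's code: the tools below are invented stand-ins.

def propose_mask(query, prev_mask=None, feedback=None):
    """Toy 'segment' tool: grows the mask a little each turn."""
    size = 0 if prev_mask is None else len(prev_mask)
    return set(range(size + 4))  # pretend each refinement adds 4 pixels

def verify_mask(mask, target):
    """Toy verifier: IoU between predicted and target pixel sets."""
    union = len(mask | target)
    return len(mask & target) / union if union else 0.0

def agentic_segment(query, target, iou_threshold=0.9, max_turns=5):
    """Multi-turn loop: propose, verify, refine until accepted."""
    mask, history = None, []
    for turn in range(max_turns):
        mask = propose_mask(query, mask)
        iou = verify_mask(mask, target)
        history.append((turn, iou))
        if iou >= iou_threshold:  # verifier accepts: stop refining
            break
    return mask, history

target = set(range(12))  # pretend 12 pixels belong to the object
mask, history = agentic_segment("the red car", target)
```

In this toy run the verifier rejects the first two masks and accepts the third, which is the refocus-and-refine behavior the multi-turn design enables.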

JavisGPT: Unified MLLM for Joint Audio-Video Comprehension and Generation

Analysis

This paper introduces JavisGPT, a novel multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. Its significance lies in a unified architecture, the SyncFusion module for spatio-temporal fusion, and learnable queries that connect the model to a pretrained generator. A large-scale instruction dataset (JavisInst-Omni), with over 200K dialogues, is central to training and evaluating the model. The paper advances the state of the art in understanding and generating content from both audio and video inputs, especially in complex and synchronized scenarios.
Reference

JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.
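The idea of learnable queries pooling fused audio-video features can be illustrated with plain scaled dot-product attention. This is a toy sketch, not JavisGPT's SyncFusion module: the feature values, the single query, and the tiny dimensionality are all invented for illustration.

```python
# Toy sketch: a learnable query attends over concatenated
# audio+video features. Not JavisGPT's actual module.
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, features, dim):
    """Each query pools the fused audio-video features into one vector."""
    out = []
    for q in queries:
        scores = softmax([sum(a * b for a, b in zip(q, f)) / math.sqrt(dim)
                          for f in features])
        pooled = [sum(w * f[k] for w, f in zip(scores, features))
                  for k in range(dim)]
        out.append(pooled)
    return out

audio = [[1.0, 0.0], [0.5, 0.5]]   # toy audio features
video = [[0.0, 1.0]]               # toy video features
queries = [[1.0, 0.0]]             # would be learned in practice
fused = attend(queries, audio + video, dim=2)
```

The query, being aligned with the first audio feature, pools more weight from audio than video; with learned queries, this weighting is what training shapes.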

Research #MLLM · Analyzed: Jan 10, 2026 08:34

D2Pruner: A Novel Approach to Token Pruning in MLLMs

Published: Dec 22, 2025 14:42
1 min read
ArXiv

Analysis

This research paper introduces D2Pruner, a method to improve the efficiency of Multimodal Large Language Models (MLLMs) through token pruning. The work debiases importance estimates and promotes structural diversity in the token selection process, potentially yielding faster, more efficient MLLMs.
Reference

The paper focuses on debiasing importance and promoting structural diversity in the token selection process.
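Importance-plus-diversity token selection of this general flavor can be sketched as a greedy, MMR-style loop. This is an illustrative assumption, not D2Pruner's published algorithm: `prune_tokens`, the diversity weight, and the toy embeddings are all hypothetical.

```python
# Toy sketch: keep high-importance tokens while penalizing redundancy
# with already-kept tokens. Not D2Pruner's actual algorithm.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_tokens(embeddings, importance, keep, diversity_weight=0.5):
    """Greedily keep `keep` tokens, scoring each candidate by importance
    minus its max similarity to tokens already selected (MMR-style)."""
    selected = []
    candidates = list(range(len(embeddings)))
    while candidates and len(selected) < keep:
        def score(i):
            redundancy = max((cosine(embeddings[i], embeddings[j])
                              for j in selected), default=0.0)
            return importance[i] - diversity_weight * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Two near-duplicate tokens (indices 0, 1) and one distinct token (index 2):
selected = prune_tokens(
    embeddings=[[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    importance=[1.0, 0.9, 0.5],
    keep=2)
```

Note how the diversity penalty makes the second pick the structurally distinct token rather than the near-duplicate with higher raw importance.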

Research #MLLM · Analyzed: Jan 10, 2026 09:43

CodeDance: Enhancing Visual Reasoning with Dynamic Tool Integration

Published: Dec 19, 2025 07:52
1 min read
ArXiv

Analysis

This research introduces CodeDance, a novel approach to visual reasoning. By integrating dynamically invoked tools into the MLLM framework, CodeDance advances executable visual reasoning: intermediate reasoning steps can be executed rather than merely described in text.
Reference

CodeDance is a Dynamic Tool-integrated MLLM for Executable Visual Reasoning.
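Executable visual reasoning of this general shape — a model emitting code that the host runs, with results fed back as observations — can be sketched as follows. This toy loop is an assumption for illustration, not CodeDance's framework: `fake_model` stands in for the MLLM, and a real system would sandbox execution rather than call `eval` directly.

```python
# Toy sketch of a code-emitting reasoning loop.
# Not CodeDance's framework: fake_model stands in for the MLLM.

def fake_model(question, observation):
    """Toy policy: emits a Python expression for a counting question,
    then stops once it has an observation."""
    if observation is None:
        return "len([obj for obj in scene if obj == 'car'])"
    return None  # done after one tool call

def run_tool(code, env):
    # Real systems would sandbox this instead of calling eval directly.
    return eval(code, {}, env)

def reason(question, scene):
    """Alternate between model-emitted code and executed observations."""
    observation = None
    while True:
        code = fake_model(question, observation)
        if code is None:
            return observation
        observation = run_tool(code, {"scene": scene})

answer = reason("how many cars?", ["car", "tree", "car"])
```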

UniGen-1.5: Unified MLLM for Image Understanding, Generation, and Editing

Analysis

The article introduces UniGen-1.5, an updated multimodal large language model (MLLM) from Apple ML for image understanding, generation, and editing. The core innovation is a unified reinforcement learning (RL) strategy that uses shared reward models to improve image generation and editing simultaneously, enhancing performance across image-related tasks. The article also mentions a 'light Edit Instruction Alignment stage' that further boosts image editing, suggesting a focus on practical refinement of existing techniques. The unified approach and shared rewards point to potential efficiency gains in training and a more cohesive model.
Reference

We present UniGen-1.5, a unified multimodal large language model (MLLM) for advanced image understanding, generation and editing.
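A shared reward model scoring rollouts from both generation and editing, as the unified RL strategy suggests, might look like the following toy sketch. This is not UniGen-1.5's training code: `shared_reward`, the word-overlap reward, and the mean-reward baseline are illustrative assumptions.

```python
# Toy sketch: one reward model scores both "generate" and "edit"
# rollouts; advantages use a baseline shared across tasks.
# Not UniGen-1.5's training code.

def shared_reward(prompt, image_tokens):
    """Toy reward: fraction of prompt words 'covered' by the output."""
    words = prompt.split()
    return len(set(words) & set(image_tokens)) / max(len(words), 1)

def rl_step(rollouts):
    """rollouts: list of (task, prompt, output_tokens).
    Returns REINFORCE-style advantages against a shared mean baseline."""
    rewards = [shared_reward(p, out) for _, p, out in rollouts]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

rollouts = [
    ("generate", "red car on road", ["red", "car", "road"]),
    ("edit", "make sky blue", ["sky", "green"]),
]
advantages = rl_step(rollouts)
```

Sharing the reward model (and here, the baseline) across both tasks is the sketch's stand-in for the "shared reward models" the abstract describes: both task types are pushed toward the same notion of prompt fidelity.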