EchoFoley: Event-Centric Sound Generation for Videos
Published: Dec 31, 2025 08:58 · 1 min read · ArXiv
Analysis
This paper addresses limitations in video-to-audio generation by introducing EchoFoley, a new task focused on fine-grained, event-level control over the sound effects generated for a video. To support the task, it proposes EchoVidia, a sounding-event-centric generation framework, and EchoFoley-6k, a large-scale benchmark dataset, and reports gains in both controllability and perceptual quality over existing methods. The emphasis on event-level control and hierarchical semantics is the paper's central contribution.
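To make "event-level and hierarchical control" concrete, here is a minimal Python sketch of what an event-centric conditioning input for video-grounded Foley could look like. This is an illustrative assumption, not the paper's actual interface: the class names (`SoundEvent`, `FoleySpec`), fields, and labels are hypothetical, chosen only to show events localized in time and described at two semantic granularities.

```python
# Hypothetical sketch of event-level, hierarchically tagged conditioning
# for video-grounded sound generation (not EchoVidia's real API).
from dataclasses import dataclass, field

@dataclass
class SoundEvent:
    """One sounding event, localized in time and tagged at two semantic levels."""
    onset_s: float      # event start, seconds into the video
    offset_s: float     # event end, seconds into the video
    coarse_label: str   # high-level class, e.g. "impact"
    fine_label: str     # fine-grained subclass, e.g. "glass shattering"

@dataclass
class FoleySpec:
    """Event-centric conditioning for a single video clip."""
    video_path: str
    events: list[SoundEvent] = field(default_factory=list)

    def events_at(self, t: float) -> list[SoundEvent]:
        """Events active at time t, useful for frame-aligned conditioning."""
        return [e for e in self.events if e.onset_s <= t < e.offset_s]

spec = FoleySpec(
    video_path="clip_0001.mp4",
    events=[
        SoundEvent(1.2, 1.6, "impact", "glass shattering"),
        SoundEvent(2.0, 4.5, "ambience", "rain on pavement"),
    ],
)
print([e.fine_label for e in spec.events_at(1.4)])  # -> ['glass shattering']
```

Under this framing, "event-level control" means the generator is conditioned on individual timed events rather than a single clip-level caption, and "hierarchical semantics" means each event carries both a coarse and a fine-grained description.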
Key Takeaways
- Introduces EchoFoley, a new task for video-grounded sound generation with event-level and hierarchical control.
- Proposes EchoVidia, a sounding-event-centric generation framework.
- Creates EchoFoley-6k, a large-scale benchmark dataset.
- Demonstrates improved controllability and perceptual quality compared to existing VT2A models.
Reference
“EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.”