EchoFoley: Event-Centric Sound Generation for Videos
Analysis
Key Takeaways
- Introduces EchoFoley, a new task for video-grounded sound generation with event-level and hierarchical control (see the sketch below).
- Proposes EchoVidia, a sounding-event-centric generation framework.
- Creates EchoFoley-6k, a large-scale benchmark dataset.
- Demonstrates improved controllability and perceptual quality compared to existing VT2A models.
“EchoVidia surpasses recent VT2A models by 40.7% in controllability and 12.5% in perceptual quality.”
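To make the idea of event-level control more concrete, here is a minimal sketch of how a timed sounding-event specification could be structured and flattened into a text conditioning prompt. The `SoundingEvent` schema, its field names, and the `render_prompt` helper are illustrative assumptions for this write-up, not EchoVidia's actual interface or the EchoFoley-6k annotation format.

```python
# Illustrative sketch only: the SoundingEvent schema and render_prompt helper
# are assumptions, not the paper's actual API or annotation format.
from dataclasses import dataclass
from typing import List


@dataclass
class SoundingEvent:
    """One sounding event within a video clip (hypothetical schema)."""
    label: str          # coarse event category, e.g. "footsteps"
    description: str    # finer-grained detail for hierarchical control
    start_s: float      # onset time within the clip, in seconds
    end_s: float        # offset time within the clip, in seconds


def render_prompt(events: List[SoundingEvent]) -> str:
    """Flatten a timed event list into a text prompt for a generator."""
    lines = [
        f"[{e.start_s:.1f}-{e.end_s:.1f}s] {e.label}: {e.description}"
        for e in sorted(events, key=lambda e: e.start_s)
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    clip_events = [
        SoundingEvent("footsteps", "boots on gravel, steady pace", 0.0, 3.2),
        SoundingEvent("door", "heavy wooden door creaks open", 3.0, 4.5),
    ]
    print(render_prompt(clip_events))
```

Keeping per-event onset and offset times alongside a coarse label and a finer description is one way event-level and hierarchical control could be expressed: the coarse label can be edited independently of the detailed description, and the timing fields ground each sound in the video.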