Research Paper · Computer Vision, Generative Models, Talking Heads · Analyzed: Jan 3, 2026 09:30
Real-time Dyadic Talking Head Generation with Low Latency
Published: Dec 30, 2025 18:43 · 1 min read · ArXiv
Analysis
This paper addresses the latency bottleneck in generating realistic dyadic talking head videos, where low delay is essential for natural listener feedback. The authors propose DyStream, a flow matching-based autoregressive model for real-time video generation driven by both speaker and listener audio. The key innovation is a stream-friendly autoregressive framework paired with a causal encoder whose lookahead module trades a small, bounded delay for higher visual quality. The work's significance is its potential to enable more natural and interactive virtual communication.
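The lookahead idea can be made concrete with an attention mask. The sketch below (not the authors' code; the window size `lookahead` is an illustrative parameter) shows a causal mask that additionally admits a fixed number of future frames, so each frame waits for at most that many future inputs before it can be emitted:

```python
# Minimal sketch of a causal attention mask with a fixed lookahead window,
# the mechanism by which a causal encoder can trade a bounded delay for
# extra context. Not DyStream's actual implementation.
import torch

def causal_lookahead_mask(seq_len: int, lookahead: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, seq_len); True = attention allowed.

    Position i may attend to positions j <= i + lookahead, so emitting
    frame i requires buffering at most `lookahead` future frames.
    """
    idx = torch.arange(seq_len)
    return idx.unsqueeze(1) + lookahead >= idx.unsqueeze(0)

# Example: with a 2-frame lookahead, frame 0 may attend to frames 0-2 only.
mask = causal_lookahead_mask(seq_len=6, lookahead=2)
print(mask.int())
```

A strictly causal mask (`lookahead=0`) would minimize latency but discard useful near-future audio context; a small positive window recovers quality while keeping the delay bounded.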
Key Takeaways
- Addresses the high latency problem in dyadic talking head generation.
- Proposes DyStream, a flow matching-based autoregressive model (see the sampler sketch after this list).
- Employs a stream-friendly autoregressive framework and a causal encoder with a lookahead module.
- Achieves real-time video generation with low latency (under 100 ms end to end).
- Demonstrates state-of-the-art lip-sync quality.
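For context on why flow matching suits real-time generation: sampling amounts to integrating a learned velocity field with a handful of ODE steps. The generic sketch below is not DyStream's model; the step count and the toy velocity field are illustrative stand-ins.

```python
# Generic flow-matching sampler sketch: integrate a learned velocity field
# v(x, t) from noise (t=0) toward data (t=1) with a few Euler steps.
# Few-step integration is what makes flow matching fast enough for streaming.
import torch

def sample_flow(velocity_fn, shape, steps: int = 8) -> torch.Tensor:
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (frame)."""
    x = torch.randn(shape)               # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), i * dt)
        x = x + dt * velocity_fn(x, t)   # one Euler step along the flow
    return x

# Toy velocity field standing in for the learned network.
toy_v = lambda x, t: -x
frame_latents = sample_flow(toy_v, shape=(1, 64))
```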
Reference
“DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively.”
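To see how the quoted numbers fit together, a back-of-envelope budget helps. The excerpt does not state the video frame rate; the calculation below assumes 25 fps (a common rate for talking-head corpora) and a hypothetical one-frame lookahead, both assumptions rather than reported figures.

```python
# Back-of-envelope latency budget. FPS and lookahead depth are assumptions;
# only the 34 ms per-frame generation time comes from the paper's excerpt.
FPS = 25
frame_interval_ms = 1000 / FPS           # 40 ms between incoming frames
gen_ms = 34                              # per-frame generation (from the paper)
lookahead_frames = 1                     # hypothetical lookahead depth

# End-to-end latency ~= buffering for the lookahead window + generation time.
latency_ms = lookahead_frames * frame_interval_ms + gen_ms
print(f"{latency_ms:.0f} ms")            # 74 ms, within the 100 ms target
```

Under these assumptions, the 34 ms generation time leaves comfortable headroom within the sub-100 ms system budget the authors report.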