Real-time Dyadic Talking Head Generation with Low Latency
Research Paper · Computer Vision, Generative Models, Talking Heads
Analyzed: Jan 3, 2026 09:30
Published: Dec 30, 2025 18:43
This paper addresses the latency bottleneck in generating realistic dyadic talking head videos, where low delay is essential for natural listener feedback. The authors propose DyStream, a flow-matching-based autoregressive model that generates video in real time from both speaker and listener audio. The key innovations are a stream-friendly autoregressive framework and a causal encoder with a lookahead module that balances quality against latency. The work's significance lies in its potential to enable more natural, interactive virtual communication.
Key Takeaways
- Addresses the high-latency problem in dyadic talking head generation.
- Proposes DyStream, a flow-matching-based autoregressive model.
- Employs a stream-friendly autoregressive framework and a causal encoder with a lookahead module.
- Achieves real-time video generation with low latency (under 100 ms).
- Demonstrates state-of-the-art lip-sync quality.
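The causal-encoder-with-lookahead idea above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `LOOKAHEAD` size, and the toy feature computation are all hypothetical, and the real model would run a flow-matching decode step where this sketch emits a number. The point it shows is the latency/quality trade-off: each output may only depend on past frames plus a small, fixed future window, so output lags input by exactly `LOOKAHEAD` frames.

```python
from collections import deque

LOOKAHEAD = 2  # hypothetical lookahead window (frames); larger = better context, more latency


def causal_encode(past, current, future):
    # Hypothetical stand-in for a causal encoder with lookahead:
    # output depends only on past/current frames plus a small future window.
    return sum(past) + current + 0.5 * sum(future)


def stream_generate(audio_frames):
    """Emit one output per input frame, delayed by LOOKAHEAD frames."""
    buffer = deque()
    past = []
    for frame in audio_frames:
        buffer.append(frame)
        if len(buffer) > LOOKAHEAD:  # enough future context buffered
            current = buffer.popleft()
            feat = causal_encode(past, current, list(buffer))
            past.append(current)
            yield feat  # in DyStream this would be a flow-matching video-frame decode
    while buffer:  # flush the tail: less future context is available at stream end
        current = buffer.popleft()
        feat = causal_encode(past, current, list(buffer))
        past.append(current)
        yield feat


frames = list(stream_generate([1, 2, 3, 4, 5]))
# One output per input; the first output appears only after LOOKAHEAD extra inputs arrive.
```

Note that the first output is emitted only after `LOOKAHEAD + 1` input frames have arrived, which is exactly the per-frame latency budget the paper's design must keep small (34 ms per frame in their reported results).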
Reference / Citation
"DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively."