Real-time Dyadic Talking Head Generation with Low Latency

Research Paper · Computer Vision, Generative Models, Talking Heads
Analyzed: Jan 3, 2026 09:30
Published: Dec 30, 2025 18:43
1 min read
ArXiv

Analysis

This paper addresses the latency bottleneck in generating realistic dyadic talking head videos, where low delay is essential for believable listener feedback. The authors propose DyStream, a flow matching-based autoregressive model for real-time video generation driven by both speaker and listener audio. The key contributions are a stream-friendly autoregressive framework and a causal encoder with a lookahead module that trades off quality against latency. The work's significance lies in enabling more natural, interactive virtual communication.
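To make the quality/latency trade-off concrete, here is a minimal sketch of how a causal encoder with a bounded lookahead window can be expressed as an attention mask. This is an illustrative assumption about how such a module might be structured, not the paper's actual implementation; the function name and the fixed-window formulation are hypothetical.

```python
import numpy as np

def causal_lookahead_mask(n_frames: int, lookahead: int) -> np.ndarray:
    """Boolean attention mask: frame i may attend to frames j <= i + lookahead.

    lookahead=0 is strictly causal (lowest latency); each extra frame of
    lookahead lets the encoder see a little future context, improving
    quality at the cost of delay. Illustrative sketch only.
    """
    idx = np.arange(n_frames)
    # mask[i, j] is True when frame i is allowed to attend to frame j
    return idx[None, :] <= (idx[:, None] + lookahead)

mask = causal_lookahead_mask(5, lookahead=1)
# With lookahead=1, frame 0 may attend to frames 0 and 1, but not frame 2.
```

Under this reading, a lookahead of L frames adds roughly L times the per-frame generation time to end-to-end delay, which is why the window must stay small to keep the reported system latency under 100 ms.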
Reference / Citation
"DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively."
ArXiv, Dec 30, 2025 18:43