Search:
Match:
1 results

Analysis

This paper addresses the critical latency issue in generating realistic dyadic talking head videos, which is essential for realistic listener feedback. The authors propose DyStream, a flow matching-based autoregressive model designed for real-time video generation from both speaker and listener audio. The key innovation lies in its stream-friendly autoregressive framework and a causal encoder with a lookahead module to balance quality and latency. The paper's significance lies in its potential to enable more natural and interactive virtual communication.
Reference

DyStream could generate video within 34 ms per frame, guaranteeing the entire system latency remains under 100 ms. Besides, it achieves state-of-the-art lip-sync quality, with offline and online LipSync Confidence scores of 8.13 and 7.61 on HDTF, respectively.