SoulX-LiveTalk: Real-Time Audio-Driven Avatars
Published: Dec 29, 2025 11:18 · ArXiv
Analysis
This paper introduces SoulX-LiveTalk, a 14B-parameter framework for generating high-fidelity, real-time, audio-driven avatars. Its two key innovations are a Self-correcting Bidirectional Distillation strategy, which preserves bidirectional attention for better motion coherence and visual detail, and a Multi-step Retrospective Self-Correction Mechanism, which prevents error accumulation during infinite-length generation. The work tackles the central difficulty of real-time avatar generation: balancing computational load against latency. Achieving sub-second start-up latency and real-time throughput at this model scale is a notable advance.
Key Takeaways
- Addresses the challenge of real-time, high-fidelity audio-driven avatar generation.
- Introduces Self-correcting Bidirectional Distillation for improved visual quality and motion coherence.
- Employs a Multi-step Retrospective Self-Correction Mechanism to prevent error accumulation.
- Achieves sub-second start-up latency and real-time throughput (32 FPS) with a 14B-parameter model.
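The summary gives only a high-level description of the self-correction idea. As a loose illustration of the general pattern (all function names and the correction rule here are hypothetical placeholders, not taken from the paper), a streaming generation loop that periodically revisits a window of recent output to limit drift might look like:

```python
from collections import deque

def stream_avatar_frames(audio_chunks, generate, correct, window=4):
    """Illustrative sketch: generate output chunk-by-chunk from audio, and
    after each step let `correct` refine the last `window` chunks, limiting
    error accumulation over an unbounded stream.
    `generate` and `correct` are hypothetical placeholders, not the paper's API."""
    history = deque(maxlen=window)          # recent chunks eligible for correction
    for audio in audio_chunks:
        frames = generate(audio, context=list(history))  # causal generation step
        history.append(frames)
        refined = correct(list(history))    # retrospective pass over the window
        history = deque(refined, maxlen=window)
        yield history[-1]                   # emit newest (possibly refined) chunk
```

The bounded `deque` is what keeps the correction cost constant per step, which is the property an infinite-generation setting requires.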
Reference
“SoulX-LiveTalk is the first 14B-scale system to achieve a sub-second start-up latency (0.87s) while reaching a real-time throughput of 32 FPS.”