ManchuTTS: High-Quality Speech Synthesis for an Endangered Language

Published: Dec 27, 2025 06:21
ArXiv

Analysis

This paper tackles speech synthesis for Manchu, an endangered language whose scarce training data and agglutinative morphology make the task difficult. The proposed ManchuTTS model combines a hierarchical text representation, cross-modal attention, a flow-matching Transformer, and a hierarchical contrastive loss. A dedicated dataset and data augmentation further support training. The reported results, a mean opinion score (MOS) of 4.52 together with gains in agglutinative word pronunciation and prosodic naturalness, mark a clear contribution to low-resource speech synthesis and language preservation.
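The summary does not describe how the flow-matching component is trained. As a rough illustration of the general technique only, here is a minimal sketch of the widely used conditional flow-matching objective with a straight-line path from noise to data. The function names, the three-value feature frame, and the use of plain Python lists are all assumptions made for this example and are not taken from the paper.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Build one training example for a linear noise-to-data path.

    x0: noise sample, x1: data sample, t: time in [0, 1].
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target


def cfm_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # hypothetical acoustic feature frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
t = rng.random()                         # random time step in [0, 1)
x_t, target = cfm_training_pair(x0, x1, t)

# A model that predicts the target velocity exactly incurs zero loss.
perfect_loss = cfm_loss(target, target)
# A model that always predicts zero velocity is penalized.
zero_loss = cfm_loss([0.0] * len(target), target)
```

In a real system a neural network (here, presumably the Transformer) would take x_t, t, and the text conditioning as input and be trained to minimize this loss; that part is omitted because the summary gives no architectural detail.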

Reference

ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset...outperforming all baseline models by a notable margin.
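For readers unfamiliar with the metric: a MOS is the arithmetic mean of listener ratings, conventionally on a 1 to 5 scale. The sketch below shows that calculation with a normal-approximation 95% confidence interval. The ratings are made up for illustration and are not the paper's data, and the helper name is hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for a list of 1-5 listener ratings."""
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    mos = statistics.mean(ratings)
    # Normal approximation; assumes independent ratings and at least two of them.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


ratings = [5, 4, 5, 4, 5, 5, 4, 4]  # illustrative ratings, not from the paper
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Reporting the interval alongside the mean helps show whether a gap between two systems is larger than the listener-to-listener variation.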