ManchuTTS: High-Quality Speech Synthesis for an Endangered Language
Analysis
This paper addresses speech synthesis for the endangered Manchu language, which faces severe data scarcity and complex agglutinative morphology. The proposed ManchuTTS model combines a hierarchical text representation, cross-modal attention, a flow-matching Transformer, and a hierarchical contrastive loss to overcome these challenges, and a dedicated dataset together with data augmentation further strengthens the results. With a high MOS score and clear gains in agglutinative word pronunciation and prosodic naturalness, the paper makes a significant contribution to low-resource speech synthesis and language preservation.
Key Takeaways
- Addresses the challenge of speech synthesis for a low-resource, agglutinative language (Manchu).
- Proposes a novel ManchuTTS model with a three-tier text representation and hierarchical attention.
- Employs a flow-matching Transformer for efficient, non-autoregressive generation (see the sketch after this list).
- Introduces a hierarchical contrastive loss for structured acoustic-linguistic correspondence (a sketch appears at the end of this note).
- Achieves state-of-the-art results with a high MOS score and significant improvements in pronunciation and prosody.
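The takeaways above only name the flow-matching Transformer; as a point of reference, the snippet below is a minimal sketch of a conditional flow-matching training step as it is commonly formulated (linear noise-to-data path, velocity regression), not the authors' implementation. The `velocity_model` and `text_cond` names and shapes are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(velocity_model, mel_target, text_cond):
    """One conditional flow-matching training step (generic formulation, not from the paper).

    mel_target: (batch, frames, n_mels) ground-truth mel-spectrogram
    text_cond:  (batch, frames, d_cond) aligned linguistic conditioning
    velocity_model: Transformer predicting a velocity field v(x_t, t, cond)
    """
    batch = mel_target.size(0)
    # Sample a random time t in [0, 1] for each utterance.
    t = torch.rand(batch, 1, 1, device=mel_target.device)
    # Gaussian noise is the source distribution at t = 0.
    noise = torch.randn_like(mel_target)
    # Linear interpolation path between noise (t = 0) and data (t = 1).
    x_t = (1.0 - t) * noise + t * mel_target
    # The velocity of this path is constant: data minus noise.
    target_velocity = mel_target - noise
    # The Transformer regresses the velocity from the noisy sample, time, and text.
    pred_velocity = velocity_model(x_t, t.view(batch), text_cond)
    return F.mse_loss(pred_velocity, target_velocity)
```

Generation is then non-autoregressive: starting from noise, the learned velocity field is integrated from t = 0 to t = 1 (e.g. in a few Euler steps), producing the whole mel-spectrogram in parallel rather than frame by frame.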
“ManchuTTS attains a MOS of 4.52 using a 5.2-hour training subset...outperforming all baseline models by a notable margin.”
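The hierarchical contrastive loss is likewise only named above; the sketch below shows one common way such a loss could be structured, as an InfoNCE-style term applied at several linguistic levels using pooled acoustic and text embeddings per unit. The level names and weights are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(acoustic, text, temperature=0.07):
    """InfoNCE between matched acoustic/text unit embeddings at one level.

    acoustic, text: (n_units, dim) embeddings pooled over one linguistic level;
    row i of both tensors describes the same unit (the positive pair).
    """
    a = F.normalize(acoustic, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = a @ t.T / temperature                      # pairwise similarities
    labels = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    # Symmetric loss: acoustic-to-text and text-to-acoustic retrieval.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))

def hierarchical_contrastive_loss(pairs_by_level, level_weights):
    """Weighted sum of contrastive terms over linguistic levels (illustrative).

    pairs_by_level: e.g. {"phoneme": (acoustic, text), "morpheme": (...), "word": (...)}
    level_weights:  e.g. {"phoneme": 1.0, "morpheme": 0.5, "word": 0.5}  # assumed weights
    """
    total = 0.0
    for level, (acoustic, text) in pairs_by_level.items():
        total = total + level_weights[level] * info_nce(acoustic, text)
    return total
```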