Mobile-Efficient Speech Emotion Recognition with Distilled HuBERT
Published: Dec 29, 2025 12:53
Source: ArXiv
Analysis
This paper addresses the challenge of deploying Speech Emotion Recognition (SER) on mobile devices by proposing a system based on DistilHuBERT. The authors report a quantized model footprint of 23 MB and an Unweighted Accuracy of 61.4%, approximately 91% of what a full-scale baseline achieves, which makes the model suitable for resource-constrained environments. Cross-corpus validation on IEMOCAP, CREMA-D, and RAVDESS shows how well the model generalizes and where it falls short, particularly on acted emotions.
Key Takeaways
- DistilHuBERT enables mobile-efficient SER with a quantized model footprint of 23 MB.
- Cross-corpus training improves generalization and performance.
- Theatrical acting styles in datasets like RAVDESS can affect emotion classification accuracy, leading to arousal-based clustering.
- The model retains approximately 91% of a full-scale baseline's Unweighted Accuracy, a trade-off between size and accuracy that suits mobile devices.
Reference
“The model achieves an Unweighted Accuracy of 61.4% with a quantized model footprint of only 23 MB, representing approximately 91% of the Unweighted Accuracy of a full-scale baseline.”
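For readers unfamiliar with the metric in the quote, Unweighted Accuracy in SER is conventionally the mean of per-class recalls, so every emotion class counts equally regardless of how many samples it has. The following is a minimal sketch of that convention, not the authors' evaluation code. The function names and toy labels are hypothetical, and the implied baseline figure is only back-of-the-envelope arithmetic from the rounded "approximately 91%" in the quote.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each emotion class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: frequent classes dominate the score."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (not from the paper): an imbalanced test set
# where the classifier over-predicts the majority class.
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

ua = unweighted_accuracy(y_true, y_pred)  # (1.0 + 0.5 + 0.0) / 3
wa = weighted_accuracy(y_true, y_pred)    # 7 correct out of 10

# Rough arithmetic from the quote: if 61.4% is about 91% of the baseline,
# the baseline UA is about 61.4 / 0.91 (inexact, since 91% is rounded).
implied_baseline_ua = 61.4 / 0.91

print(f"UA = {ua:.3f}, WA = {wa:.3f}")
print(f"Implied baseline UA is roughly {implied_baseline_ua:.1f}%")
```

On this toy data the plain accuracy looks better than the Unweighted Accuracy because the majority class is always predicted correctly, which is why class-imbalanced emotion corpora are usually reported with the unweighted figure.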