# SynTTS-Commands Media Benchmarks
## Project Navigation
Welcome to the official model repository for the paper "SynTTS-Commands". Here you can find the pre-trained checkpoints for keyword-spotting (KWS) tasks; a minimal loading sketch follows the links below.
- **Paper:** Read the detailed technical report on arXiv.
- **Dataset:** Download the training data at SynTTS-Commands-Media-Dataset.
- **Code:** Access training scripts and inference code on GitHub.
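To get started with a checkpoint, here is a minimal loading sketch using `huggingface_hub`. The repo id and checkpoint filename below are placeholders, not confirmed file names; check this repository's file listing and the linked GitHub code for the actual paths and model classes.

```python
# Minimal sketch: download a checkpoint from this repo and load its weights.
from huggingface_hub import hf_hub_download
import torch

ckpt_path = hf_hub_download(
    repo_id="<org>/SynTTS-Commands-Media-Benchmarks",  # placeholder repo id
    filename="ds_cnn_en.pt",                           # placeholder filename
)
state_dict = torch.load(ckpt_path, map_location="cpu")
# Instantiate the matching architecture (see the GitHub code) and load weights:
# model = DSCNN(num_classes=...); model.load_state_dict(state_dict)
```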
## Benchmark Results and Analysis
We present a comprehensive benchmark of six representative acoustic models on the SynTTS-Commands-Media Dataset across both English (EN) and Chinese (ZH) subsets. All models are evaluated in terms of classification accuracy, cross-entropy loss, and parameter count, providing insights into the trade-offs between performance and model complexity in multilingual voice command recognition.
### Performance Summary
| Model | EN Loss | EN Accuracy | EN Params | ZH Loss | ZH Accuracy | ZH Params |
|---|---|---|---|---|---|---|
| MicroCNN | 0.2304 | 93.22% | 4,189 | 0.5579 | 80.14% | 4,255 |
| DS-CNN | 0.0166 | 99.46% | 30,103 | 0.0677 | 97.18% | 30,361 |
| TC-ResNet | 0.0347 | 98.87% | 68,431 | 0.0884 | 96.56% | 68,561 |
| CRNN | 0.0163 | 99.50% | 1.08M | 0.0636 | 97.42% | 1.08M |
| MobileNet-V1 | 0.0167 | 99.50% | 2.65M | 0.0552 | 97.92% | 2.65M |
| EfficientNet | 0.0182 | 99.41% | 4.72M | 0.0701 | 97.93% | 4.72M |
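For reference, the sketch below shows how the loss, accuracy, and parameter-count columns above could be reproduced in PyTorch, assuming a model and a `DataLoader` yielding `(features, label)` batches; the evaluation pipeline actually used for the paper is in the linked GitHub repository.

```python
# Sketch of computing mean cross-entropy loss, accuracy, and parameter count.
import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model, loader, device="cpu"):
    model.eval().to(device)
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x)
        total_loss += F.cross_entropy(logits, y, reduction="sum").item()
        correct += (logits.argmax(dim=1) == y).sum().item()
        seen += y.numel()
    return total_loss / seen, correct / seen  # mean CE loss, accuracy

def count_params(model):
    # Parameter count as reported in the table (all trainable and frozen weights).
    return sum(p.numel() for p in model.parameters())
```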
### Key Findings
Our results demonstrate that the SynTTS-Commands dataset supports high-accuracy command recognition in both languages. Notably, the top-performing models achieve over 99.4% accuracy on English and nearly 98% on Chinese, confirming the dataset's quality and suitability for real-world deployment.
**Top Performers:** Among all models, CRNN attains the best English accuracy (99.50%) and the lowest English loss (0.0163). MobileNet-V1 yields the lowest loss on Chinese (0.0552) and matches CRNN's 99.50% English accuracy. Interestingly, EfficientNet posts a marginally higher Chinese accuracy than MobileNet-V1 (97.93% vs. 97.92%) despite a higher loss, i.e., slightly better top-1 predictions with less calibrated output probabilities.
**Accuracy-Complexity Trade-off:** Lightweight models exhibit a clear trade-off. MicroCNN, with only ~4.2K parameters, achieves 93.22% accuracy on English but drops to 80.14% on Chinese, highlighting how difficult it is to model the tonal and phonetic richness of Mandarin with ultra-compact architectures. DS-CNN and TC-ResNet, each under 70K parameters, already deliver strong performance (>96.5% in both languages), underscoring their efficiency for resource-constrained applications.
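To make that parameter budget concrete, the sketch below defines a hypothetical CNN of roughly MicroCNN's size (~4.2K parameters). It is not the paper's MicroCNN architecture, which is not described here; it only illustrates how little capacity such a budget allows.

```python
# Hypothetical ultra-compact CNN at roughly MicroCNN's parameter budget (~4.2K).
import torch
import torch.nn as nn

class TinyKWSNet(nn.Module):
    def __init__(self, num_classes: int = 35):  # class count is an assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling keeps the classifier head tiny
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):  # x: (batch, 1, time, freq) log-mel/MFCC features
        return self.classifier(self.features(x).flatten(1))

model = TinyKWSNet()
print(sum(p.numel() for p in model.parameters()))  # ~4.2K parameters
```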
Overall, the benchmark establishes strong baselines across a wide spectrum of model scales, from the ultra-light MicroCNN to the modern EfficientNet, demonstrating that moderate-complexity models can deliver near-SOTA performance suitable for edge deployment.
## Citation
If you use these pre-trained models or the SynTTS-Commands dataset in your research, please cite our paper:
**SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech**
```bibtex
@misc{gan2025synttscommands,
      title={SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech},
      author={Lu Gan and Xi Li},
      year={2025},
      eprint={2511.07821},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.07821},
      doi={10.48550/arXiv.2511.07821}
}
```