SynTTS-Commands-Media Benchmarks


🚀 Project Navigation

Welcome to the official model repository for the paper "SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech". Here you can find the pre-trained checkpoints for keyword spotting (KWS) tasks.

  • 📄 Paper: Read the detailed technical report on arXiv.
  • 💾 Dataset: Download the training data at SynTTS-Commands-Media-Dataset.
  • 💻 Code: Access training scripts and inference code on GitHub.
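To try one of the released checkpoints, a minimal loading sketch is shown below. It assumes the checkpoints are ordinary PyTorch state_dicts; the file path, the models.DSCNN import, and NUM_COMMANDS are illustrative placeholders rather than the repository's actual names, so check the GitHub code above for the real layout.

import torch

# Placeholder import: the real model classes (MicroCNN, DS-CNN, TC-ResNet,
# CRNN, MobileNet-V1, EfficientNet) are defined in the training code on GitHub.
from models import DSCNN

NUM_COMMANDS = 10  # placeholder; set to the number of command classes in your subset

model = DSCNN(num_classes=NUM_COMMANDS)
state = torch.load("checkpoints/ds_cnn_en.pt", map_location="cpu")  # hypothetical path
model.load_state_dict(state)
model.eval()  # inference mode for keyword classification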


📈 Benchmark Results and Analysis

We present a comprehensive benchmark of six representative acoustic models on the SynTTS-Commands-Media Dataset across both English (EN) and Chinese (ZH) subsets. All models are evaluated in terms of classification accuracy, cross-entropy loss, and parameter count, providing insights into the trade-offs between performance and model complexity in multilingual voice command recognition.
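The metrics in the table below can be reproduced with an evaluation loop along the following lines. This is a minimal sketch, not the released evaluation code: it assumes a PyTorch model and a DataLoader yielding (features, labels) batches from the test split.

import torch
import torch.nn.functional as F

@torch.no_grad()
def evaluate(model: torch.nn.Module, loader) -> tuple[float, float, int]:
    """Return (mean cross-entropy loss, accuracy, parameter count)."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    for features, labels in loader:  # e.g. log-mel feature tensors and int class labels
        logits = model(features)
        total_loss += F.cross_entropy(logits, labels, reduction="sum").item()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    n_params = sum(p.numel() for p in model.parameters())
    return total_loss / seen, correct / seen, n_params

Parameter counts like those reported below can be obtained with sum(p.numel() for p in model.parameters()), though the exact counting convention used for the table is defined in the released code.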

Performance Summary

Model         EN Loss  EN Accuracy  EN Params  ZH Loss  ZH Accuracy  ZH Params
MicroCNN      0.2304   93.22%       4,189      0.5579   80.14%       4,255
DS-CNN        0.0166   99.46%       30,103     0.0677   97.18%       30,361
TC-ResNet     0.0347   98.87%       68,431     0.0884   96.56%       68,561
CRNN          0.0163   99.50%       1.08M      0.0636   97.42%       1.08M
MobileNet-V1  0.0167   99.50%       2.65M      0.0552   97.92%       2.65M
EfficientNet  0.0182   99.41%       4.72M      0.0701   97.93%       4.72M

πŸ” Key Findings

Our results demonstrate that the SynTTS-Commands dataset supports high-accuracy command recognition in both languages. Notably, the top-performing models achieve over 99.4% accuracy on English and nearly 98% on Chinese, confirming the dataset's quality and suitability for real-world deployment.

  • Top Performers: Among all models, CRNN attains the best English accuracy (99.50%) and the lowest English loss (0.0163). MobileNet-V1 yields the lowest Chinese loss (0.0552) while matching CRNN's 99.50% accuracy on English. EfficientNet edges out MobileNet-V1 on Chinese accuracy (97.93% vs. 97.92%) despite a higher loss (0.0701 vs. 0.0552), a reminder that accuracy and cross-entropy loss need not move together when models differ in how confidently they predict.

  • Accuracy-Complexity Trade-off: Lightweight models exhibit a clear trade-off. MicroCNN, with only ~4.2K parameters, reaches 93.22% accuracy on English but drops to 80.14% on Chinese, highlighting how difficult it is for ultra-compact architectures to capture the tonal and phonetic richness of Mandarin. DS-CNN and TC-ResNet, each with under 70K parameters, recover strong performance (>96.5% in both languages), underscoring their efficiency for resource-constrained applications.

Overall, the benchmark establishes strong baselines across a wide spectrum of model scales, from the ultra-light MicroCNN to the modern EfficientNet, demonstrating that moderate-complexity models can deliver near-state-of-the-art performance suitable for edge deployment.

📜 Citation

If you use these pre-trained models or the SynTTS-Commands dataset in your research, please cite our paper:

SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech

@misc{gan2025synttscommands,
      title={SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech}, 
      author={Lu Gan and Xi Li},
      year={2025},
      eprint={2511.07821},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2511.07821}, 
      doi={10.48550/arXiv.2511.07821}
}