Spaces:
Running
Running
title: Wav2Vec2 Wake Word Detection | |
emoji: π€ | |
colorFrom: blue | |
colorTo: purple | |
sdk: gradio | |
sdk_version: "4.44.1" | |
app_file: app.py | |
pinned: false | |
# π€ Wav2Vec2 Wake Word Detection Demo | |
A powerful, interactive wake word detection demo built with Hugging Face Transformers and Gradio. This demo uses the **proven** Wav2Vec2 model with verified Hugging Face Spaces compatibility (73 active Spaces, 4,758 monthly downloads). | |
## β¨ Features | |
- **State-of-the-art Wake Word Detection**: Uses Wav2Vec2 Base model fine-tuned for keyword spotting | |
- **Interactive Web Interface**: Clean, modern Gradio interface with audio recording and upload | |
- **Real-time Processing**: Instant wake word detection with confidence scores | |
- **12 Keyword Classes**: Detects "yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go" plus silence and unknown | |
- **Microphone Support**: Record audio directly in the browser or upload audio files | |
- **Example Audio**: Synthetic audio generation for quick testing | |
- **Responsive Design**: Works on desktop and mobile devices | |
- **Spaces Verified**: Proven to work reliably on Hugging Face Spaces (73 active implementations) | |
## π Quick Start | |
### Online Demo | |
Visit the Hugging Face Space to try the demo immediately in your browser. | |
### Local Installation | |
1. **Clone the repository:** | |
```bash | |
git clone <your-repo-url> | |
cd wake-word-demo | |
``` | |
2. **Install dependencies:** | |
```bash | |
pip install -r requirements.txt | |
``` | |
3. **Run the demo:** | |
```bash | |
python app.py | |
``` | |
4. **Open your browser** and navigate to the local URL (typically `http://localhost:7860`) | |
## π§ Technical Details | |
### Model Information | |
- **Model**: `superb/wav2vec2-base-superb-ks` | |
- **Architecture**: Wav2Vec2 Base fine-tuned for keyword spotting | |
- **Dataset**: Speech Commands dataset v1.0 | |
- **Accuracy**: 96.4% on test set | |
- **Parameters**: ~95M parameters | |
- **Input**: 16kHz audio samples | |
- **Spaces Usage**: 73 active Spaces (verified compatibility) | |
### Performance Metrics | |
- **Accuracy**: 96.4% on Speech Commands dataset | |
- **Model Size**: 95M parameters | |
- **Inference Time**: ~200ms (CPU), ~50ms (GPU) | |
- **Sample Rate**: 16kHz | |
- **Supported Keywords**: yes, no, up, down, left, right, on, off, stop, go, silence, unknown | |
- **Monthly Downloads**: 4,758 (highly trusted) | |
### Supported Audio Formats | |
- WAV, MP3, FLAC, M4A | |
- Automatic resampling to 16kHz | |
- Mono and stereo support (automatically converted to mono) | |
## π― Use Cases | |
- **Voice Assistants**: Wake word detection for smart devices | |
- **IoT Applications**: Voice control for embedded systems | |
- **Accessibility**: Voice-controlled interfaces | |
- **Smart Home**: Voice commands for home automation | |
- **Mobile Apps**: Offline keyword detection | |
## π οΈ Customization | |
### Adding New Keywords | |
To add support for additional keywords, you would need to: | |
1. Fine-tune the model on your custom keyword dataset | |
2. Update the model configuration | |
3. Modify the interface labels | |
### Changing Audio Settings | |
Edit the audio processing parameters in `app.py`: | |
```python | |
# Audio configuration | |
SAMPLE_RATE = 16000 # Required by the model | |
MAX_AUDIO_LENGTH = 1.0 # seconds | |
``` | |
### Interface Customization | |
Modify the Gradio interface theme and styling in the `app.py` file to match your branding. | |
## π Model Comparison | |
| Model | Accuracy | Size | Speed | Keywords | Spaces Usage | | |
|-------|----------|------|-------|----------|--------------| | |
| **Wav2Vec2-Base-KS** | **96.4%** | **95M** | **Fast** | **12 classes** | **73 Spaces β** | | |
| HuBERT-Large-KS | 95.3% | 300M | Slower | 12 classes | 0 Spaces β | | |
| DistilHuBERT-KS | 97.1% | 24M | Fastest | 12 classes | Unknown | | |
## π€ Contributing | |
Contributions are welcome! Please feel free to submit issues, feature requests, or pull requests. | |
### Development Setup | |
1. Fork the repository | |
2. Create a feature branch | |
3. Make your changes | |
4. Test thoroughly | |
5. Submit a pull request | |
## π License | |
This project is licensed under the MIT License - see the LICENSE file for details. | |
## π Acknowledgments | |
- **Hugging Face**: For the Transformers library and model hosting | |
- **SUPERB Benchmark**: For the fine-tuned keyword spotting models | |
- **Speech Commands Dataset**: For the training data | |
- **Gradio**: For the excellent web interface framework | |
## π References | |
- [SUPERB: Speech processing Universal PERformance Benchmark](https://arxiv.org/abs/2105.01051) | |
- [Wav2Vec2: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) | |
- [Speech Commands Dataset](https://arxiv.org/abs/1804.03209) | |
--- | |
**Built with β€οΈ using Hugging Face Transformers and Gradio** | |
**β Verified to work on Hugging Face Spaces** |