---
title: TeachingAssistant
emoji: πŸš€
colorFrom: gray
colorTo: blue
sdk: streamlit
sdk_version: 1.44.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Speech Recognition Module Refactoring

## Overview

The speech recognition module (`utils/stt.py`) has been refactored to support multiple ASR (Automatic Speech Recognition) models. The implementation now uses a factory pattern, so different speech recognition models can be swapped easily behind a single, consistent interface.

## Supported Models

### 1. Whisper (Default)
- Based on OpenAI's Whisper Large-v3 model
- High accuracy for general speech recognition
- No additional installation required

### 2. Parakeet
- NVIDIA's Parakeet-TDT-0.6B model
- Optimized for real-time transcription
- Requires additional installation (see below)

## Installation

### For Parakeet Support

To use the Parakeet model, you need to install the NeMo Toolkit:

```bash
pip install -U 'nemo_toolkit[asr]'
```

Alternatively, you can use the provided requirements file:

```bash
pip install -r requirements-parakeet.txt
```
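Since NeMo is an optional dependency, the module presumably needs to detect whether it is installed before offering the Parakeet model. A minimal sketch of such a guard (the helper name `parakeet_available` is hypothetical and not part of `utils/stt.py`):

```python
def parakeet_available() -> bool:
    """Report whether the optional NeMo ASR dependency is installed."""
    try:
        # nemo_toolkit[asr] provides this package; it is absent otherwise.
        import nemo.collections.asr  # noqa: F401
        return True
    except ImportError:
        return False
```

A check like this lets the application hide or disable the Parakeet option instead of crashing with an `ImportError` at transcription time.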

## Usage

### In the Web Application

The web application now includes a dropdown menu to select the ASR model. Simply choose your preferred model before uploading an audio file.

### Programmatic Usage

```python
from utils.stt import transcribe_audio

# Using the default Whisper model
text = transcribe_audio("path/to/audio.wav")

# Using the Parakeet model
text = transcribe_audio("path/to/audio.wav", model_name="parakeet")
```

### Direct Model Access

For more advanced usage, you can directly access the model classes:

```python
from utils.stt import ASRFactory

# Get a specific model instance
whisper_model = ASRFactory.get_model("whisper")
parakeet_model = ASRFactory.get_model("parakeet")

# Use the model directly
text = whisper_model.transcribe("path/to/audio.wav")
```

## Architecture

The refactored code follows these design patterns:

1. **Abstract Base Class**: `ASRModel` defines the interface for all speech recognition models
2. **Factory Pattern**: `ASRFactory` creates the appropriate model instance based on the requested model name
3. **Strategy Pattern**: Different model implementations can be swapped at runtime

This architecture makes it easy to add support for additional ASR models in the future.
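The three patterns above could fit together roughly as follows. The class and method names (`ASRModel`, `ASRFactory.get_model`, `transcribe`) mirror the API described in this document, but the bodies are placeholder sketches, not the actual implementations in `utils/stt.py`:

```python
from abc import ABC, abstractmethod


class ASRModel(ABC):
    """Abstract base class: the interface every ASR backend implements."""

    @abstractmethod
    def transcribe(self, audio_path: str) -> str:
        """Return the transcript of the audio file at audio_path."""


class WhisperModel(ASRModel):
    def transcribe(self, audio_path: str) -> str:
        # Placeholder: the real class would load and run Whisper Large-v3.
        return f"[whisper transcript of {audio_path}]"


class ParakeetModel(ASRModel):
    def transcribe(self, audio_path: str) -> str:
        # Placeholder: the real class would load and run Parakeet-TDT-0.6B.
        return f"[parakeet transcript of {audio_path}]"


class ASRFactory:
    """Factory: maps a model name to the matching ASRModel subclass."""

    _registry = {"whisper": WhisperModel, "parakeet": ParakeetModel}

    @classmethod
    def get_model(cls, name: str) -> ASRModel:
        try:
            return cls._registry[name]()
        except KeyError:
            raise ValueError(f"Unknown ASR model: {name!r}") from None
```

Adding a new backend then amounts to writing one `ASRModel` subclass and registering it in the factory's table.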