---
title: Accent Analyzer Agent
emoji: 🏒
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: Detection of various English accents
license: mit
---

# Accent Analyzer

This is a Streamlit-based web application that analyzes the English accent spoken in a video. Users provide a public video URL (MP4); the app transcribes the speech with Whisper Base, classifies the accent, and answers follow-up questions about the transcript with Gemma3:1b.

## What It Does

- Accepts a public **MP4 video URL**
- Extracts audio and transcribes it using **OpenAI Whisper Base**
- Detects accent using a **Jzuluaga/accent-id-commonaccent_xlsr-en-english** model
- Lets users ask **follow-up questions** about the transcript using **Gemma3**
- Deploys on **Hugging Face Spaces** (CPU-only); a sketch of the pipeline follows this list
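
A minimal sketch of that pipeline, assuming the URL is a direct MP4 link. The helper names and file paths here are illustrative; the real implementation lives under `src/`.

```python
import requests
import ffmpeg   # ffmpeg-python
import whisper  # openai-whisper

def download_video(url: str, path: str = "video.mp4") -> str:
    """Stream the remote MP4 to disk."""
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    return path

def extract_audio(video_path: str, audio_path: str = "audio.wav") -> str:
    """Extract a 16 kHz mono WAV track, the format the speech models expect."""
    ffmpeg.input(video_path).output(audio_path, ac=1, ar=16000).run(overwrite_output=True)
    return audio_path

def transcribe(audio_path: str) -> str:
    """Transcribe speech with the Whisper base model."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]
```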

---

## Tech Stack

- **Streamlit**: UI framework for the app.
- **OpenAI Whisper (base)**: Speech-to-text transcription.
- **Jzuluaga/accent-id-commonaccent_xlsr-en-english**: English accent classification.
- **Gemma3:1b via Ollama**: Answers follow-up questions using the transcript as context.
- **Docker**: Containerizes the app for deployment.
- **Hugging Face Spaces**: CPU-only hosting.

---

## Project Structure

```
accent-analyzer/
├── Dockerfile                  # Container setup
├── start.sh                    # Starts Ollama and the Streamlit app
├── README.md                   # App documentation
├── requirements.txt            # Python dependencies
├── streamlit_app.py            # Main UI app
└── src/
    ├── custome_interface.py    # SpeechBrain custom interface
    ├── tools/
    │   └── accent_tool.py      # Audio analysis tool
    └── app/
        └── main_agent.py       # Analysis + follow-up LLM agents
```
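
`main_agent.py` combines the analysis step and the follow-up LLM. One plausible wiring uses LangGraph (which is in the dependency list); the state fields and node names below are assumptions for illustration, not the repo's actual code.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict, total=False):
    url: str         # input video URL
    transcript: str  # produced by the analysis node
    accent: str      # produced by the analysis node
    question: str    # optional follow-up question
    answer: str      # produced by the follow-up node

def analyze_node(state: AgentState) -> AgentState:
    # Download the video, extract audio, then run Whisper and the accent
    # classifier (see the sketches elsewhere in this README). Stubbed here.
    return {"transcript": "...", "accent": "us"}

def followup_node(state: AgentState) -> AgentState:
    # Ask Gemma3 (via Ollama) a question grounded in the transcript. Stubbed here.
    return {"answer": "..."}

graph = StateGraph(AgentState)
graph.add_node("analyze", analyze_node)
graph.add_node("followup", followup_node)
graph.set_entry_point("analyze")
graph.add_edge("analyze", "followup")
graph.add_edge("followup", END)
agent = graph.compile()

# Usage: agent.invoke({"url": "https://...", "question": "Where is the speaker from?"})
```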

---

## Running Locally (GPU Required)

1. Clone the repo:

```bash
git clone https://huggingface.co/spaces/ash-171/accent-detection
cd accent-detection
```

2. Build the Docker image:

```bash
docker build -t accent-analyzer .
```

3. Run the container:

```bash
docker run --gpus all -p 8501:8501 accent-analyzer
```

4. Alternatively, run `streamlit run streamlit_app.py` to start the app locally without Docker.

5. Visit: [http://localhost:8501](http://localhost:8501)

---


## Requirements

`requirements.txt` should include at least:

```
streamlit>=1.25.0
requests==2.31.0
pydub==0.25.1
torch==1.11.0
torchaudio==0.11.0
speechbrain==0.5.12
transformers==4.29.2
ffmpeg-python==0.2.0
openai-whisper==20230314
numpy==1.22.4
langchain>=0.1.0
langchain-community>=0.0.30
torchvision==0.12.0
langgraph>=0.0.20
```

---

## Notes

- Gemma3:1b is served by **Ollama** inside the Docker container; make sure the model is pulled at build time.
- `custome_interface.py` is required by the accent model; the Dockerfile downloads it automatically.
- Video URLs must be **direct links** to `.mp4` files.
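
For reference, the accent model is loaded through SpeechBrain's `foreign_class` helper together with that custom interface, following the usage documented on the model card (the audio path is a placeholder):

```python
from speechbrain.pretrained.interfaces import foreign_class

# Downloads the model and its custom_interface.py from the Hugging Face Hub.
classifier = foreign_class(
    source="Jzuluaga/accent-id-commonaccent_xlsr-en-english",
    pymodule_file="custom_interface.py",
    classname="CustomEncoderWav2vec2Classifier",
)

# Returns class probabilities, the best score, its index, and the text label.
out_prob, score, index, text_lab = classifier.classify_file("audio.wav")
print(text_lab)  # e.g. ['us']
```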

---

## Example Prompt

```
Analyze this video: https://www.learningcontainer.com/wp-content/uploads/2020/05/sample-mp4-file.mp4
```

Then follow up with:

```
Where is the speaker probably from?
What is the tone or emotion?
Summarize the video.
```
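
Under the hood, a follow-up answer can be produced by handing the transcript to Gemma3:1b through Ollama. A rough sketch using `langchain-community` follows; the prompt wording is illustrative, not necessarily the app's exact prompt.

```python
from langchain_community.llms import Ollama

# Assumes an Ollama server is running locally with gemma3:1b already pulled.
llm = Ollama(model="gemma3:1b")

def answer_followup(transcript: str, question: str) -> str:
    # Ground the model's answer in the transcript by inlining it as context.
    prompt = (
        "You are given the transcript of a video.\n\n"
        f"Transcript:\n{transcript}\n\n"
        f"Question: {question}\n"
        "Answer using only information from the transcript."
    )
    return llm.invoke(prompt)
```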

---
## Acknowledgments

This project uses the following models, frameworks, and tools:

- [OpenAI Whisper](https://github.com/openai/whisper): Automatic speech recognition model.
- [SpeechBrain](https://speechbrain.readthedocs.io/): Toolkit used for building and fine-tuning speech processing models.
- [Accent-ID CommonAccent](https://huggingface.co/Jzuluaga/accent-id-commonaccent_xlsr-en-english): Fine-tuned wav2vec2 model hosted on Hugging Face for English accent classification.
- [CustomEncoderWav2vec2Classifier](https://huggingface.co/Jzuluaga/accent-id-commonaccent_xlsr-en-english/blob/main/custom_interface.py): Custom interface used to load and run the accent model.
- [Gemma3:1b](https://ollama.com/library/gemma3:1b) via [Ollama](https://ollama.com): Large language model used for natural language follow-up based on transcripts.
- [Streamlit](https://streamlit.io): Python framework for building web applications.
- [Hugging Face Spaces](https://huggingface.co/spaces): Platform used for deploying this application on CPU infrastructure.


---
## Performance Note

Because no GPU is available on the hosted Space, the app runs extremely slowly there. The output has been tested and verified on a local system.

---

## Author

- Developed by [Aswathi T S](https://github.com/ash-171)

---

## License

This project is licensed under the `MIT License`.