dattazigzag committed on
Commit
029ff89
·
verified ·
1 Parent(s): 2dee05a

Upload 3 files

Files changed (3)
  1. README.md +677 -10
  2. requirements.txt +12 -0
  3. runtime.txt +1 -0
README.md CHANGED
@@ -1,13 +1,680 @@
1
  ---
2
- title: Kokoro Onnx
3
- emoji: 📉
4
- colorFrom: yellow
5
- colorTo: green
6
- sdk: gradio
7
- sdk_version: 5.25.2
8
- app_file: app.py
9
- pinned: false
10
- license: mit
11
  ---
12
 
13
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # README
2
+
3
+ ## Model Source
4
+
5
+ ## Kokoro-82M
6
+
7
+ 1. [Hugging Face](https://huggingface.co/hexgrad/Kokoro-82M)
8
+ 2. [Github](https://github.com/hexgrad/kokoro/tree/main?tab=Apache-2.0-1-ov-file)
9
+
10
+ ## Kokoro-onnx
11
+
12
+ 1. [Github](https://github.com/thewh1teagle/kokoro-onnx)
13
+
14
+ ## About Kokoro
15
+
16
+ Kokoro ("heart" or "spirit" in Japanese) is an open-weight TTS model with only 82 million parameters. Despite its small size, it delivers impressive voice quality across multiple languages and voices.
17
+ The model was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels, making it suitable for both commercial and personal projects under the Apache 2.0 license.
18
+
19
+ ### Key Features (😊)
20
+
21
+ 1. __Lightweight Architecture__: Just 82M parameters, allowing for efficient inference
22
+ 2. __Multilingual Support__: 9 languages including English, Spanish, Japanese, and more
23
+ 3. __Multiple Voices__: 60+ voices across different languages and genders
24
+ 4. __Freedom to use commercially and privately__: kokoro is Apache 2.0 licensed and kokoro-onnx is MIT licensed
25
+ 5. __Strong Performance__: Competitive quality with much larger models. __Continuations__ and __annotations__ are great !! ✅
26
+ 6. __Near real-time performance__: as long as the model and voice are not changed, loading the optimized f32 model takes about 2 secs. This is loading time, not audio generation time.
27
+ 7. Can control the __speed__ of speaking and use __breaks__ as a feature, letting the underlying model find sentence breakpoints and thus audio features per sentence.
28
+ 8. ‼️ No Voice Cloning ‼️ ---- 💬 _Who wants that, are you crazy? I'm okay with that_
29
+ 9. ‼️ No German (DE) at the moment ‼️. But please [check below](#voices-summary) for a list of currently available languages.
30
+
31
+ > kokoro-onnx specific
32
+
33
+ 1. _Even faster, near real-time performance on macOS M-series chips_
34
+ 2. Can mix genders and voices (__Blending__) for interesting results
35
+
36
+ ```python
37
+ nicole: np.ndarray = kokoro.get_voice_style("af_nicole")
38
+ michael: np.ndarray = kokoro.get_voice_style("am_michael")
39
+ blend = np.add(nicole * (50 / 100), michael * (50 / 100))
40
+ ```
41
+
42
+ [img tbd]
43
+
44
+ 3. Can get _runtime providers_ via built in API and use them during runtime to get even better performance
45
+
46
+ ```python
47
+ # To list providers, activate your venv and start a Python REPL
48
+ >>> import onnxruntime
49
+ >>> onnxruntime.get_all_providers()
50
+
51
+ # you will see something like ...
52
+ ['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'OpenVINOExecutionProvider', 'DnnlExecutionProvider', 'VitisAIExecutionProvider', 'QNNExecutionProvider', 'NnapiExecutionProvider', 'VSINPUExecutionProvider', 'JsExecutionProvider', 'CoreMLExecutionProvider', 'ArmNNExecutionProvider', 'ACLExecutionProvider', 'DmlExecutionProvider', 'RknpuExecutionProvider', 'WebNNExecutionProvider', 'WebGpuExecutionProvider', 'XnnpackExecutionProvider', 'CANNExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']
53
+ ```
54
+
55
+ So on Apple Silicon (M-series) Macs, we can either pass the `'CoreMLExecutionProvider'` provider directly when creating a session in the program
56
+
57
+ ```python
58
+ session = onnxruntime.InferenceSession(
59
+ model, providers=['CoreMLExecutionProvider', 'CPUExecutionProvider']
60
+ )
61
+ ```
62
+
63
+ or, use it during runtime
64
+
65
+ ```bash
66
+ ONNX_PROVIDER="CoreMLExecutionProvider" python main_kokoro_onnx.py
67
+ ```
68
+
69
+ 3. We get extra log levels. This gives you better visibility into what's going on under the hood, not just in your own program ...
70
+
71
+ ```python
72
+ import logging
73
+ ...
74
+ logging.getLogger(kokoro_onnx.__name__).setLevel("DEBUG")
75
+ ```
76
+
77
+ 4. Have lightweight options:
78
+ 1. [kokoro-v1.0.onnx](https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx): (310MB): optimized f32 version
79
+ 2. [kokoro-v1.0.fp16.onnx](https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.fp16.onnx): (169MB): optimized f16 version
80
+ 3. [kokoro-v1.0.int8.onnx](https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.int8.onnx): (88MB): optimized int8 version
81
+ 5. There is a GPU-specific install of kokoro-onnx, but on a Mac it runs on the GPU by default.
82
+
83
+ ```bash
84
+ pip install -U "kokoro-onnx[gpu]"
85
+ # The gpu extra is only needed on Linux and Windows. macOS works with the GPU by default
86
+ ```
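
The `ONNX_PROVIDER` environment variable used above has to be read somewhere in the script. How this repo's scripts do that is not shown here, so the following is only a minimal sketch under that assumption. `pick_providers` is a hypothetical helper, and the hard-coded `available` list stands in for what `onnxruntime.get_available_providers()` would return on a given machine.

```python
import os

CPU_PROVIDER = "CPUExecutionProvider"


def pick_providers(requested, available):
    """Return an ordered provider list that always ends with the CPU fallback.

    `requested` is the value of ONNX_PROVIDER (may be None or misspelled),
    `available` is the list of providers the installed onnxruntime supports.
    """
    providers = []
    if requested and requested != CPU_PROVIDER and requested in available:
        providers.append(requested)
    providers.append(CPU_PROVIDER)
    return providers


# Assumption: in a real script this list would come from
# onnxruntime.get_available_providers(); it is hard-coded here for illustration.
available = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
requested = os.environ.get("ONNX_PROVIDER")
print(pick_providers(requested, available))
```

The resulting list could then be passed as `providers=` to `onnxruntime.InferenceSession`, as in the session example above. With this design, an unknown or misspelled provider name falls back to the CPU instead of failing at startup.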
87
+
88
+ ### In summary
89
+
90
+ | Feature | Kokoro | Kokoro-ONNX |
91
+ |---------|--------|-------------|
92
+ | **Architecture** | PyTorch-based with HuggingFace integration | ONNX-optimized for inference speed |
93
+ | **Model Loading** | `KPipeline(lang_code, repo_id, device)` downloads models from HuggingFace | `Kokoro.from_session(session, bin_file)` using local ONNX and voice files |
94
+ | **Language Codes** | Short codes (`"a"` = en-us, `"b"` = en-gb) | Standard language codes (`"en-us"`, `"en-gb"`) |
95
+ | **Audio Generation** | Generator pattern that yields `(graphemes, phonemes, audio)` chunks | Single API call returning `(samples, sample_rate)` |
96
+ | **Text Processing** | Supports various split patterns (`r"\n+"`, `(?<=[.!?])\s+`) | No built-in splitting (must be handled manually if needed) |
97
+ | **Hardware Acceleration** | Auto-detects CUDA/MPS/CPU | Explicitly configure providers (CoreML, CUDA, CPU) via environment variables |
98
+ | **Phonemization** | Handles internally as part of generator pattern | Separate `tokenizer.phonemize()` function (optional usage) |
99
+ | **Memory Management** | Streams audio in chunks, better for memory | Generates entire audio at once (can be an issue for long texts) |
100
+ | **Voice Data** | Downloads voice models as needed | Uses pre-bundled voice binary file |
101
+ | **Error Handling** | Detailed error handling for each generation stage | Simpler error handling for the single API call |
102
+ | **Implementation Example** | `generator = pl(text, voice, speed, split_pattern); for i, (gs, ps, audio) in enumerate(generator): ...` | `samples, sample_rate = kokoro.create(text, voice, speed, lang)` |
103
+
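
Since the table notes that kokoro-onnx has no built-in text splitting, long input has to be chunked by hand. A minimal sketch, reusing the sentence split pattern from the "Text Processing" row; `split_sentences` is a hypothetical helper name, not part of either library.

```python
import re

# Same pattern as listed for pure kokoro: split on whitespace that
# follows sentence-ending punctuation.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentence-sized chunks, dropping empty pieces."""
    return [part.strip() for part in SENTENCE_SPLIT.split(text) if part.strip()]


chunks = split_sentences("Hello there. How are you? Fine!")
print(chunks)
```

Each chunk could then be passed to `kokoro.create(...)` separately and the resulting sample arrays concatenated, which would also keep memory use per call small for long texts.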
104
+ ---
105
+
106
+ ## Kokoro's generated audio format
107
+
108
+ 1. __Sample Rate__: Fixed at 24kHz (24000 Hz)
109
+ 2. __Channels__: Mono (single channel)
110
+ 3. __Data Format__ (`Dtype Int8`, `Int16`, etc.): Selectable during saving data to a file. By default it uses 16-bit integer format (Int16)
111
+
112
+ > In kokoro-onnx, the sample rate is returned by the generation call, so it can be used directly for playback or file saving with proper formatting, whereas in pure kokoro we use the known hard-coded sample rate ...
113
+ > In kokoro-onnx, the default data format is high-quality 32-bit floating point.
114
+
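
To make the format notes concrete, here is a minimal sketch of converting floating-point samples to 16-bit PCM and wrapping them in a WAV container at 24 kHz mono. It uses only the Python standard library so it runs anywhere. The helper names are hypothetical, and real code in this repo most likely uses `numpy` and `soundfile` instead.

```python
import array
import io
import sys
import wave

SAMPLE_RATE = 24000  # Kokoro's fixed output rate


def float_to_int16(samples):
    """Clip floats to [-1.0, 1.0] and scale them to signed 16-bit integers."""
    return [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]


def wav_bytes(samples, sample_rate=SAMPLE_RATE):
    """Encode mono float samples as an in-memory 16-bit PCM WAV file."""
    pcm = array.array("h", float_to_int16(samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = Int16
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buf.getvalue()


one_second_of_silence = wav_bytes([0.0] * SAMPLE_RATE)
print(len(one_second_of_silence), "bytes")
```

Clipping before scaling avoids integer overflow when a float sample slightly exceeds the nominal range.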
115
+ ### Test System Info
116
+
117
+ The model has been tested on the following system:
118
+
119
+ > Datta's mac
120
+
121
+ ```txt
122
+ OS: macOS 15.3.2 24D81 arm64
123
+ CPU: Apple M3 Max
124
+ Memory: 3371MiB / 36864MiB
125
+ Python: 3.12
126
+ ```
127
+
128
+ > Lower-end systems should also be capable of running the model effectively due to its lightweight architecture.
129
+
130
+ ### Repository file structure
131
+
132
+ ```txt
133
+ .
134
+ ├── LICENSE
135
+ ├── README.md
136
+ ├── assets/
137
+ ├── audio_exports/
138
+ ├── examples
139
+ │   ├── 01_kk_play_save.py
140
+ │   └── 02_kk_onnx_play_save.py
141
+ ├── extras
142
+ │   ├── device_selection.py
143
+ │   ├── get_sound_device_info.py
144
+ │   ├── main_kokoro_intractive.py
145
+ │   └── save_to_disk_and_then_play.py
146
+ ├── kokoro_gradio.py
147
+ ├── kokoro_gradio_client_example.py
148
+ ├── kokoro_onnx_basic_main.py
149
+ ├── kokoro_onnx_gradio.py
150
+ ├── kokoro_onnx_gradio_client_example.py
151
+ ├── onnx_deps
152
+ │   ├── download_kokoro-onnx_deps.sh
153
+ │   ├── kokoro-v1.0.fp16-gpu.onnx
154
+ │   ├── kokoro-v1.0.fp16.onnx
155
+ │   ├── kokoro-v1.0.int8.onnx
156
+ │   ├── kokoro-v1.0.onnx
157
+ │   └── voices-v1.0.bin
158
+ ├── pyproject.toml
159
+ └── uv.lock
160
+ ```
161
+
162
+ 1. `assets/`: Assets for `README.md`
163
+ 2. `audio_exports/`: Dir where all the scripts export and save their audio from TTS, on disk
164
+ 3. `examples/`: Dir that has the two main headless Python scripts for using pure kokoro (`01_kk_play_save.py`) or kokoro-onnx (`02_kk_onnx_play_save.py`). _These scripts can be used as boilerplate or starting points for implementation_
165
+ 4. `extras/device_selection.py`: shows how to select the kokoro runtime device (`CPU/CUDA/MPS`) - _not the same for kokoro-onnx_
166
+ 5. `extras/get_sound_device_info.py`: shows how the Python `sounddevice` library can be used to identify available sound card devices
167
+ 6. `extras/main_kokoro_intractive.py`: a CLI interactive tool, using the kokoro Python lib, for testing language, voice and sentence combos for TTS. It was done before using the Gradio versions, but I left it there as it has a nice TUI that I like. Do not expect the same for kokoro-onnx (making a pretty TUI tool is not the goal).
168
+ 7. `extras/save_to_disk_and_then_play.py`: [TBD] Demo showing how to save TTS audio data to a WAV file, then load it and do audio transformation, specifically to embed playback sound card features directly into the file. Might be helpful when the `soundfile` Python lib cannot be used for audio streaming and audio needs to be played with a system-level player like `afplay` (mac) or `aplay` (Linux), programmatically from Python ...
169
+ 8. `onnx_deps`: various ONNX model files and voice pt files as .bin.
170
+ 9. `kokoro_gradio.py`: kokoro basic example using the Gradio web GUI as a playground.
171
+ 10. `kokoro_gradio_client_example.py`: example implementation showing how to interact with the Gradio kokoro server via API.
172
+ 11. `kokoro_onnx_gradio.py`: kokoro-onnx basic example using the Gradio web GUI as a playground.
173
+ 12. `kokoro_onnx_gradio_client_example.py`: example implementation to show how to interact with gradio kokoro-onnx server via API.
174
+
175
  ---
176
+
177
+ ## Option 1: Install from scratch
178
+
179
+ ```bash
180
+ # Make sure you have uv installed
181
+ uv init -p 3.12
182
+
183
+ # Create and activate virtual environment
184
+ source .venv/bin/activate.fish # or use your shell-specific activation command
185
+
186
+ # Install dependencies
187
+ uv add kokoro kokoro-onnx misaki[ja] misaki[zh] soundfile sounddevice pip colorama numpy torch scipy gradio gradio-client
188
+
189
+ # Extra
190
+ uv add
191
+ ```
192
+
193
+ ## Option 2: Quick Install from pyproject.toml
194
+
195
+ ```bash
196
+ # Make sure your virtual environment is activated
197
+ # source .venv/bin/activate.fish # or use your shell-specific activation command
198
+
199
+ uv pip install -e .
200
+ ```
201
+
202
+ ## For kokoro-onnx, download the models
203
+
204
+ ```bash
205
+ # ** For kokoro-onnx, download the ONNX models and voices locally
206
+ cd onnx_deps
207
+
208
+ # INT8 (88 MB):
209
+ wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.int8.onnx
210
+ # FP16 (169 MB):
211
+ wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.fp16.onnx
212
+ # FP32 (326 MB):
213
+ wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx
214
+
215
+ # ** For kokoro-onnx, download voices
216
+ wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin
217
+ ```
218
+
219
+ or run
220
+
221
+ ```bash
222
+ cd onnx_deps
223
+ ./download_kokoro-onnx_deps.sh
224
+
225
+ # By default the model version is set to v1.0
226
+
227
+ # Default version
228
+ # VERSION="v1.0"
229
+
230
+ # In the future you can just pass a version / tag to the script to download all .onnx and .bin for that specific release
231
+ # e.g.: ./download_kokoro-onnx_deps.sh v1.1
232
+ ```
233
+
234
+ ## Usage 1: Basic Usage for pure kokoro with gradio-gui
235
+
236
+ Implemented in Gradio for playing around ...
237
+
238
+ ```bash
239
+ # make sure you have venv activated !!
240
+ python3 kokoro_gradio.py
241
+
242
+ # or run on MacOS Apple Silicon GPU Acceleration
243
+ PYTORCH_ENABLE_MPS_FALLBACK=1 python3 kokoro_gradio.py
244
+ ```
245
+
246
+ > __Pure kokoro limitations__
247
+ >
248
+ > _Honestly, these are not a big deal_
249
+ >
250
+ > 1. Doesn't include voice blending
251
+ > 2. Programmatic assignment for GPU (process IO provider) not available
252
+ > 3. Model DEBUG info not available
253
+
254
+ ![alt text](<assets/Screenshot 2025-04-14 at 19.13.43.png>)
255
+
256
+ ## Usage 1: Scripted bare metal usage for pure kokoro from hexgrad
257
+
258
+ ### Test 1: process, combine, play and then save
259
+
260
+ 1. Load the language once and do not load it again, as it takes some time (approx. `1-2.5 secs`).
261
+ 2. Moreover, the self-imposed constraint (assumption) here is that we won't switch language in between.
262
+ 3. Check how long it takes to generate / process a chunk of audio, the first being a single-line sentence. Then play it from memory (from the audio data buffer) and save it to disk for reviewing ...
263
+ 4. Then process the 1st multi-line text (with paragraphs in it). Then play it from memory (from the audio data buffer) and save it to disk for reviewing ...
264
+ 5. Then process the 2nd multi-line text (also with paragraphs in it). Then also play it from memory (from the audio data buffer) and save it to disk for reviewing ...
265
+ 6. Each time TTS is carried out (text is processed and the audio data buffer is generated), the pipeline is not recreated. For pure kokoro, the pipeline only needs to be created when the language changes. The voice can be changed without re-initializing the pipeline (which is a tiny bit time consuming).
266
+ 7. __Extra__ 😉: _I added audio transformation for stream playback and file saving. Meaning, you can match your sound card's samplerate and bitrate, control gain, and specify the output channel/channels, all from within the code. Under the hood it uses the [sounddevice.Stream API](https://python-sounddevice.readthedocs.io/en/0.5.1/api/streams.html#sounddevice.Stream)._
267
+
268
+ ```bash
269
+ PYTORCH_ENABLE_MPS_FALLBACK=1 python3 examples/01_kk_play_save.py
270
+ ```
271
+
272
+ ### Result
273
+
274
+ ```txt
275
+ # Pt. 1
276
+ Loading pipeline ...
277
+ Process took: 2.333237 secs.
278
+
279
+ ...
280
+
281
+ # Pt. 3. (Single line with first time pipeline loaded)
282
+ Processing Single Line Text
283
+ Initializing generator Process took: 0.000012 secs.
284
+ Chunk 1 creation took: 0.000024 secs.
285
+ Streaming audio... (audio file length): 5.547685 secs.
286
+ Saving file to disk took: 0.006516 secs.
287
+ file size: 259 KB
288
+
289
+ # Pt. 4 (Multi line sentences with reusing pipeline)
290
+ Initializing generator Process took: 0.000006 secs
291
+ Chunk 1 of 3 took: 0.000029 secs.
292
+ Chunk 2 of 3 took: 0.000024 secs.
293
+ Chunk 3 of 3 took: 0.000028 secs
294
+ Combining took: 0.000361 secs.
295
+ Streaming audio... (combined audio file length): 34.360400 secs.
296
+ Saving file to disk took: 0.013127 secs.
297
+ file size: 1.6 MB
298
+
299
+ # Pt. 5 (Multi line sentences with the previous context while reusing the 1st loaded pipeline)
300
+ Initializing generator Process took: 0.000007 secs.
301
+ Chunk 1 of 3 took: 0.000023 secs.
302
+ Chunk 2 of 3 took: 0.000022 secs.
303
+ Chunk 3 of 3 took: 0.000024 secs.
304
+ Chunk 3 of 3 took: 0.000030 secs.
305
+ Combining took: 0.000704 secs.
306
+ Streaming audio... (combined audio file length): 43.408353 secs.
307
+ Saving file to disk took: 0.015742 secs.
308
+ file size: 2.1 MB
309
+ ```
310
+
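
As a rough sanity check on the file sizes reported above, the expected size of uncompressed PCM audio can be computed from its duration. The sketch below assumes the files are saved as 24 kHz, mono, 16-bit WAV (the defaults described in the audio format section) and ignores the small WAV header. `pcm16_bytes` is a hypothetical helper.

```python
def pcm16_bytes(duration_s, sample_rate=24000, channels=1):
    """Size in bytes of 16-bit PCM audio of the given duration."""
    frames = round(duration_s * sample_rate)
    return frames * channels * 2  # 2 bytes per Int16 sample


# Durations taken from the result log above.
for duration in (5.547685, 34.360400, 43.408353):
    size = pcm16_bytes(duration)
    print(f"{duration:.2f} s -> {size / 1_000_000:.2f} MB")
```

Under those assumptions the estimates land in the same range as the logged sizes (a few hundred KB for the single line, about 1.6 MB and 2.1 MB for the multi-line texts), which suggests the saved files keep Kokoro's native format rather than the sound card's.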
311
+ > Check Figma for more details
312
+
313
+ ---
314
+
315
+ ## Extra
316
+
317
+ ```bash
318
+ cd extras
319
+ PYTORCH_ENABLE_MPS_FALLBACK=1 python3 main_kokoro_intractive.py
320
+ ```
321
+
322
+ ### Interactive Commands
323
+
324
+ When running the application, you can use these commands:
325
+
326
+ 1. `lang?` or `l?` - Display currently set language code
327
+ 2. `voice?` or `v?` - Display currently set voice
328
+ 3. `playback?` or `p?` - Display playback options
329
+ 4. `set lang to [code]` - Change language code to one from the list
330
+ 5. `set voice to [name]` - Change voice to one from the list
331
+ 6. `set playback to [mode]` - Change playback mode (`file` or `stream`)
332
+
333
+ ### Notes
334
+
335
+ 1. In playback mode `file`, it saves the generated audio data to a file and plays back on default soundcard / playback device
336
+ 2. In playback mode `stream`, it saves the generated audio data too but before that plays back the data from the memory through the default soundcard / playback device. TBH, difference is not at all significant.
337
+
338
+ In file mode, in one example, the total time from generating to file saving to file loading took 0.0032 sec
339
+
340
+ ```txt
341
+ Current Settings:
342
+ Language: h
343
+ Voice: hf_alpha
344
+ Speed: 1
345
+ Playback: file
346
+
347
+
348
+ ...
349
+ Initializing generator ...
350
+ Process took: 0.0000 secs.
351
+ Generator initialized with voice: hf_alpha
352
+ Speed: 1
353
+ [0] TEXT(graphemes): It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
354
+
355
+ (Phonemes): ɪt wʌzɐ spɹˈɔːl vˈYs and ɐ spɹˈɔːl ʤˈQk. ðə ʧatsjˈuːbQ wʌzɐ bˈɑː fɔː pɹəfˈɛʃənəl ɛkspˈatɹɪˌAts; juː kʊd dɹˈɪŋk ðeə fəɹə wˈiːk and nˈɛvə hˈiə tˈuː wˈɜːdz ɪn ʤˌapənˈiːz.
356
+
357
+ Writing audio file (For immediate playback): audio_exports/0hf_alpha.wav
358
+ Writing audio file (For immediate playback): audio_exports/0hf_alpha.wav
359
+ Process took: 0.0032 secs.
360
+ Success โœ…
361
+ Playing audio file: audio_exports/0hf_alpha.wav
362
+ ....
363
+ ```
364
+
365
+ And in stream mode, it took 0.0017 sec (incl. file saving in the background ...)
366
+
367
+ ```txt
368
+ Initializing generator ...
369
+ Process took: 0.0000 secs.
370
+ Generator initialized with voice: hf_alpha
371
+ Speed: 1
372
+ [0] TEXT(graphemes): It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.
373
+
374
+ (Phonemes): ɪt wʌzɐ spɹˈɔːl vˈYs and ɐ spɹˈɔːl ʤˈQk. ðə ʧatsjˈuːbQ wʌzɐ bˈɑː fɔː pɹəfˈɛʃənəl ɛkspˈatɹɪˌAts; juː kʊd dɹˈɪŋk ðeə fəɹə wˈiːk and nˈɛvə hˈiə tˈuː wˈɜːdz ɪn ʤˌapənˈiːz.
375
+
376
+ Writing audio file (for recording purposes): audio_exports/0hf_alpha.wav
377
+ Process took: 0.0017 secs.
378
+ Success โœ…
379
+ Streaming audio...
380
+ ```
381
+
382
+ ## Usage 2: Basic Usage For pure kokoro-onnx with gradio-gui
383
+
384
+ ```bash
385
+ ONNX_PROVIDER="CoreMLExecutionProvider" python3 kokoro_onnx_gradio.py
386
+ ```
387
+
388
+ ![alt text](<assets/Screenshot 2025-04-14 at 19.27.25.png>)
389
+
390
+ ## Usage 2: Scripted bare metal usage for kokoro-onnx
391
+
392
+ ```bash
393
+ # On mac
394
+ ONNX_PROVIDER="CoreMLExecutionProvider" python3 examples/02_kk_onnx_play_save.py
395
+ ```
396
+
397
+ ### Result
398
+
399
+ ```txt
400
+ Model loading took: 3.503696 secs.
401
+ ONNX model loaded with provider: CoreMLExecutionProvider
402
+ Loading voice data from: /Users/saurabhdatta/Documents/Projects/VW/ArtificialAugmentation2025/tts_tests/kokoro_test/onnx_deps/voices-v1.0.bin
403
+ Voice loading took: 0.002724 secs.
404
+
405
+
406
+ Processing Single Line Text
407
+ Generating audio for text:
408
+ The sky above the port was the color of television, tuned to a dead channel.
409
+
410
+ Audio generation took: 2.039143 secs.
411
+ Playing audio on 2 channels at 48000Hz | Duration: 4.33 seconds
412
+ ...
413
+ Saving audio file...
414
+ Saving took: 0.011106 secs.
415
+
416
+
417
+ Processing Multi-Line Text 1
418
+ Generating audio for text:
419
+ Once upon a time, there was a little girl who lived in a village near the forest. Whenever she went out, the little girl wore a red riding cloak, so everyone in the village called her Little Red Riding Hood.
420
+
421
+ One morning, Little Red Riding Hood asked her mother if she could go to visit her grandmother as it had been awhile since they'd seen each other.
422
+
423
+ "That's a good idea," her mother said. "It's such a lovely day for a walk in the forest. Take this basket of fresh bread and butter to your grandmother, and remember - don't talk to strangers on the way!"
424
+
425
+ Audio generation took: 5.621120 secs.
426
+ Playing audio on 2 channels at 48000Hz | Duration: 30.14 seconds
427
+ ...
428
+ Saving audio file...
429
+ Saving took: 0.030169 secs.
430
+
431
+
432
+ Processing Multi-Line Text 2
433
+ Generating audio for text:
434
+ Little Red Riding Hood promised to be careful and set off immediately. The forest was dense and deep, with sunlight filtering through the leaves. Birds sang cheerfully as she walked along the path.
435
+
436
+ Suddenly, she met a wolf. "Hello, little girl," said the wolf in a voice as sweet as honey. "Where are you going all alone in the woods?"
437
+
438
+ Little Red Riding Hood didn't know that wolves could be dangerous, so she replied, "I'm going to visit my grandmother who lives on the other side of the forest."
439
+
440
+ The wolf smiled wickedly. "What a coincidence! I was just heading that way myself. Why don't you take the long path with all the beautiful flowers? I'll take the short path and meet you there."
441
+
442
+
443
+ Audio generation took: 6.558173 secs.
444
+ Starting playback | Duration: 37.95 seconds
445
+ ...
446
+ Saving audio file...
447
+ Saving took: 0.033002 secs.
448
+ ```
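
One way to read the kokoro-onnx timings above is as a real-time factor (RTF): generation time divided by audio duration, where values below 1 mean audio is produced faster than it plays back. A minimal sketch using the numbers from the log; `real_time_factor` is a hypothetical helper.

```python
def real_time_factor(generation_s, audio_s):
    """Generation time per second of audio; below 1.0 is faster than real time."""
    return generation_s / audio_s


# (label, "Audio generation took", "Duration") from the result log above.
runs = [
    ("single line", 2.039143, 4.33),
    ("multi-line 1", 5.621120, 30.14),
    ("multi-line 2", 6.558173, 37.95),
]

for label, generation_s, audio_s in runs:
    print(f"{label}: RTF = {real_time_factor(generation_s, audio_s):.2f}")
```

For this run the factors come out at roughly 0.47, 0.19 and 0.17, so longer inputs amortize the fixed per-call overhead better than a single short sentence.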
449
+
450
  ---
451
 
452
+ > 💡 For a more detailed side-by-side benchmark, please check out the [Figma space](https://www.figma.com/board/6tqMcW6uoGlxPVknI0biAO/Artificial-Augmentation---Ongoings?node-id=101-2815&t=vqCFuLhMcO6RZa5R-4)...
453
+
454
+
455
+ # Voices Summary
456
+
457
+ ## Lang codes
458
+
459
+ ```txt
460
+ # 🇺🇸 'a' => American English, 'b' => British English
461
+ # 'e' => Spanish es
462
+ # 'f' => French fr-fr
463
+ # 'h' => Hindi hi
464
+ # 'i' => Italian it
465
+ # 'j' => Japanese: pip install misaki[ja]
466
+ # 'p' => Brazilian Portuguese pt-br
467
+ # 'z' => Mandarin Chinese: pip install misaki[zh]
468
+ ```
469
+
470
+ | Kokoro Code | Standard Language Code | Language Description |
471
+ |-------------|------------------------|----------------------|
472
+ | `a` | `en-us` | 🇺🇸 American English |
473
+ | `b` | `en-gb` | 🇬🇧 British English |
474
+ | `e` | `es` | 🇪🇸 Spanish (Spain) |
475
+ | `f` | `fr-fr` | 🇫🇷 French (France) |
476
+ | `h` | `hi` | 🇮🇳 Hindi |
477
+ | `i` | `it` | 🇮🇹 Italian |
478
+ | `j` | `ja` | 🇯🇵 Japanese |
479
+ | `p` | `pt-br` | 🇧🇷 Brazilian Portuguese |
480
+ | `z` | `zh` | 🇨🇳 Mandarin Chinese |
481
+
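
Because pure kokoro takes the short codes while kokoro-onnx takes the standard ones, a small lookup helps when switching between the two. The mapping below is copied from the table above. Deriving the language from the first letter of a voice name is an assumption based on the naming pattern of the voices listed in this README, and `lang_for_voice` is a hypothetical helper.

```python
# Kokoro short code -> standard language code, as listed in the table above.
KOKORO_TO_STANDARD = {
    "a": "en-us",
    "b": "en-gb",
    "e": "es",
    "f": "fr-fr",
    "h": "hi",
    "i": "it",
    "j": "ja",
    "p": "pt-br",
    "z": "zh",
}


def lang_for_voice(voice):
    """Guess the standard language code from a voice name such as 'af_nicole'.

    Assumes the first letter of the voice name is its Kokoro language code.
    """
    code = voice[0]
    if code not in KOKORO_TO_STANDARD:
        raise ValueError(f"Unknown language prefix in voice name: {voice!r}")
    return KOKORO_TO_STANDARD[code]


print(lang_for_voice("af_nicole"))
print(lang_for_voice("hf_alpha"))
```

Raising on an unknown prefix keeps a typo in the voice name from silently selecting the wrong language.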
482
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
483
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
484
+ | **af\_heart** | 🚺❤️ | | | **A** | `0ab5709b` |
485
+ | af_alloy | 🚺 | B | MM minutes | C | `6d877149` |
486
+ | af_aoede | 🚺 | B | H hours | C+ | `c03bd1a4` |
487
+ | af_bella | 🚺🔥 | **A** | **HH hours** | **A-** | `8cb64e02` |
488
+ | af_jessica | 🚺 | C | MM minutes | D | `cdfdccb8` |
489
+ | af_kore | 🚺 | B | H hours | C+ | `8bfbc512` |
490
+ | af_nicole | 🚺🎧 | B | **HH hours** | B- | `c5561808` |
491
+ | af_nova | 🚺 | B | MM minutes | C | `e0233676` |
492
+ | af_river | 🚺 | C | MM minutes | D | `e149459b` |
493
+ | af_sarah | 🚺 | B | H hours | C+ | `49bd364e` |
494
+ | af_sky | 🚺 | B | _M minutes_ 🤏 | C- | `c799548a` |
495
+ | am_adam | 🚹 | D | H hours | F+ | `ced7e284` |
496
+ | am_echo | 🚹 | C | MM minutes | D | `8bcfdc85` |
497
+ | am_eric | 🚹 | C | MM minutes | D | `ada66f0e` |
498
+ | am_fenrir | 🚹 | B | H hours | C+ | `98e507ec` |
499
+ | am_liam | 🚹 | C | MM minutes | D | `c8255075` |
500
+ | am_michael | 🚹 | B | H hours | C+ | `9a443b79` |
501
+ | am_onyx | 🚹 | C | MM minutes | D | `e8452be1` |
502
+ | am_puck | 🚹 | B | H hours | C+ | `dd1d8973` |
503
+ | am_santa | 🚹 | C | _M minutes_ 🤏 | D- | `7f2f7582` |
504
+
505
+ <br>
506
+ <details>
507
+ <summary>More VOICE details (from hexgrad - maintainers of kokoro on HF)</summary>
508
+
509
+ - 🇺🇸 [American English](#american-english): 11F 9M
510
+ - 🇬🇧 [British English](#british-english): 4F 4M
511
+ - 🇯🇵 [Japanese](#japanese): 4F 1M
512
+ - 🇨🇳 [Mandarin Chinese](#mandarin-chinese): 4F 4M
513
+ - 🇪🇸 [Spanish](#spanish): 1F 2M
514
+ - 🇫🇷 [French](#french): 1F
515
+ - 🇮🇳 [Hindi](#hindi): 2F 2M
516
+ - 🇮🇹 [Italian](#italian): 1F 1M
517
+ - 🇧🇷 [Brazilian Portuguese](#brazilian-portuguese): 1F 2M
518
+
519
+ For each voice, the given grades are intended to be estimates of the **quality and quantity** of its associated training data, both of which impact overall inference quality.
520
+
521
+ Subjectively, voices will sound better or worse to different people.
522
+
523
+ Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).
524
+
525
+ Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:
526
+ - **Weakness** on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
527
+ - **Rushing** on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the `speed` parameter to mitigate this.
528
+
529
+ **Target Quality**
530
+ - How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
531
+ - How well do the text labels match the audio? Text/audio misalignment (e.g. from hallucinations) will lower this grade.
532
+
533
+ **Training Duration**
534
+ - How much audio was seen during training? Smaller durations result in a lower overall grade.
535
+ - 10 hours <= **HH hours** < 100 hours
536
+ - 1 hour <= H hours < 10 hours
537
+ - 10 minutes <= MM minutes < 100 minutes
538
+ - 1 minute <= _M minutes_ 🤏 < 10 minutes
539
+
540
+ ### American English
541
+
542
+ - `lang_code='a'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
543
+ - espeak-ng `en-us` fallback
544
+
545
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
546
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
547
+ | **af\_heart** | 🚺❤️ | | | **A** | `0ab5709b` |
548
+ | af_alloy | 🚺 | B | MM minutes | C | `6d877149` |
549
+ | af_aoede | 🚺 | B | H hours | C+ | `c03bd1a4` |
550
+ | af_bella | 🚺🔥 | **A** | **HH hours** | **A-** | `8cb64e02` |
551
+ | af_jessica | 🚺 | C | MM minutes | D | `cdfdccb8` |
552
+ | af_kore | 🚺 | B | H hours | C+ | `8bfbc512` |
553
+ | af_nicole | 🚺🎧 | B | **HH hours** | B- | `c5561808` |
554
+ | af_nova | 🚺 | B | MM minutes | C | `e0233676` |
555
+ | af_river | 🚺 | C | MM minutes | D | `e149459b` |
556
+ | af_sarah | 🚺 | B | H hours | C+ | `49bd364e` |
557
+ | af_sky | 🚺 | B | _M minutes_ 🤏 | C- | `c799548a` |
558
+ | am_adam | 🚹 | D | H hours | F+ | `ced7e284` |
559
+ | am_echo | 🚹 | C | MM minutes | D | `8bcfdc85` |
560
+ | am_eric | 🚹 | C | MM minutes | D | `ada66f0e` |
561
+ | am_fenrir | 🚹 | B | H hours | C+ | `98e507ec` |
562
+ | am_liam | 🚹 | C | MM minutes | D | `c8255075` |
563
+ | am_michael | 🚹 | B | H hours | C+ | `9a443b79` |
564
+ | am_onyx | 🚹 | C | MM minutes | D | `e8452be1` |
565
+ | am_puck | 🚹 | B | H hours | C+ | `dd1d8973` |
566
+ | am_santa | 🚹 | C | _M minutes_ 🤏 | D- | `7f2f7582` |
567
+
568
+ ### British English
569
+
570
+ - `lang_code='b'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
571
+ - espeak-ng `en-gb` fallback
572
+
573
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
574
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
575
+ | bf_alice | 🚺 | C | MM minutes | D | `d292651b` |
576
+ | bf_emma | 🚺 | B | **HH hours** | B- | `d0a423de` |
577
+ | bf_isabella | 🚺 | B | MM minutes | C | `cdd4c370` |
578
+ | bf_lily | 🚺 | C | MM minutes | D | `6e09c2e4` |
579
+ | bm_daniel | 🚹 | C | MM minutes | D | `fc3fce4e` |
580
+ | bm_fable | 🚹 | B | MM minutes | C | `d44935f3` |
581
+ | bm_george | 🚹 | B | MM minutes | C | `f1bc8122` |
582
+ | bm_lewis | 🚹 | C | H hours | D+ | `b5204750` |
583
+
584
+ ### Japanese
585
+
586
+ - `lang_code='j'` in [`misaki[ja]`](https://github.com/hexgrad/misaki)
587
+ - Total Japanese training data: H hours
588
+
589
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
590
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ | ----- |
591
+ | jf_alpha | 🚺 | B | H hours | C+ | `1bf4c9dc` | |
592
+ | jf_gongitsune | 🚺 | B | MM minutes | C | `1b171917` | [gongitsune](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__gongitsune.txt) |
593
+ | jf_nezumi | 🚺 | B | _M minutes_ 🤏 | C- | `d83f007a` | [nezuminoyomeiri](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__nezuminoyomeiri.txt) |
594
+ | jf_tebukuro | 🚺 | B | MM minutes | C | `0d691790` | [tebukurowokaini](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__tebukurowokaini.txt) |
595
+ | jm_kumo | 🚹 | B | _M minutes_ 🤏 | C- | `98340afd` | [kumonoito](https://github.com/koniwa/koniwa/blob/master/source/tnc/tnc__kumonoito.txt) |
596
+
597
+ ### Mandarin Chinese
598
+
599
+ - `lang_code='z'` in [`misaki[zh]`](https://github.com/hexgrad/misaki)
600
+ - Total Mandarin Chinese training data: H hours
601
+
602
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
603
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
604
+ | zf_xiaobei | 🚺 | C | MM minutes | D | `9b76be63` |
605
+ | zf_xiaoni | 🚺 | C | MM minutes | D | `95b49f16` |
606
+ | zf_xiaoxiao | 🚺 | C | MM minutes | D | `cfaf6f2d` |
607
+ | zf_xiaoyi | 🚺 | C | MM minutes | D | `b5235dba` |
608
+ | zm_yunjian | 🚹 | C | MM minutes | D | `76cbf8ba` |
609
+ | zm_yunxi | 🚹 | C | MM minutes | D | `dbe6e1ce` |
610
+ | zm_yunxia | 🚹 | C | MM minutes | D | `bb2b03b0` |
611
+ | zm_yunyang | 🚹 | C | MM minutes | D | `5238ac22` |
612
+
613
+ ### Spanish
614
+
615
+ - `lang_code='e'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
616
+ - espeak-ng `es`
617
+
618
+ | Name | Traits | SHA256 |
619
+ | ---- | ------ | ------ |
620
+ | ef_dora | 🚺 | `d9d69b0f` |
621
+ | em_alex | 🚹 | `5eac53f7` |
622
+ | em_santa | 🚹 | `aa8620cb` |
623
+
624
+ ### French
625
+
626
+ - `lang_code='f'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
627
+ - espeak-ng `fr-fr`
628
+ - Total French training data: <11 hours
629
+
630
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
631
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ | ----- |
632
+ | ff_siwis | 🚺 | B | <11 hours | B- | `8073bf2d` | [SIWIS](https://datashare.ed.ac.uk/handle/10283/2353) |
633
+
634
+ ### Hindi
635
+
636
+ - `lang_code='h'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
637
+ - espeak-ng `hi`
638
+ - Total Hindi training data: H hours
639
+
640
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
641
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
642
+ | hf_alpha | 🚺 | B | MM minutes | C | `06906fe0` |
643
+ | hf_beta | 🚺 | B | MM minutes | C | `63c0a1a6` |
644
+ | hm_omega | 🚹 | B | MM minutes | C | `b55f02a8` |
645
+ | hm_psi | 🚹 | B | MM minutes | C | `2f0f055c` |
646
+
647
+ ### Italian
648
+
649
+ - `lang_code='i'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
650
+ - espeak-ng `it`
651
+ - Total Italian training data: H hours
652
+
653
+ | Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
654
+ | ---- | ------ | -------------- | ----------------- | ------------- | ------ |
655
+ | if_sara | 🚺 | B | MM minutes | C | `6c0b253b` |
656
+ | im_nicola | 🚹 | B | MM minutes | C | `234ed066` |
657
+
658
+ ### Brazilian Portuguese
659
+
660
+ - `lang_code='p'` in [`misaki[en]`](https://github.com/hexgrad/misaki)
661
+ - espeak-ng `pt-br`
662
+
663
+ | Name | Traits | SHA256 |
664
+ | ---- | ------ | ------ |
665
+ | pf_dora | 🚺 | `07e4ff98` |
666
+ | pm_alex | 🚹 | `cf0ba8c5` |
667
+ | pm_santa | 🚹 | `d4210316` |
668
+ </details>
669
+
670
+ ----
671
+
672
+ ## Training
673
+
674
+ > Why on earth would you wanna do that ... 🤔 ?
675
+
676
+ | Training Costs | v0.19 | v1.0 | Total |
677
+ | --- | --- | --- | --- |
678
+ | In A100 80GB GPU hours | 500 | 500 | 1000 |
679
+ | Average hourly rate | $0.80/h | $1.20/h | $1/h |
680
+ | In USD | $400 | $600 | $1000 |
requirements.txt ADDED
@@ -0,0 +1,12 @@
1
+ torch>=2.6.0
2
+ gradio>=5.25.0
3
+ gradio-client>=1.8.0
4
+ kokoro>=0.9.4
5
+ kokoro-onnx>=0.4.8
6
+ misaki[ja,zh]>=0.9.4
7
+ numpy>=2.2.4
8
+ scipy>=1.15.2
9
+ sounddevice>=0.5.1
10
+ soundfile>=0.13.1
11
+ jack-client>=0.5.5
12
+ colorama>=0.4.6
runtime.txt ADDED
@@ -0,0 +1 @@
1
+ python-3.12