metadata
license: apache-2.0
title: kokoro_test
sdk: gradio
sdk_version: 5.25.2
emoji: 🐨
colorFrom: yellow
colorTo: red
pinned: true
short_description: a Gradio app for kokoro
thumbnail: >-
  https://cdn-uploads.huggingface.co/production/uploads/67ff88a5f2572583c60fd4cd/PkEkbfhPiKf8e_LQrfeuc.png

README

Model Source

Kokoro-82M

  1. Hugging Face
  2. Github

Kokoro-onnx

  1. Github

About Kokoro

Kokoro ("heart" or "spirit" in Japanese) is an open-weight TTS model with only 82 million parameters. Despite its small size, it delivers impressive voice quality across multiple languages and voices. The model was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels, making it suitable for both commercial and personal projects under the Apache 2.0 license.

Key Features (😊)

  1. Lightweight Architecture: Just 82M parameters, allowing for efficient inference
  2. Multilingual Support: 9 languages including English, Spanish, Japanese, and more
  3. Multiple Voices: 60+ voices across different languages and genders
  4. Freedom to use commercially and privately: kokoro is Apache 2.0 licensed & kokoro-onnx is MIT
  5. Strong Performance: Competitive quality with much larger models. Continuations and annotations are great!! ✅
  6. Near real-time performance: if models or voices are not changed, the loading time of the optimized f32 model is 2 secs. This is not the "audio generation time".
  7. Can control the speaking speed and use breaks as a feature, letting the underlying model pick up sentence breakpoints and thus produce audio features per sentence.
  8. ‼️ No Voice Cloning ‼️ ---- 💬 who wants that, are you crazy? I'm okay with that
  9. ‼️ No German (DE) at the moment ‼️. But please check below for a list of CURRENTLY available languages.

kokoro-onnx specific

  1. Even faster performance, near real-time on macOS M-series chips
  2. Can mix genders and voices (blending) for interesting results
# Blend two voice styles 50/50 (kokoro is an initialized kokoro_onnx.Kokoro instance)
nicole: np.ndarray = kokoro.get_voice_style("af_nicole")
michael: np.ndarray = kokoro.get_voice_style("am_michael")
blend = np.add(nicole * (50 / 100), michael * (50 / 100))
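
The blended style array can then be used in place of a named voice when generating audio. A minimal, hedged sketch continuing the snippet above (the text and keyword values are illustrative, and it assumes the kokoro-onnx create() call shown later in this README accepts a style array as the voice):

# hypothetical usage of the blend from above
samples, sample_rate = kokoro.create(
    "Hello from a blended voice.",  # illustrative text
    voice=blend,
    speed=1.0,
    lang="en-us",
)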

[img tbd]

  3. Can get runtime providers via a built-in API and use them at runtime to get even better performance
# To list providers, simply activate your venv
>>> import onnxruntime
>>> onnxruntime.get_all_providers()

# you will see something like ...
['TensorrtExecutionProvider', 'CUDAExecutionProvider', 'MIGraphXExecutionProvider', 'ROCMExecutionProvider', 'OpenVINOExecutionProvider', 'DnnlExecutionProvider', 'VitisAIExecutionProvider', 'QNNExecutionProvider', 'NnapiExecutionProvider', 'VSINPUExecutionProvider', 'JsExecutionProvider', 'CoreMLExecutionProvider', 'ArmNNExecutionProvider', 'ACLExecutionProvider', 'DmlExecutionProvider', 'RknpuExecutionProvider', 'WebNNExecutionProvider', 'WebGpuExecutionProvider', 'XnnpackExecutionProvider', 'CANNExecutionProvider', 'AzureExecutionProvider', 'CPUExecutionProvider']

So for Mac M1 systems, we can either use that provider, 'CoreMLExecutionProvider', directly in the program when creating a session

# 'model' here is the path to a local .onnx file, e.g. onnx_deps/kokoro-v1.0.onnx
session = onnxruntime.InferenceSession(
        model, providers=['CoreMLExecutionProvider', 'CPUExecutionProvider']
)
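
The session can then be handed to kokoro-onnx. A minimal sketch, assuming the Kokoro.from_session constructor referenced in the summary table below and the voice file already downloaded into onnx_deps/:

from kokoro_onnx import Kokoro

kokoro = Kokoro.from_session(session, "onnx_deps/voices-v1.0.bin")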

or, instead of building the session yourself, set the provider at runtime via an environment variable

ONNX_PROVIDER="CoreMLExecutionProvider" python main_kokoro_onnx.py
  4. We get extra log levels. This gives you better visibility into what's going on under the hood, not just in your program ...
import logging
import kokoro_onnx
...
# enable kokoro-onnx's own DEBUG logs
logging.getLogger(kokoro_onnx.__name__).setLevel("DEBUG")
  5. Lightweight model options:
    1. kokoro-v1.0.onnx: (310MB): optimized f32 version
    2. kokoro-v1.0.fp16.onnx: (169MB): optimized f16 version
    3. kokoro-v1.0.int8.onnx: (88MB): optimized int8 version
  6. There are GPU-specific versions of the ONNX models, but on macOS it always runs on the GPU.
pip install -U kokoro-onnx[gpu]
#  the [gpu] extra is only needed on Linux and Windows; macOS works with GPU by default

In summary

| Feature | Kokoro | Kokoro-ONNX |
| --- | --- | --- |
| Architecture | PyTorch-based with HuggingFace integration | ONNX-optimized for inference speed |
| Model Loading | KPipeline(lang_code, repo_id, device) downloads models from HuggingFace | Kokoro.from_session(session, bin_file) using local ONNX and voice files |
| Language Codes | Short codes ("a" = en-us, "b" = en-gb) | Standard language codes ("en-us", "en-gb") |
| Audio Generation | Generator pattern that yields (graphemes, phonemes, audio) chunks | Single API call returning (samples, sample_rate) |
| Text Processing | Supports various split patterns (r"\n+", (?<=[.!?])\s+) | No built-in splitting (must be handled manually if needed) |
| Hardware Acceleration | Auto-detects CUDA/MPS/CPU | Explicitly configure providers (CoreML, CUDA, CPU) via environment variables |
| Phonemization | Handles internally as part of generator pattern | Separate tokenizer.phonemize() function (optional usage) |
| Memory Management | Streams audio in chunks, better for memory | Generates entire audio at once (can be an issue for long texts) |
| Voice Data | Downloads voice models as needed | Uses pre-bundled voice binary file |
| Error Handling | Detailed error handling for each generation stage | Simpler error handling for the single API call |
| Implementation Example | generator = pl(text, voice, speed, split_pattern); for i, (gs, ps, audio) in enumerate(generator): ... | samples, sample_rate = kokoro.create(text, voice, speed, lang) |
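
A minimal sketch expanding the last row above (names such as pl and the keyword values are illustrative; both snippets assume the models/voices are already available as described in the install sections):

text = "The sky above the port was the color of television, tuned to a dead channel."

# Pure kokoro: generator pattern
from kokoro import KPipeline

pl = KPipeline(lang_code="a")  # "a" = American English
generator = pl(text, voice="af_heart", speed=1.0, split_pattern=r"\n+")
for i, (gs, ps, audio) in enumerate(generator):
    ...  # gs = graphemes, ps = phonemes, audio = 24 kHz audio chunk

# kokoro-onnx: single call
from kokoro_onnx import Kokoro

kokoro = Kokoro("onnx_deps/kokoro-v1.0.onnx", "onnx_deps/voices-v1.0.bin")
samples, sample_rate = kokoro.create(text, voice="af_heart", speed=1.0, lang="en-us")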

Kokoro's generated audio format

  1. Sample Rate: Fixed at 24kHz (24000 Hz)
  2. Channels: Mono (single channel)
  3. Data Format (Dtype Int8, Int16, etc.): Selectable when saving data to a file. By default it uses 16-bit integer format (Int16)

In the kokoro-onnx methods, the sample rate is returned by the function and can therefore be used for playback or file saving with proper formatting, whereas in the pure kokoro method we use the known hard-coded sample rate ... In the kokoro-onnx methods, the default data format is high-quality 32-bit floating point.
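
A hedged sketch of saving Kokoro output with these defaults using the soundfile library (the output path is illustrative, and the zero-filled array stands in for real model output):

import numpy as np
import soundfile as sf

SAMPLE_RATE = 24000  # Kokoro always generates 24 kHz mono audio
audio = np.zeros(SAMPLE_RATE, dtype=np.float32)  # placeholder for one second of generated speech
# write a 16-bit integer WAV, the default format mentioned above
sf.write("audio_exports/example.wav", audio, SAMPLE_RATE, subtype="PCM_16")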

Test System Info

The model has been tested on the following system:

Datta's mac

OS: macOS 15.3.2 24D81 arm64
CPU: Apple M3 Max
Memory: 3371MiB / 36864MiB
Python: 3.12

Lower-end systems should also be capable of running the model effectively due to its lightweight architecture.

Repository file structure

.
├── LICENSE
├── README.md
├── assets/
├── audio_exports/
├── examples
│   ├── 01_kk_play_save.py
│   └── 02_kk_onnx_play_save.py
├── extras
│   ├── device_selection.py
│   ├── get_sound_device_info.py
│   ├── main_kokoro_intractive.py
│   └── save_to_disk_and_then_play.py
├── kokoro_gradio.py
├── kokoro_gradio_client_example.py
├── kokoro_onnx_basic_main.py
├── kokoro_onnx_gradio.py
├── kokoro_onnx_gradio_client_example.py
├── onnx_deps
│   ├── download_kokoro-onnx_deps.sh
│   ├── kokoro-v1.0.fp16-gpu.onnx
│   ├── kokoro-v1.0.fp16.onnx
│   ├── kokoro-v1.0.int8.onnx
│   ├── kokoro-v1.0.onnx
│   └── voices-v1.0.bin
├── pyproject.toml
└── uv.lock
  1. assets/: Assets for README.md
  2. audio_exports/: Dir where all the scripts export and save their TTS audio on disk
  3. examples/: Dir with the two main headless Python scripts for using pure kokoro (01_kk_play_save.py) or kokoro-onnx (02_kk_onnx_play_save.py). **These scripts can be used as boilerplates or starting points for implementation
  4. extras/device_selection.py: shows how to select the kokoro runtime device (CPU/CUDA/MPS) - not the same as for kokoro-onnx
  5. extras/get_sound_device_info.py: shows how the Python sounddevice library can be used to identify available sound card devices
  6. extras/main_kokoro_intractive.py: an interactive CLI tool, using the kokoro Python lib, for testing language, voice and sentence combinations for TTS. It was written before the Gradio versions, but I left it in because it has a nice TUI that I like. Do not expect the same for kokoro-onnx (a pretty TUI tool is not the goal).
  7. extras/save_to_disk_and_then_play.py: [TBD] Demo showing how to save TTS audio data to a wav file, then load it and do audio transformation, specifically to embed playback soundcard features directly into the file. Might be helpful when the soundfile Python lib cannot be used for audio streaming and audio needs to be played with a system-level player like afplay (macOS) or aplay (Linux), programmatically from Python ...
  8. onnx_deps: various ONNX model files and the voice data bundled as .bin.
  9. kokoro_gradio.py: basic kokoro example using a Gradio web GUI as a playground.
  10. kokoro_gradio_client_example.py: example implementation showing how to interact with the Gradio kokoro server via its API.
  11. kokoro_onnx_gradio.py: basic kokoro-onnx example using a Gradio web GUI as a playground.
  12. kokoro_onnx_gradio_client_example.py: example implementation showing how to interact with the Gradio kokoro-onnx server via its API (see the sketch after this list).
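
As a taste of what those client examples do, here is a hedged sketch of calling a locally running Gradio app with gradio_client (the URL, argument order and api_name are illustrative/hypothetical and depend on how the Gradio interface is defined):

from gradio_client import Client

client = Client("http://127.0.0.1:7860/")  # a locally running kokoro_gradio.py server
result = client.predict(
    "Hello from the API client.",  # text (illustrative)
    "af_heart",                    # voice (illustrative)
    1.0,                           # speed (illustrative)
    api_name="/predict",           # hypothetical endpoint name
)
print(result)  # typically a filepath to the generated audio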

Option 1: Install from scratch

# Make sure you have uv installed
uv init -p 3.12

# Create and activate virtual environment
source .venv/bin/activate.fish # or use your shell-specific activation command

# Install dependencies
uv add kokoro kokoro-onnx misaki[ja] misaki[zh] soundfile sounddevice pip colorama numpy torch scipy gradio gradio-client

# Extra
uv add 

Option 2: Quick Install from pyproject.toml

# Make sure your virtual environment is activated
# source .venv/bin/activate.fish # or use your shell-specific activation command

uv pip install -e .

For kokoro-onnx, download the models

# ** For kokoro-onnx, download the ONNX models and voices locally
cd onnx_deps

# INT8 (88 MB):
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.int8.onnx
# FP16 (169 MB):
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.fp16.onnx
# FP32 (326 MB):
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/kokoro-v1.0.onnx

# ** For kokoro-onnx, download the voices
wget https://github.com/thewh1teagle/kokoro-onnx/releases/download/model-files-v1.0/voices-v1.0.bin

or run

cd onnx_deps
./download_kokoro-onnx_deps.sh

# By default the model version is set to v1.0

# Default version
# VERSION="v1.0"

# In the future you can just pass a version / tag to the script to download all .onnx and .bin for that specific release
# e.g.: ./download_kokoro-onnx_deps.sh v1.1

Usage 1: Basic Usage for pure kokoro with gradio-gui

Implemented in Gradio for playing around ...

# make sure you have venv activated !!
python3 kokoro_gradio.py

# or run on MacOS Apple Silicon GPU Acceleration
PYTORCH_ENABLE_MPS_FALLBACK=1 python3 kokoro_gradio.py

Pure kokoro limitations

Honestly, these are not a biggie

  1. Doesn't include voice blending
  2. Programmatic assignment for GPU (process IO provider) not available
  3. Model DEBUG info not available

[image: Gradio UI screenshot]

Usage 1: Scripted bare metal usage for pure kokoro from hexgrad

Test 1: process, combine, play and then save

  1. Load the language once and do not load it again, as it takes some time (approx. 1-2.5 secs).
  2. Moreover, the self-imposed constraint (assumption) here is that we won't switch languages in between.
  3. Check how long it takes to generate / process a chunk of audio; the first being a single-line sentence. Then play it from memory (from the audio data buffer) and then save it to disk for reviewing ...
  4. Then process the 1st multi-line text (with paragraphs in it). Then play it from memory (from the audio data buffer) and then save it to disk for reviewing ...
  5. Then process the 2nd multi-line text (also with paragraphs in it). Then also play it from memory (from the audio data buffer) and then save it to disk for reviewing ...
  6. Each time TTS is carried out (text is processed and an audio data buffer is generated), the pipeline wasn't recreated. The pipeline for pure kokoro only needs to be created when the language needs to change; the voice can be changed without re-initiating the pipeline (which is a tiny bit time-consuming) - see the sketch after this list.
  7. Extra 😉: I added audio transformation for stream playback and file saving. Meaning, you can match your sound card's samplerate and bitrate, control gain, and specify the output channel/channels, all from within the code; under the hood it uses the sounddevice.Stream API.
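
A minimal sketch of the reuse pattern described above (assuming the KPipeline usage from the summary table; texts and voices are illustrative, the full version lives in examples/01_kk_play_save.py):

from kokoro import KPipeline

pl = KPipeline(lang_code="a")  # created once; only recreate if the language changes
for voice, text in [("af_heart", "First test sentence."), ("af_bella", "Second test sentence.")]:
    for gs, ps, audio in pl(text, voice=voice, speed=1.0):
        ...  # play 'audio' from memory, then save it to disk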
PYTORCH_ENABLE_MPS_FALLBACK=1 python3 examples/01_kk_play_save.py

Result

# Pt. 1 
Loading pipeline ...
Process took:   2.333237 secs.

...

# Pt. 3. (Single line with first time pipeline loaded)
Processing Single Line Text
Initializing generator Process took:    0.000012 secs.
Chunk 1 creation took:                  0.000024 secs.
Streaming audio... (audio file length): 5.547685 secs.
Saving file to disk took:               0.006516 secs.
file size:                              259 KB

# Pt. 4 (Multi line sentences with reusing pipeline)
Initializing generator Process took:    0.000006 secs
Chunk 1 of 3 took:                      0.000029 secs.
Chunk 2 of 3 took:                      0.000024 secs.
Chunk 3 of 3 took:                      0.000028 secs
Combining took:                         0.000361 secs.
Streaming audio... (combined audio file length):  34.360400 secs.
Saving file to disk took:               0.013127 secs.      
file size:                              1.6 MB

# Pt. 5 Multi line sentences with the previous context while reusing the 1st loaded pipeline)
Initializing generator Process took:    0.000007 secs.
Chunk 1 of 3 took:                      0.000023 secs.
Chunk 2 of 3 took:                      0.000022 secs.
Chunk 3 of 3 took:                      0.000024 secs.
Chunk 3 of 3 took:                      0.000030 secs.
Combining took:                         0.000704 secs.
Streaming audio... (combined audio file length):  43.408353 secs.
Saving file to disk took:               0.015742 secs.      
file size:                              2.1 MB

Check Figma for more details


Extra

cd extras
PYTORCH_ENABLE_MPS_FALLBACK=1 python3 main_kokoro_intractive.py

Interactive Commands

When running the application, you can use these commands:

  1. lang? or l? - Display currently set language code
  2. voice? or v? - Display currently set voice
  3. playback? or p? - Display playback options
  4. set lang to [code] - Change language code to one from the list
  5. set voice to [name] - Change voice to one from the list
  6. set playback to [mode] - Change playback mode (file or stream)

Notes

  1. In playback mode file, it saves the generated audio data to a file and plays it back on the default soundcard / playback device
  2. In playback mode stream, it also saves the generated audio data, but first plays the data back from memory through the default soundcard / playback device (a sketch of both paths follows this list). TBH, the difference is not significant at all.
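
A hedged sketch of the two playback paths, using the sounddevice and soundfile libraries as elsewhere in this repo (the path and the zero-filled array are illustrative):

import numpy as np
import sounddevice as sd
import soundfile as sf

audio = np.zeros(24000, dtype=np.float32)  # placeholder for one second of generated speech

# stream mode: play straight from the in-memory buffer, then save
sd.play(audio, 24000)
sd.wait()
sf.write("audio_exports/out.wav", audio, 24000)

# file mode: save first, then play the saved file back on the default device
sf.write("audio_exports/out.wav", audio, 24000)
data, sample_rate = sf.read("audio_exports/out.wav")
sd.play(data, sample_rate)
sd.wait()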

In file mode, in one example, the total time from generation to file saving to file loading took 0.0032 sec

Current Settings:
        Language: h
        Voice: hf_alpha
        Speed: 1
        Playback: file


...
Initializing generator ...
Process took:   0.0000 secs.
Generator initialized with voice: hf_alpha 
Speed: 1
[0] TEXT(graphemes):     It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

        (Phonemes):      ษชt wสŒzษ spษนหˆษ”หl vหˆYs and ษ spษนหˆษ”หl สคหˆQk. รฐษ™ สงatsjหˆuหbQ wสŒzษ bหˆษ‘ห fษ”ห pษนษ™fหˆษ›สƒษ™nษ™l ษ›kspหˆatษนษชหŒAts; juห kสŠd dษนหˆษชล‹k รฐeษ™ fษ™ษนษ™ wหˆiหk and nหˆษ›vษ™ hหˆiษ™ tหˆuห wหˆษœหdz ษชn สคหŒapษ™nหˆiหz.

Writing audio file (For immediate playback):     audio_exports/0hf_alpha.wav
Writing audio file (For immediate playback):     audio_exports/0hf_alpha.wav
Process took:   0.0032 secs.
Success ✅
Playing audio file:      audio_exports/0hf_alpha.wav
....

And in stream mode, it took 0.0011 sec (incl. file saving in the background ...)

Initializing generator ...
Process took:   0.0000 secs.
Generator initialized with voice: hf_alpha 
Speed: 1
[0] TEXT(graphemes):     It was a Sprawl voice and a Sprawl joke. The Chatsubo was a bar for professional expatriates; you could drink there for a week and never hear two words in Japanese.

        (Phonemes):      ษชt wสŒzษ spษนหˆษ”หl vหˆYs and ษ spษนหˆษ”หl สคหˆQk. รฐษ™ สงatsjหˆuหbQ wสŒzษ bหˆษ‘ห fษ”ห pษนษ™fหˆษ›สƒษ™nษ™l ษ›kspหˆatษนษชหŒAts; juห kสŠd dษนหˆษชล‹k รฐeษ™ fษ™ษนษ™ wหˆiหk and nหˆษ›vษ™ hหˆiษ™ tหˆuห wหˆษœหdz ษชn สคหŒapษ™nหˆiหz.

Writing audio file (for recording purposes):     audio_exports/0hf_alpha.wav
Process took:   0.0017 secs.
Success ✅
Streaming audio...

Usage 2: Basic Usage for kokoro-onnx with gradio-gui

ONNX_PROVIDER="CoreMLExecutionProvider" python3 kokoro_onnx_gradio.py

[image: Gradio UI screenshot]

Usage 2: Scripted bare metal usage for kokoro-onnx

# On mac
ONNX_PROVIDER="CoreMLExecutionProvider" python3 examples/02_kk_onnx_play_save.py

Result

Model loading took:     3.503696 secs.
ONNX model loaded with provider: CoreMLExecutionProvider
Loading voice data from: /Users/saurabhdatta/Documents/Projects/VW/ArtificialAugmentation2025/tts_tests/kokoro_test/onnx_deps/voices-v1.0.bin
Voice loading took:     0.002724 secs.


Processing Single Line Text
Generating audio for text:
The sky above the port was the color of television, tuned to a dead channel.

Audio generation took:  2.039143 secs.
Playing audio on 2 channels at 48000Hz | Duration: 4.33 seconds
...
Saving audio file...
Saving took:    0.011106 secs.


Processing Multi-Line Text 1
Generating audio for text:
Once upon a time, there was a little girl who lived in a village near the forest. Whenever she went out, the little girl wore a red riding cloak, so everyone in the village called her Little Red Riding Hood.

One morning, Little Red Riding Hood asked her mother if she could go to visit her grandmother as it had been awhile since they'd seen each other.

"That's a good idea," her mother said. "It's such a lovely day for a walk in the forest. Take this basket of fresh bread and butter to your grandmother, and remember - don't talk to strangers on the way!"

Audio generation took:  5.621120 secs.
Playing audio on 2 channels at 48000Hz | Duration: 30.14 seconds
...
Saving audio file...
Saving took:    0.030169 secs.


Processing Multi-Line Text 2
Generating audio for text:
Little Red Riding Hood promised to be careful and set off immediately. The forest was dense and deep, with sunlight filtering through the leaves. Birds sang cheerfully as she walked along the path.

Suddenly, she met a wolf. "Hello, little girl," said the wolf in a voice as sweet as honey. "Where are you going all alone in the woods?"

Little Red Riding Hood didn't know that wolves could be dangerous, so she replied, "I'm going to visit my grandmother who lives on the other side of the forest."

The wolf smiled wickedly. "What a coincidence! I was just heading that way myself. Why don't you take the long path with all the beautiful flowers? I'll take the short path and meet you there."


Audio generation took:  6.558173 secs.
Starting playback | Duration: 37.95 seconds
...
Saving audio file...
Saving took:    0.033002 secs.

💡 For a more detailed side-by-side benchmark, please check out the Figma space ...

Voices Summary

Lang codes

# 🇺🇸 'a' => American English,  'b' => British English
#  'e' => Spanish es
#  'f' => French fr-fr
#  'h' => Hindi hi
#  'i' => Italian it
#  'j' => Japanese: pip install misaki[ja]
#  'p' => Brazilian Portuguese pt-br
#  'z' => Mandarin Chinese: pip install misaki[zh]
| Kokoro Code | Standard Language Code | Language Description |
| --- | --- | --- |
| a | en-us | 🇺🇸 American English |
| b | en-gb | 🇬🇧 British English |
| e | es | 🇪🇸 Spanish (Spain) |
| f | fr-fr | 🇫🇷 French (France) |
| h | hi | 🇮🇳 Hindi |
| i | it | 🇮🇹 Italian |
| j | ja | 🇯🇵 Japanese |
| p | pt-br | 🇧🇷 Brazilian Portuguese |
| z | zh | 🇨🇳 Mandarin Chinese |
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| af_heart | 🚺❤️ | | | A | 0ab5709b |
| af_alloy | 🚺 | B | MM minutes | C | 6d877149 |
| af_aoede | 🚺 | B | H hours | C+ | c03bd1a4 |
| af_bella | 🚺🔥 | A | HH hours | A- | 8cb64e02 |
| af_jessica | 🚺 | C | MM minutes | D | cdfdccb8 |
| af_kore | 🚺 | B | H hours | C+ | 8bfbc512 |
| af_nicole | 🚺🎧 | B | HH hours | B- | c5561808 |
| af_nova | 🚺 | B | MM minutes | C | e0233676 |
| af_river | 🚺 | C | MM minutes | D | e149459b |
| af_sarah | 🚺 | B | H hours | C+ | 49bd364e |
| af_sky | 🚺 | B | M minutes 🤏 | C- | c799548a |
| am_adam | 🚹 | D | H hours | F+ | ced7e284 |
| am_echo | 🚹 | C | MM minutes | D | 8bcfdc85 |
| am_eric | 🚹 | C | MM minutes | D | ada66f0e |
| am_fenrir | 🚹 | B | H hours | C+ | 98e507ec |
| am_liam | 🚹 | C | MM minutes | D | c8255075 |
| am_michael | 🚹 | B | H hours | C+ | 9a443b79 |
| am_onyx | 🚹 | C | MM minutes | D | e8452be1 |
| am_puck | 🚹 | B | H hours | C+ | dd1d8973 |
| am_santa | 🚹 | C | M minutes 🤏 | D- | 7f2f7582 |

More VOICE details (from hexgrad, the maintainers of Kokoro on HF)

For each voice, the given grades are intended to be estimates of the quality and quantity of its associated training data, both of which impact overall inference quality.

Subjectively, voices will sound better or worse to different people.

Support for non-English languages may be absent or thin due to weak G2P and/or lack of training data. Some languages are only represented by a small handful or even just one voice (French).

Most voices perform best on a "goldilocks range" of 100-200 tokens out of ~500 possible. Voices may perform worse at the extremes:

  • Weakness on short utterances, especially less than 10-20 tokens. Root cause could be lack of short-utterance training data and/or model architecture. One possible inference mitigation is to bundle shorter utterances together.
  • Rushing on long utterances, especially over 400 tokens. You can chunk down to shorter utterances or adjust the speed parameter to mitigate this.
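
A hedged sketch of the chunking mitigation, reusing the sentence split pattern mentioned in the comparison table above (character length is used here as a rough stand-in for the 100-200 token "goldilocks" range; the limit is illustrative):

import re

def chunk_sentences(text: str, max_chars: int = 300) -> list[str]:
    # split on sentence boundaries, then re-bundle short sentences together
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks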

Target Quality

  • How high quality is the reference voice? This grade may be impacted by audio quality, artifacts, compression, & sample rate.
  • How well do the text labels match the audio? Text/audio misalignment (e.g. from hallucinations) will lower this grade.

Training Duration

  • How much audio was seen during training? Smaller durations result in a lower overall grade.
  • 10 hours <= HH hours < 100 hours
  • 1 hour <= H hours < 10 hours
  • 10 minutes <= MM minutes < 100 minutes
  • 1 minute <= M minutes 🤏 < 10 minutes

American English

  • lang_code='a' in misaki[en]
  • espeak-ng en-us fallback
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| af_heart | 🚺❤️ | | | A | 0ab5709b |
| af_alloy | 🚺 | B | MM minutes | C | 6d877149 |
| af_aoede | 🚺 | B | H hours | C+ | c03bd1a4 |
| af_bella | 🚺🔥 | A | HH hours | A- | 8cb64e02 |
| af_jessica | 🚺 | C | MM minutes | D | cdfdccb8 |
| af_kore | 🚺 | B | H hours | C+ | 8bfbc512 |
| af_nicole | 🚺🎧 | B | HH hours | B- | c5561808 |
| af_nova | 🚺 | B | MM minutes | C | e0233676 |
| af_river | 🚺 | C | MM minutes | D | e149459b |
| af_sarah | 🚺 | B | H hours | C+ | 49bd364e |
| af_sky | 🚺 | B | M minutes 🤏 | C- | c799548a |
| am_adam | 🚹 | D | H hours | F+ | ced7e284 |
| am_echo | 🚹 | C | MM minutes | D | 8bcfdc85 |
| am_eric | 🚹 | C | MM minutes | D | ada66f0e |
| am_fenrir | 🚹 | B | H hours | C+ | 98e507ec |
| am_liam | 🚹 | C | MM minutes | D | c8255075 |
| am_michael | 🚹 | B | H hours | C+ | 9a443b79 |
| am_onyx | 🚹 | C | MM minutes | D | e8452be1 |
| am_puck | 🚹 | B | H hours | C+ | dd1d8973 |
| am_santa | 🚹 | C | M minutes 🤏 | D- | 7f2f7582 |

British English

  • lang_code='b' in misaki[en]
  • espeak-ng en-gb fallback
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| bf_alice | 🚺 | C | MM minutes | D | d292651b |
| bf_emma | 🚺 | B | HH hours | B- | d0a423de |
| bf_isabella | 🚺 | B | MM minutes | C | cdd4c370 |
| bf_lily | 🚺 | C | MM minutes | D | 6e09c2e4 |
| bm_daniel | 🚹 | C | MM minutes | D | fc3fce4e |
| bm_fable | 🚹 | B | MM minutes | C | d44935f3 |
| bm_george | 🚹 | B | MM minutes | C | f1bc8122 |
| bm_lewis | 🚹 | C | H hours | D+ | b5204750 |

Japanese

  • lang_code='j' in misaki[ja]
  • Total Japanese training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| jf_alpha | 🚺 | B | H hours | C+ | 1bf4c9dc | |
| jf_gongitsune | 🚺 | B | MM minutes | C | 1b171917 | gongitsune |
| jf_nezumi | 🚺 | B | M minutes 🤏 | C- | d83f007a | nezuminoyomeiri |
| jf_tebukuro | 🚺 | B | MM minutes | C | 0d691790 | tebukurowokaini |
| jm_kumo | 🚹 | B | M minutes 🤏 | C- | 98340afd | kumonoito |

Mandarin Chinese

  • lang_code='z' in misaki[zh]
  • Total Mandarin Chinese training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| zf_xiaobei | 🚺 | C | MM minutes | D | 9b76be63 |
| zf_xiaoni | 🚺 | C | MM minutes | D | 95b49f16 |
| zf_xiaoxiao | 🚺 | C | MM minutes | D | cfaf6f2d |
| zf_xiaoyi | 🚺 | C | MM minutes | D | b5235dba |
| zm_yunjian | 🚹 | C | MM minutes | D | 76cbf8ba |
| zm_yunxi | 🚹 | C | MM minutes | D | dbe6e1ce |
| zm_yunxia | 🚹 | C | MM minutes | D | bb2b03b0 |
| zm_yunyang | 🚹 | C | MM minutes | D | 5238ac22 |

Spanish

| Name | Traits | SHA256 |
| --- | --- | --- |
| ef_dora | 🚺 | d9d69b0f |
| em_alex | 🚹 | 5eac53f7 |
| em_santa | 🚹 | aa8620cb |

French

  • lang_code='f' in misaki[en]
  • espeak-ng fr-fr
  • Total French training data: <11 hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| ff_siwis | 🚺 | B | <11 hours | B- | 8073bf2d | SIWIS |

Hindi

  • lang_code='h' in misaki[en]
  • espeak-ng hi
  • Total Hindi training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| hf_alpha | 🚺 | B | MM minutes | C | 06906fe0 |
| hf_beta | 🚺 | B | MM minutes | C | 63c0a1a6 |
| hm_omega | 🚹 | B | MM minutes | C | b55f02a8 |
| hm_psi | 🚹 | B | MM minutes | C | 2f0f055c |

Italian

  • lang_code='i' in misaki[en]
  • espeak-ng it
  • Total Italian training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| if_sara | 🚺 | B | MM minutes | C | 6c0b253b |
| im_nicola | 🚹 | B | MM minutes | C | 234ed066 |

Brazilian Portuguese

| Name | Traits | SHA256 |
| --- | --- | --- |
| pf_dora | 🚺 | 07e4ff98 |
| pm_alex | 🚹 | cf0ba8c5 |
| pm_santa | 🚹 | d4210316 |

Training

Why on earth would you wanna do that ... 🤔?

| Training Costs | v0.19 | v1.0 | Total |
| --- | --- | --- | --- |
| In A100 80GB GPU hours | 500 | 500 | 1000 |
| Average hourly rate | $0.80/h | $1.20/h | $1/h |
| In USD | $400 | $600 | $1000 |