---
title: BARK Text to Audio with Batch Inference
emoji: 🪄
colorFrom: purple
colorTo: pink
sdk: gradio
python_version: "3.10.13"
sdk_version: "5.23.3"
suggested_hardware: cpu-upgrade
suggested_storage: small
app_file: app.py
short_description: Generate natural sounding speech audio from text
pinned: true
startup_duration_timeout: 45m
tags:
  - text-to-audio
  - gradio
  - bark
preload_from_hub:
  - suno/bark
---

# Generate Audio from Text and Clone Voices with BARK

Generate natural-sounding speech from text and clone any voice (cloning is approximate, not perfect).

![Screenshot Placeholder](./assets/images/screenshot.png)

The code was developed on Python 3.12 and may also work on other versions. Example generated audio files are in the `/assets/audio` folder.

## Features

- **Text-to-Audio Generation:** Generate speech from text using the BARK model (supports the 'small' and 'large' variants).
- **Parameter Control:** Adjust the semantic, coarse, and fine temperature settings to control generation diversity, and set a generation seed for reproducibility.
- **Device Selection:** Run inference on any available device (CPU, CUDA, MPS).
- **Standard Voice Prompts:** Use the built-in BARK voice prompts (`.npz` files) located in the `bark_prompts` directory.
- **Custom Voice Prompt Creation (Voice Cloning):**
  - Upload your own audio file (`.wav`, `.mp3`).
  - Generate a BARK-compatible semantic prompt (`.npz` file) using a custom-trained HuBERT model.
  - The generated prompt appears in the "Select Voice Prompt" dropdown for immediate use.
- **Audio Management:** View, play, and delete generated audio files directly within the interface.
- **Training Scripts:** Includes scripts to generate the required dataset (`generate_audio_semantic_dataset.py`) and train the custom HuBERT model (`train_hubert.py`).

## Custom Voice Cloning Model

Custom voice prompt generation relies on a fine-tuned HuBERT model.

- **Model:** `sleeper371/hubert-for-bark-semantic` on Hugging Face ([Link](https://huggingface.co/sleeper371/hubert-for-bark-semantic))
- **Architecture:** A HuBERT base feature extractor followed by a Transformer decoder head.
- **Training:** Trained on over 4,700 sentence pairs that map audio waveforms to the semantic tokens produced by BARK's semantic model, using a cross-entropy loss objective.
- **Dataset:** The training dataset is available at `sleeper371/bark-wave-semantic` on Hugging Face ([Link](https://huggingface.co/datasets/sleeper371/bark-wave-semantic)).
- **Comparison:** The approach is inspired by projects such as [gitmylo/bark-data-gen](https://github.com/gitmylo/bark-data-gen), but differs in the head architecture: that project used an LSTM head, whereas this model uses a Transformer decoder head.

## Setup and Installation

Follow these steps to set up the environment and run the application.

1. **Clone the Repository**

2. **Create a Virtual Environment:** Using a virtual environment to manage dependencies is highly recommended.

   ```bash
   # For Linux/macOS
   python3 -m venv venv
   source venv/bin/activate

   # For Windows
   python -m venv venv
   .\venv\Scripts\activate
   ```

3. **Install Requirements:** Make sure you have a `requirements.txt` file in the repository root containing all necessary packages (e.g., `gradio`, `torch`, `transformers`, `soundfile`), then install them (an optional sanity check follows these steps):

   ```bash
   pip install -r requirements.txt
   ```
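After installing the requirements, you can optionally sanity-check BARK from Python before launching the interface. The snippet below is a minimal sketch using the standalone `bark` package from Suno and assumes it is among the installed requirements; `app.py` may load and call the models differently, and the separate semantic/coarse/fine temperature controls are exposed only through the app.

```python
# Minimal sketch: generate speech with the standalone `bark` package.
# Assumes `bark` (suno-ai/bark) and `scipy` are installed; the Gradio app
# may wire up model loading and generation parameters differently.
import numpy as np
import torch
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# Optional: seed torch/numpy, analogous to the app's generation-seed option.
torch.manual_seed(42)
np.random.seed(42)

# Download / cache the BARK checkpoints on first use.
preload_models()

# `history_prompt` selects a built-in voice prompt; `text_temp` and
# `waveform_temp` are the high-level temperature knobs of this API.
audio = generate_audio(
    "Hello, this is a quick BARK smoke test.",
    history_prompt="v2/en_speaker_6",
    text_temp=0.7,
    waveform_temp=0.7,
)

write_wav("bark_test.wav", SAMPLE_RATE, audio)
```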
## Running the Application

Once the setup is complete, run the Gradio application:

```bash
python app.py
```

This launches the Gradio interface, typically accessible at http://127.0.0.1:7860 in your web browser. The console output shows the exact URL.

## Training Your Own Custom HuBERT Model

If you want to train your own HuBERT model for voice cloning:

1. **Generate the dataset:**
   - Use the `generate_audio_semantic_dataset.py` script.
2. **Train the model:**
   - Use the `train_hubert.py` script.
   - This script takes the generated dataset (audio paths and semantic token paths) and fine-tunes a HuBERT model with a Transformer decoder head.
   - Configure training parameters (batch size, learning rate, epochs, output directory) within the script or via command-line arguments (if implemented).

For an illustrative sketch of the model shape described above, see the appendix at the end of this README.

## License

MIT

## Acknowledgements

- Suno AI, for training and releasing the BARK models
- gitmylo, whose work inspired using HuBERT to predict semantic tokens from audio
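## Appendix: Illustrative Model Sketch

The following is a minimal, illustrative PyTorch sketch of the idea described in the "Custom Voice Cloning Model" and training sections: a HuBERT feature extractor followed by a Transformer decoder head, trained with cross-entropy on audio-to-semantic-token pairs. It is not the code used in this repository; the checkpoint name `facebook/hubert-base-ls960`, the vocabulary size, layer counts, and the training step shown are assumptions for illustration. See `train_hubert.py` and the `sleeper371/hubert-for-bark-semantic` model card for the actual implementation.

```python
# Illustrative only: a rough sketch of "HuBERT encoder + Transformer decoder
# head" trained with cross-entropy. The real model in this repository may
# differ in checkpoint, sizes, tokenization, and training details.
import torch
import torch.nn as nn
from transformers import HubertModel

SEMANTIC_VOCAB_SIZE = 10_000  # assumed size of BARK's semantic vocabulary
HIDDEN = 768                  # hidden size of hubert-base


class HubertToSemantic(nn.Module):
    def __init__(self, num_decoder_layers: int = 4):
        super().__init__()
        # HuBERT encoder turns raw 16 kHz audio into frame-level features.
        self.encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")
        self.token_emb = nn.Embedding(SEMANTIC_VOCAB_SIZE, HIDDEN)
        layer = nn.TransformerDecoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_decoder_layers)
        self.lm_head = nn.Linear(HIDDEN, SEMANTIC_VOCAB_SIZE)

    def forward(self, waveform: torch.Tensor, semantic_tokens: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples); semantic_tokens: (batch, seq_len)
        memory = self.encoder(waveform).last_hidden_state        # (batch, frames, HIDDEN)
        tgt = self.token_emb(semantic_tokens)                    # (batch, seq_len, HIDDEN)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                              # (batch, seq_len, vocab)


# Teacher-forced cross-entropy training step (next-token prediction).
model = HubertToSemantic()
wave = torch.randn(2, 16_000)                       # 1 second of dummy audio per example
tokens = torch.randint(0, SEMANTIC_VOCAB_SIZE, (2, 50))
logits = model(wave, tokens[:, :-1])                # predict token t from tokens < t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, SEMANTIC_VOCAB_SIZE), tokens[:, 1:].reshape(-1)
)
loss.backward()
```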