# Dia TTS Server - Technical Documentation
**Version:** 1.0.0
**Date:** 2025-04-22
**Table of Contents:**
1. [Overview](#1-overview)
2. [Visual Overview](#2-visual-overview)
* [Directory Structure](#21-directory-structure)
* [Component Diagram](#22-component-diagram)
3. [System Prerequisites](#3-system-prerequisites)
4. [Installation and Setup](#4-installation-and-setup)
* [Cloning the Repository](#41-cloning-the-repository)
* [Setting up Python Virtual Environment](#42-setting-up-python-virtual-environment)
* [Windows Setup](#421-windows-setup)
* [Linux Setup (Debian/Ubuntu Example)](#422-linux-setup-debianubuntu-example)
* [Installing Dependencies](#43-installing-dependencies)
* [NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)
* [Step 1: Check/Install NVIDIA Drivers](#441-step-1-checkinstall-nvidia-drivers)
* [Step 2: Install PyTorch with CUDA Support](#442-step-2-install-pytorch-with-cuda-support)
* [Step 3: Verify PyTorch CUDA Installation](#443-step-3-verify-pytorch-cuda-installation)
5. [Configuration](#5-configuration)
* [Configuration Files (`.env` and `config.py`)](#51-configuration-files-env-and-configpy)
* [Configuration Parameters](#52-configuration-parameters)
6. [Running the Server](#6-running-the-server)
7. [Usage](#7-usage)
* [Web User Interface (Web UI)](#71-web-user-interface-web-ui)
* [Main Generation Form](#711-main-generation-form)
* [Presets](#712-presets)
* [Voice Cloning](#713-voice-cloning)
* [Generation Parameters](#714-generation-parameters)
* [Server Configuration (UI)](#715-server-configuration-ui)
* [Generated Audio Player](#716-generated-audio-player)
* [Theme Toggle](#717-theme-toggle)
* [API Endpoints](#72-api-endpoints)
* [POST /v1/audio/speech (OpenAI Compatible)](#721-post-v1audiospeech-openai-compatible)
* [POST /tts (Custom Parameters)](#722-post-tts-custom-parameters)
* [Configuration & Helper Endpoints](#723-configuration--helper-endpoints)
8. [Troubleshooting](#8-troubleshooting)
9. [Project Architecture](#9-project-architecture)
10. [License and Disclaimer](#10-license-and-disclaimer)
---
## 1. Overview
The Dia TTS Server provides a backend service and web interface for generating high-fidelity speech, including dialogue with multiple speakers and non-verbal sounds, using the Dia text-to-speech model family (originally from Nari Labs, with support for community conversions like SafeTensors).
This server is built using the FastAPI framework and offers both a RESTful API (including an OpenAI-compatible endpoint) and an interactive web UI powered by Jinja2, Tailwind CSS, and JavaScript. It supports voice cloning via audio prompts and allows configuration of various generation parameters.
**Key Features:**
* **High-Quality TTS:** Leverages the Dia model for realistic speech synthesis.
* **Dialogue Generation:** Supports `[S1]` and `[S2]` tags for multi-speaker dialogue.
* **Non-Verbal Sounds:** Can generate sounds like `(laughs)`, `(sighs)`, etc., when included in the text.
* **Voice Cloning:** Allows conditioning the output voice on a provided reference audio file.
* **Flexible Model Loading:** Supports loading models from Hugging Face repositories, including both `.pth` and `.safetensors` formats (defaults to BF16 SafeTensors for efficiency).
* **API Access:** Provides a custom API endpoint (`/tts`) and an OpenAI-compatible endpoint (`/v1/audio/speech`).
* **Web Interface:** Offers an easy-to-use UI for text input, parameter adjustment, preset loading, reference audio management, and audio playback.
* **Configuration:** Server settings, model sources, paths, and default generation parameters are configurable via an `.env` file.
* **GPU Acceleration:** Utilizes NVIDIA GPUs via CUDA for significantly faster inference when available, falling back to CPU otherwise.
---
## 2. Visual Overview
### 2.1 Directory Structure
```
dia-tts-server/
β”‚
β”œβ”€β”€ .env # Local configuration overrides (user-created)
β”œβ”€β”€ config.py # Default configuration and management class
β”œβ”€β”€ engine.py # Core model loading and generation logic
β”œβ”€β”€ models.py # Pydantic models for API requests
β”œβ”€β”€ requirements.txt # Python dependencies
β”œβ”€β”€ server.py # Main FastAPI application, API endpoints, UI routes
β”œβ”€β”€ utils.py # Utility functions (audio encoding, saving, etc.)
β”‚
β”œβ”€β”€ dia/ # Core Dia model implementation package
β”‚ β”œβ”€β”€ __init__.py
β”‚ β”œβ”€β”€ audio.py # Audio processing helpers (delay, codebook conversion)
β”‚ β”œβ”€β”€ config.py # Pydantic models for Dia model architecture config
β”‚ β”œβ”€β”€ layers.py # Custom PyTorch layers for the Dia model
β”‚ └── model.py # Dia model class wrapper (loading, generation)
β”‚
β”œβ”€β”€ static/ # Static assets (e.g., favicon.ico)
β”‚ └── favicon.ico
β”‚
β”œβ”€β”€ ui/ # Web User Interface files
β”‚ β”œβ”€β”€ index.html # Main HTML template (Jinja2)
β”‚ β”œβ”€β”€ presets.yaml # Predefined UI examples
β”‚ β”œβ”€β”€ script.js # Frontend JavaScript logic
β”‚ └── style.css # Frontend CSS styling (Tailwind via CDN/build)
β”‚
β”œβ”€β”€ model_cache/ # Default directory for downloaded model files (configurable)
β”œβ”€β”€ outputs/ # Default directory for saved audio output (configurable)
└── reference_audio/ # Default directory for voice cloning reference files (configurable)
```
### 2.2 Component Diagram
```
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ User (Web UI / │────→ β”‚ FastAPI Server │────→ β”‚ TTS Engine │────→ β”‚ Dia Model Wrapper β”‚
β”‚ API Client) β”‚ β”‚ (server.py) β”‚ β”‚ (engine.py) β”‚ β”‚ (dia/model.py) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ β”‚
β”‚ Uses β”‚ Uses β”‚ Uses
β–Ό β–Ό β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Configuration β”‚ ←─── β”‚ .env File β”‚ β”‚ Dia Model Layers β”‚
β”‚ (config.py) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ (dia/layers.py) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ β”‚ Uses
β”‚ Uses β”‚
β–Ό β”‚
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Uses
β”‚ Utilities β”‚ β–Ό
β”‚ (utils.py) β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ PyTorch / CUDA β”‚
β–² β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ Uses β”‚ Uses
β”‚ β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Web UI Files β”‚ ←─── β”‚ Jinja2 Templates β”‚ β”‚ DAC Model β”‚
β”‚ (ui/) β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ (descript-audio..)β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β–² β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚ Renders β–²
β”‚ β”‚ Uses
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
```
**Diagram Legend:**
* Boxes represent major components or file groups.
* Arrows (`β†’`) indicate primary data flow or control flow.
* Lines with "Uses" indicate dependencies or function calls.
---
## 3. System Prerequisites
Before installing and running the Dia TTS Server, ensure your system meets the following requirements:
* **Operating System:**
* Windows 10/11 (64-bit)
* Linux (Debian/Ubuntu recommended, other distributions may require adjustments)
* **Python:** Python 3.10 or later (3.10.x recommended). Ensure Python and Pip are added to your system's PATH.
* **Version Control:** Git (for cloning the repository).
* **Internet Connection:** Required for downloading dependencies and model files.
* **(Optional but Highly Recommended for Performance):**
* **NVIDIA GPU:** A CUDA-compatible NVIDIA GPU (Maxwell architecture or newer). Check compatibility [here](https://developer.nvidia.com/cuda-gpus). Sufficient VRAM is needed (BF16 model requires ~5-6GB, full precision ~10GB).
* **NVIDIA Drivers:** Latest appropriate drivers for your GPU and OS.
* **CUDA Toolkit:** Version compatible with the chosen PyTorch build (e.g., 11.8, 12.1). See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration).
* **(Linux System Libraries):**
* `libsndfile1`: Required by the `soundfile` Python library for audio I/O. Install using your package manager (e.g., `sudo apt install libsndfile1` on Debian/Ubuntu).
---
## 4. Installation and Setup
Follow these steps to set up the project environment and install necessary dependencies.
### 4.1 Cloning the Repository
Open your terminal or command prompt and navigate to the directory where you want to store the project. Then, clone the repository:
```bash
git clone https://github.com/devnen/dia-tts-server.git # Replace with the actual repo URL if different
cd dia-tts-server
```
### 4.2 Setting up Python Virtual Environment
Using a virtual environment is strongly recommended to isolate project dependencies.
#### 4.2.1 Windows Setup
1. **Open PowerShell or Command Prompt** in the project directory (`dia-tts-server`).
2. **Create the virtual environment:**
```powershell
python -m venv venv
```
3. **Activate the virtual environment:**
```powershell
.\venv\Scripts\activate
```
Your terminal prompt should now be prefixed with `(venv)`.
#### 4.2.2 Linux Setup (Debian/Ubuntu Example)
1. **Install prerequisites (if not already present):**
```bash
sudo apt update
sudo apt install python3 python3-venv python3-pip libsndfile1 -y
```
2. **Open your terminal** in the project directory (`dia-tts-server`).
3. **Create the virtual environment:**
```bash
python3 -m venv venv
```
4. **Activate the virtual environment:**
```bash
source venv/bin/activate
```
Your terminal prompt should now be prefixed with `(venv)`.
### 4.3 Installing Dependencies
With your virtual environment activated (`(venv)` prefix visible), install the required Python packages:
```bash
# Upgrade pip first (optional but good practice)
pip install --upgrade pip
# Install all dependencies from requirements.txt
pip install -r requirements.txt
```
**Note:** This command installs the CPU-only version of PyTorch by default. If you have a compatible NVIDIA GPU and want acceleration, proceed to [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration) **before** running the server.
### 4.4 NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)
Follow these steps **only if you have a compatible NVIDIA GPU** and want faster inference.
#### 4.4.1 Step 1: Check/Install NVIDIA Drivers
1. **Check Existing Driver:** Open Command Prompt (Windows) or Terminal (Linux) and run:
```bash
nvidia-smi
```
2. **Interpret Output:**
* If the command runs successfully, note the **Driver Version** and the **CUDA Version** listed in the top right corner. This CUDA version is the *maximum* supported by your current driver.
* If the command fails ("not recognized"), you need to install or update your NVIDIA drivers.
3. **Install/Update Drivers:** Go to the [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx) page. Select your GPU model and OS, then download and install the latest recommended driver (Game Ready or Studio). **Reboot your computer** after installation. Run `nvidia-smi` again to confirm it works.
#### 4.4.2 Step 2: Install PyTorch with CUDA Support
1. **Go to PyTorch Website:** Visit [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
2. **Configure:** Select:
* **PyTorch Build:** Stable
* **Your OS:** Windows or Linux
* **Package:** Pip
* **Language:** Python
* **Compute Platform:** Choose the CUDA version **equal to or lower than** the version reported by `nvidia-smi`. For example, if `nvidia-smi` shows `CUDA Version: 12.4`, select `CUDA 12.1`. If it shows `11.8`, select `CUDA 11.8`. **Do not select a version higher than your driver supports.** (CUDA 12.1 or 11.8 are common stable choices).
3. **Copy Command:** Copy the generated installation command. It will look similar to:
```bash
# Example for CUDA 12.1 (Windows/Linux):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Example for CUDA 11.8 (Windows/Linux):
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
*(Use `pip` instead of `pip3` if that's your command)*
4. **Install in Activated venv:**
* Ensure your `(venv)` is active.
* **Uninstall CPU PyTorch first:**
```bash
pip uninstall torch torchvision torchaudio -y
```
* **Paste and run the copied command** from the PyTorch website.
#### 4.4.3 Step 3: Verify PyTorch CUDA Installation
1. With the `(venv)` still active, start a Python interpreter:
```bash
python
```
2. Run the following Python code:
```python
import torch

print(f"PyTorch version: {torch.__version__}")
cuda_available = torch.cuda.is_available()
print(f"CUDA available: {cuda_available}")
if cuda_available:
    print(f"CUDA version used by PyTorch: {torch.version.cuda}")
    print(f"Device count: {torch.cuda.device_count()}")
    print(f"Current device index: {torch.cuda.current_device()}")
    print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
else:
    print("CUDA not available to PyTorch. Ensure drivers and CUDA-enabled PyTorch are installed correctly.")
exit()
```
3. If `CUDA available:` shows `True`, the setup was successful. If `False`, review driver installation and the PyTorch installation command.
---
## 5. Configuration
The server's behavior, including model selection, paths, and default generation parameters, is controlled via configuration settings.
### 5.1 Configuration Files (`.env` and `config.py`)
* **`config.py`:** Defines the *default* values for all configuration parameters in the `DEFAULT_CONFIG` dictionary. It also contains the `ConfigManager` class and getter functions used by the application.
* **`.env` File:** This file, located in the project root directory (`dia-tts-server/.env`), allows you to *override* the default values. Create this file if it doesn't exist. Settings are defined as `KEY=VALUE` pairs, one per line. The server reads this file on startup using `python-dotenv`.
**Priority:** Values set in the `.env` file take precedence over the defaults in `config.py`. Environment variables set directly in your system also override `.env` file values (though using `.env` is generally recommended for project-specific settings).
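To illustrate this precedence, here is a minimal sketch of the pattern; the helper and defaults below are hypothetical stand-ins, not the actual code in `config.py`:
```python
import os
from dotenv import load_dotenv

# Hypothetical defaults mirroring the role of DEFAULT_CONFIG in config.py
DEFAULT_CONFIG = {"HOST": "0.0.0.0", "PORT": "8003"}

load_dotenv()  # loads .env from the project root, if present

def get_setting(key: str) -> str:
    # os.environ now holds both real environment variables and values
    # loaded from .env. By default, load_dotenv() does not overwrite
    # pre-existing environment variables, so system env vars win over
    # .env entries, and both win over the hard-coded defaults.
    return os.environ.get(key, DEFAULT_CONFIG.get(key, ""))

print(get_setting("PORT"))  # "8003" unless overridden in .env or the environment
```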
### 5.2 Configuration Parameters
The following parameters can be set in your `.env` file:
| Parameter Name (in `.env`) | Default Value (`config.py`) | Description | Example `.env` Value |
| :--------------------------------- | :--------------------------------- | :--------------------------------------------------------------------------------------------------------- | :----------------------------------- |
| **Server Settings** | | | |
| `HOST` | `0.0.0.0` | The network interface address the server listens on. `0.0.0.0` makes it accessible on your local network. | `127.0.0.1` (localhost only) |
| `PORT` | `8003` | The port number the server listens on. | `8080` |
| **Model Source Settings** | | | |
| `DIA_MODEL_REPO_ID` | `ttj/dia-1.6b-safetensors` | The Hugging Face repository ID containing the model files. | `nari-labs/Dia-1.6B` |
| `DIA_MODEL_CONFIG_FILENAME` | `config.json` | The filename of the model's configuration JSON within the repository. | `config.json` |
| `DIA_MODEL_WEIGHTS_FILENAME` | `dia-v0_1_bf16.safetensors` | The filename of the model weights file (`.safetensors` or `.pth`) within the repository to load. | `dia-v0_1.safetensors` or `dia-v0_1.pth` |
| **Path Settings** | | | |
| `DIA_MODEL_CACHE_PATH` | `./model_cache` | Local directory to store downloaded model files. Relative paths are based on the project root. | `/path/to/shared/cache` |
| `REFERENCE_AUDIO_PATH` | `./reference_audio` | Local directory to store reference audio files (`.wav`, `.mp3`) used for voice cloning. | `./voices` |
| `OUTPUT_PATH` | `./outputs` | Local directory where generated audio files from the Web UI are saved. | `./generated_speech` |
| **Default Generation Parameters** | | *(These set the initial UI values and can be saved via the UI)* | |
| `GEN_DEFAULT_SPEED_FACTOR` | `0.90` | Default playback speed factor applied *after* generation (UI slider initial value). | `1.0` |
| `GEN_DEFAULT_CFG_SCALE` | `3.0` | Default Classifier-Free Guidance scale (UI slider initial value). | `2.5` |
| `GEN_DEFAULT_TEMPERATURE` | `1.3` | Default sampling temperature (UI slider initial value). | `1.2` |
| `GEN_DEFAULT_TOP_P` | `0.95` | Default nucleus sampling probability (UI slider initial value). | `0.9` |
| `GEN_DEFAULT_CFG_FILTER_TOP_K` | `35` | Default Top-K value for CFG filtering (UI slider initial value). | `40` |
**Example `.env` File (Using Original Nari Labs Model):**
```dotenv
# .env
# Example configuration to use the original Nari Labs model
HOST=0.0.0.0
PORT=8003
DIA_MODEL_REPO_ID=nari-labs/Dia-1.6B
DIA_MODEL_CONFIG_FILENAME=config.json
DIA_MODEL_WEIGHTS_FILENAME=dia-v0_1.pth
# Keep other paths as default or specify custom ones
# DIA_MODEL_CACHE_PATH=./model_cache
# REFERENCE_AUDIO_PATH=./reference_audio
# OUTPUT_PATH=./outputs
# Keep default generation parameters or override them
# GEN_DEFAULT_SPEED_FACTOR=0.90
# GEN_DEFAULT_CFG_SCALE=3.0
# GEN_DEFAULT_TEMPERATURE=1.3
# GEN_DEFAULT_TOP_P=0.95
# GEN_DEFAULT_CFG_FILTER_TOP_K=35
```
**Important:** You must **restart the server** after making changes to the `.env` file for them to take effect.
---
## 6. Running the Server
1. **Activate Virtual Environment:** Ensure your virtual environment is activated (`(venv)` prefix).
* Windows: `.\venv\Scripts\activate`
* Linux: `source venv/bin/activate`
2. **Navigate to Project Root:** Make sure your terminal is in the `dia-tts-server` directory.
3. **Run the Server:**
```bash
python server.py
```
4. **Server Output:** You should see log messages indicating the server is starting, including:
* The configuration being used (repo ID, filenames, paths).
* The device being used (CPU or CUDA).
* Model loading progress (downloading if necessary).
* Confirmation that the server is running (e.g., `Uvicorn running on http://0.0.0.0:8003`).
* URLs for accessing the Web UI and API Docs.
5. **Accessing the Server:**
* **Web UI:** Open your web browser and go to `http://localhost:PORT` (e.g., `http://localhost:8003` if using the default port). If running on a different machine or VM, replace `localhost` with the server's IP address.
* **API Docs:** Access the interactive API documentation (Swagger UI) at `http://localhost:PORT/docs`.
6. **Stopping the Server:** Press `CTRL+C` in the terminal where the server is running.
**Auto-Reload:** The server is configured to run with `reload=True`. This means Uvicorn will automatically restart the server if it detects changes in `.py`, `.html`, `.css`, `.js`, `.env`, or `.yaml` files within the project or `ui` directory. This is useful for development but should generally be disabled in production.
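For reference, a representative sketch of such a Uvicorn launch is shown below; the actual invocation in `server.py` may differ, and the `reload_includes` patterns (which require the `watchfiles` package) are illustrative:
```python
import uvicorn

if __name__ == "__main__":
    # reload=True restarts the worker when watched files change.
    # reload_includes extends the default *.py watch list so edits to
    # UI and config files also trigger a restart. Disable in production.
    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8003,
        reload=True,
        reload_includes=["*.html", "*.css", "*.js", "*.yaml", ".env"],
    )
```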
---
## 7. Usage
The Dia TTS Server can be used via its Web UI or its API endpoints.
### 7.1 Web User Interface (Web UI)
Access the UI by navigating to the server's base URL (e.g., `http://localhost:8003`).
#### 7.1.1 Main Generation Form
* **Text to speak:** Enter the text you want to synthesize.
* Use `[S1]` and `[S2]` tags to indicate speaker turns for dialogue.
* Include non-verbal cues like `(laughs)`, `(sighs)`, `(clears throat)` directly in the text where desired.
* For voice cloning, **prepend the exact transcript** of the selected reference audio before the text you want generated (e.g., `[S1] Reference transcript text. [S1] This is the new text to generate in the cloned voice.`).
* **Voice Mode:** Select the desired generation mode:
* **Single / Dialogue (Use [S1]/[S2]):** Use this for single-speaker text (you can use `[S1]` or omit tags if the model handles it) or multi-speaker dialogue (using `[S1]` and `[S2]`).
* **Voice Clone (from Reference):** Enables voice cloning based on a selected audio file. Requires selecting a file below and prepending its transcript to the text input.
* **Generate Speech Button:** Submits the text and settings to the server to start generation.
#### 7.1.2 Presets
* Located below the Voice Mode selection.
* Clicking a preset button (e.g., "Standard Dialogue", "Expressive Narration") will automatically populate the "Text to speak" area and the "Generation Parameters" sliders with predefined values, demonstrating different use cases.
#### 7.1.3 Voice Cloning
* This section appears only when "Voice Clone" mode is selected.
* **Reference Audio File Dropdown:** Lists available `.wav` and `.mp3` files found in the configured `REFERENCE_AUDIO_PATH`. Select the file whose voice you want to clone. Remember to prepend its transcript to the main text input.
* **Load Button:** Click this to open your system's file browser. You can select one or more `.wav` or `.mp3` files to upload. The selected files will be copied to the server's `REFERENCE_AUDIO_PATH`, and the dropdown list will refresh automatically. The first newly uploaded file will be selected in the dropdown.
#### 7.1.4 Generation Parameters
* Expand this section to fine-tune the generation process. These values correspond to the parameters used by the underlying Dia model.
* **Sliders:** Adjust Speed Factor, CFG Scale, Temperature, Top P, and CFG Filter Top K. The current value is displayed next to the label.
* **Save Generation Defaults Button:** Saves the *current* values of these sliders to the `.env` file (as `GEN_DEFAULT_...` keys). These saved values will become the default settings loaded into the UI the next time the server starts.
#### 7.1.5 Server Configuration (UI)
* Expand this section to view and modify server-level settings stored in the `.env` file.
* **Fields:** Edit Model Repo ID, Config/Weights Filenames, Cache/Reference/Output Paths, Host, and Port.
* **Save Server Configuration Button:** Saves the values currently shown in these fields to the `.env` file. **A server restart is required** for most of these changes (especially model source or paths) to take effect.
* **Restart Server Button:** (Appears after saving) Attempts to trigger a server restart. This works best if the server was started with `reload=True` or is managed by a process manager like systemd or Supervisor.
#### 7.1.6 Generated Audio Player
* Appears below the main form after a successful generation.
* **Waveform:** Visual representation of the generated audio.
* **Play/Pause Button:** Controls audio playback.
* **Download WAV Button:** Downloads the generated audio as a `.wav` file.
* **Info:** Displays the voice mode used, generation time, and audio duration.
#### 7.1.7 Theme Toggle
* Located in the top-right navigation bar.
* Click the Sun/Moon icon to switch between Light and Dark themes. Your preference is saved in your browser's `localStorage`.
### 7.2 API Endpoints
Access the interactive API documentation via the `/docs` path (e.g., `http://localhost:8003/docs`).
#### 7.2.1 POST `/v1/audio/speech` (OpenAI Compatible)
* **Purpose:** Provides an endpoint compatible with the basic OpenAI TTS API for easier integration with existing tools.
* **Request Body:** (`application/json`) - Uses the `OpenAITTSRequest` model.
| Field | Type | Required | Description | Example |
| :---------------- | :----------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------- |
| `model` | string | No | Ignored by this server (always uses Dia). Included for compatibility. Defaults to `dia-1.6b`. | `"dia-1.6b"` |
| `input` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags for dialogue. For cloning, prepend reference transcript. | `"Hello [S1] world."` |
| `voice` | string | No | Maps to Dia modes. Use `"S1"`, `"S2"`, `"dialogue"`, or the filename of a reference audio (e.g., `"my_ref.wav"`) for cloning. Defaults to `S1`. | `"dialogue"` or `"ref.mp3"` |
| `response_format` | `"opus"` \| `"wav"` | No | Desired audio output format. Defaults to `opus`. | `"wav"` |
| `speed` | float | No | Playback speed factor (0.5-2.0). Applied *after* generation. Defaults to `1.0`. | `0.9` |
* **Response:**
* **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
* **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
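**Example Request (Python):** A minimal client sketch using the `requests` library; the host, port, input text, and output filename are illustrative:
```python
import requests

resp = requests.post(
    "http://localhost:8003/v1/audio/speech",
    json={
        "input": "[S1] Hello there. [S2] Hi! (laughs)",
        "voice": "dialogue",
        "response_format": "wav",
        "speed": 1.0,
    },
    timeout=300,  # generation can take a while, especially on CPU
)
resp.raise_for_status()

# The body is the binary audio stream; write it straight to disk.
with open("speech.wav", "wb") as f:
    f.write(resp.content)
```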
#### 7.2.2 POST `/tts` (Custom Parameters)
* **Purpose:** Allows generation using all specific Dia generation parameters.
* **Request Body:** (`application/json`) - Uses the `CustomTTSRequest` model.
| Field | Type | Required | Description | Default |
| :------------------------- | :------------------------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------- | :---------- |
| `text` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags. Prepend transcript for cloning. | |
| `voice_mode` | `"dialogue"` \| `"clone"` | No | Generation mode. Note: `single_s1`/`single_s2` are handled via `dialogue` mode with appropriate tags in the text. | `dialogue` |
| `clone_reference_filename` | string \| null | No | Filename of reference audio in `REFERENCE_AUDIO_PATH`. **Required if `voice_mode` is `clone`**. | `null` |
| `output_format` | `"opus"` \| `"wav"` | No | Desired audio output format. | `opus` |
| `max_tokens` | integer \| null | No | Maximum audio tokens to generate. `null` uses the model's default. | `null` |
| `cfg_scale` | float | No | Classifier-Free Guidance scale. | `3.0` |
| `temperature` | float | No | Sampling temperature. | `1.3` |
| `top_p` | float | No | Nucleus sampling probability. | `0.95` |
| `speed_factor` | float | No | Playback speed factor (0.5-2.0). Applied *after* generation. | `0.90` |
| `cfg_filter_top_k` | integer | No | Top-K value for CFG filtering. | `35` |
* **Response:**
* **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
* **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
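**Example Request (Python):** A sketch of a `/tts` call exercising the custom parameter set documented above; values shown are the documented defaults and the URL is illustrative:
```python
import requests

resp = requests.post(
    "http://localhost:8003/tts",
    json={
        "text": "[S1] This is a test of the custom endpoint.",
        "voice_mode": "dialogue",
        "output_format": "wav",
        "cfg_scale": 3.0,
        "temperature": 1.3,
        "top_p": 0.95,
        "speed_factor": 0.90,
        "cfg_filter_top_k": 35,
    },
    timeout=300,
)
resp.raise_for_status()

with open("tts_output.wav", "wb") as f:
    f.write(resp.content)
```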
#### 7.2.3 Configuration & Helper Endpoints
* **GET `/get_config`:** Returns the current server configuration as JSON.
* **POST `/save_config`:** Saves server configuration settings provided in the JSON request body to the `.env` file. Requires server restart.
* **POST `/save_generation_defaults`:** Saves default generation parameters provided in the JSON request body to the `.env` file. Affects UI defaults on next load.
* **POST `/restart_server`:** Attempts to trigger a server restart (reliability depends on execution environment).
* **POST `/upload_reference`:** Uploads one or more audio files (`.wav`, `.mp3`) as `multipart/form-data` to the reference audio directory. Returns JSON with status and updated file list.
* **GET `/health`:** Basic health check endpoint. Returns `{"status": "healthy", "model_loaded": true/false}`.
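**Example (Python):** A short sketch exercising the health check and reference upload. The multipart field name `files` is an assumption; confirm the exact request schema at `/docs`:
```python
import requests

BASE = "http://localhost:8003"

# Basic health check
print(requests.get(f"{BASE}/health", timeout=10).json())

# Upload a reference audio file as multipart/form-data.
# NOTE: the form field name "files" is assumed here, not confirmed.
with open("my_voice.wav", "rb") as f:
    resp = requests.post(
        f"{BASE}/upload_reference",
        files=[("files", ("my_voice.wav", f, "audio/wav"))],
        timeout=60,
    )
print(resp.json())  # status and updated reference file list
```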
---
## 8. Troubleshooting
* **Error: `CUDA available: False` or Slow Performance:**
* Verify NVIDIA drivers are installed correctly (`nvidia-smi` command).
* Ensure you installed the correct PyTorch version with CUDA support matching your driver (See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)). Reinstall PyTorch using the command from the official website if unsure.
* Check if another process is using all GPU VRAM.
* **Error: `ImportError: No module named 'dac'` (or `safetensors`, `yaml`, etc.):**
* Make sure your virtual environment is activated.
* Run `pip install -r requirements.txt` again to install missing dependencies.
* Specifically for `dac`, ensure you installed `descript-audio-codec` and not a different package named `dac`. Run `pip uninstall dac -y && pip install descript-audio-codec`.
* **Error: `libsndfile library not found` (or similar `soundfile` error, mainly on Linux):**
* Install the system library: `sudo apt update && sudo apt install libsndfile1` (Debian/Ubuntu) or the equivalent for your distribution.
* **Error: Model Download Fails (e.g., `HTTPError`, `ConnectionError`):**
* Check your internet connection.
* Verify the `DIA_MODEL_REPO_ID`, `DIA_MODEL_CONFIG_FILENAME`, and `DIA_MODEL_WEIGHTS_FILENAME` in your `.env` file (or defaults in `config.py`) are correct and accessible on Hugging Face Hub.
* Check Hugging Face Hub status if multiple downloads fail.
* Ensure the cache directory (`DIA_MODEL_CACHE_PATH`) is writable.
* **Error: `RuntimeError: Failed to load DAC model...`:**
* This usually indicates an issue with the `descript-audio-codec` installation or version incompatibility. Ensure it's installed correctly (see `ImportError` above).
* Check logs for specific `AttributeError` messages (like missing `utils` or `download`) which might indicate version mismatches between the Dia code's expectation and the installed library. The current code expects `dac.utils.download()`.
* **Error: `FileNotFoundError` during generation (Reference Audio):**
* Ensure the filename selected/provided for voice cloning exists in the configured `REFERENCE_AUDIO_PATH`.
* Check that the path in `config.py` or `.env` is correct and the server has permission to read from it.
* **Error: Cannot Save Output/Reference Files (`PermissionError`, etc.):**
* Ensure the directories specified by `OUTPUT_PATH` and `REFERENCE_AUDIO_PATH` exist and the server process has write permissions to them.
* **Web UI Issues (Buttons don't work, styles missing):**
* Clear your browser cache.
* Check the browser's developer console (usually F12) for JavaScript errors.
* Ensure `ui/script.js` and `ui/style.css` are being loaded correctly (check network tab in developer tools).
* **Generation Cancel Button Doesn't Stop Process:**
* This is expected ("Fake Cancel"). The button currently only prevents the UI from processing the result when it eventually arrives. True cancellation is complex and not implemented. Clicking "Generate" again *will* cancel the *previous UI request's result processing* before starting the new one.
---
## 9. Project Architecture
* **`server.py`:** The main entry point using FastAPI. Defines API routes, serves the Web UI using Jinja2, handles requests, and orchestrates calls to the engine.
* **`engine.py`:** Responsible for loading the Dia model (including downloading files via `huggingface_hub`), managing the model instance, preparing inputs for the model's `generate` method based on user requests (handling voice modes), and calling the model's generation function. Also handles post-processing such as speed adjustment (see the sketch after this list).
* **`config.py`:** Manages all configuration settings using default values and overrides from a `.env` file. Provides getter functions for easy access to settings.
* **`dia/` package:** Contains the core implementation of the Dia model itself.
* `model.py`: Defines the `Dia` class, which wraps the underlying PyTorch model (`DiaModel`). It handles loading weights (`.pth` or `.safetensors`), loading the required DAC model, preparing inputs specifically for the `DiaModel` forward pass (including CFG logic), and running the autoregressive generation loop.
* `config.py` (within `dia/`): Defines Pydantic models representing the *structure* and hyperparameters of the Dia model architecture (encoder, decoder, data parameters). This is loaded from the `config.json` file associated with the model weights.
* `layers.py`: Contains custom PyTorch `nn.Module` implementations used within the `DiaModel` (e.g., Attention blocks, MLP blocks, RoPE).
* `audio.py`: Includes helper functions for audio processing specific to the model's tokenization and delay patterns (e.g., `audio_to_codebook`, `codebook_to_audio`, `apply_audio_delay`).
* **`ui/` directory:** Contains all files related to the Web UI.
* `index.html`: The main Jinja2 template.
* `script.js`: Frontend JavaScript for interactivity, API calls, theme switching, etc.
* `presets.yaml`: Definitions for the UI preset examples.
* **`utils.py`:** General utility functions, such as audio encoding (`encode_audio`) and saving (`save_audio_to_file`) using the `soundfile` library.
* **Dependencies:** Relies heavily on `FastAPI`, `Uvicorn`, `PyTorch`, `torchaudio`, `huggingface_hub`, `safetensors`, `descript-audio-codec`, `soundfile`, `PyYAML`, `python-dotenv`, `pydantic`, and `Jinja2`.
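The speed-factor post-processing mentioned under `engine.py` can be illustrated with a simple resampling sketch. This is hypothetical; the actual implementation in `engine.py` may use a different technique:
```python
import numpy as np

def apply_speed_factor(audio: np.ndarray, speed_factor: float) -> np.ndarray:
    """Stretch or compress a mono waveform by linear interpolation so it
    plays `speed_factor` times faster. Note this also shifts pitch;
    engine.py's actual approach may differ."""
    n_out = max(1, int(round(len(audio) / speed_factor)))
    x_old = np.linspace(0.0, 1.0, num=len(audio))
    x_new = np.linspace(0.0, 1.0, num=n_out)
    return np.interp(x_new, x_old, audio).astype(audio.dtype)

# Example: at speed_factor 0.90, a 1-second 44.1 kHz clip
# becomes ~49,000 samples (slower, longer playback).
clip = np.zeros(44100, dtype=np.float32)
print(len(apply_speed_factor(clip, 0.90)))
```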
---
## 10. License and Disclaimer
* **License:** This project is licensed under the MIT License.
* **Disclaimer:** This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:
* **Identity Misuse**: Do not produce audio resembling real individuals without permission.
* **Deceptive Content**: Do not use this model to generate misleading content (e.g., fake news).
* **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.
By using this model, you agree to uphold relevant legal standards and ethical responsibilities. The creators **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.
---