# Dia TTS Server - Technical Documentation

**Version:** 1.0.0
**Date:** 2025-04-22

**Table of Contents:**

1. [Overview](#1-overview)
2. [Visual Overview](#2-visual-overview)
    * [Directory Structure](#21-directory-structure)
    * [Component Diagram](#22-component-diagram)
3. [System Prerequisites](#3-system-prerequisites)
4. [Installation and Setup](#4-installation-and-setup)
    * [Cloning the Repository](#41-cloning-the-repository)
    * [Setting up Python Virtual Environment](#42-setting-up-python-virtual-environment)
        * [Windows Setup](#421-windows-setup)
        * [Linux Setup (Debian/Ubuntu Example)](#422-linux-setup-debianubuntu-example)
    * [Installing Dependencies](#43-installing-dependencies)
    * [NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)
        * [Step 1: Check/Install NVIDIA Drivers](#441-step-1-checkinstall-nvidia-drivers)
        * [Step 2: Install PyTorch with CUDA Support](#442-step-2-install-pytorch-with-cuda-support)
        * [Step 3: Verify PyTorch CUDA Installation](#443-step-3-verify-pytorch-cuda-installation)
5. [Configuration](#5-configuration)
    * [Configuration Files (`.env` and `config.py`)](#51-configuration-files-env-and-configpy)
    * [Configuration Parameters](#52-configuration-parameters)
6. [Running the Server](#6-running-the-server)
7. [Usage](#7-usage)
    * [Web User Interface (Web UI)](#71-web-user-interface-web-ui)
        * [Main Generation Form](#711-main-generation-form)
        * [Presets](#712-presets)
        * [Voice Cloning](#713-voice-cloning)
        * [Generation Parameters](#714-generation-parameters)
        * [Server Configuration (UI)](#715-server-configuration-ui)
        * [Generated Audio Player](#716-generated-audio-player)
        * [Theme Toggle](#717-theme-toggle)
    * [API Endpoints](#72-api-endpoints)
        * [POST /v1/audio/speech (OpenAI Compatible)](#721-post-v1audiospeech-openai-compatible)
        * [POST /tts (Custom Parameters)](#722-post-tts-custom-parameters)
        * [Configuration & Helper Endpoints](#723-configuration--helper-endpoints)
8. [Troubleshooting](#8-troubleshooting)
9. [Project Architecture](#9-project-architecture)
10. [License and Disclaimer](#10-license-and-disclaimer)
---

## 1. Overview

The Dia TTS Server provides a backend service and web interface for generating high-fidelity speech, including dialogue with multiple speakers and non-verbal sounds, using the Dia text-to-speech model family (originally from Nari Labs, with support for community conversions such as SafeTensors).

The server is built on the FastAPI framework and offers both a RESTful API (including an OpenAI-compatible endpoint) and an interactive web UI powered by Jinja2, Tailwind CSS, and JavaScript. It supports voice cloning via audio prompts and allows configuration of various generation parameters.

**Key Features:**

* **High-Quality TTS:** Leverages the Dia model for realistic speech synthesis.
* **Dialogue Generation:** Supports `[S1]` and `[S2]` tags for multi-speaker dialogue.
* **Non-Verbal Sounds:** Can generate sounds like `(laughs)`, `(sighs)`, etc., when included in the text.
* **Voice Cloning:** Allows conditioning the output voice on a provided reference audio file.
* **Flexible Model Loading:** Supports loading models from Hugging Face repositories in both `.pth` and `.safetensors` formats (defaults to BF16 SafeTensors for efficiency).
* **API Access:** Provides a custom API endpoint (`/tts`) and an OpenAI-compatible endpoint (`/v1/audio/speech`).
* **Web Interface:** Offers an easy-to-use UI for text input, parameter adjustment, preset loading, reference audio management, and audio playback.
* **Configuration:** Server settings, model sources, paths, and default generation parameters are configurable via an `.env` file.
* **GPU Acceleration:** Uses NVIDIA GPUs via CUDA for significantly faster inference when available, falling back to CPU otherwise.
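The CPU fallback mentioned above typically reduces to a one-line device check. A minimal sketch (the guarded import and the helper name `pick_device` are illustrative, not the server's actual code):

```python
def pick_device() -> str:
    """Return "cuda" when a CUDA-capable GPU is visible to PyTorch, else "cpu"."""
    try:
        import torch  # treated as optional here so the sketch runs anywhere
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(f"Selected inference device: {pick_device()}")
```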
---

## 2. Visual Overview

### 2.1 Directory Structure

```
dia-tts-server/
│
├── .env                 # Local configuration overrides (user-created)
├── config.py            # Default configuration and management class
├── engine.py            # Core model loading and generation logic
├── models.py            # Pydantic models for API requests
├── requirements.txt     # Python dependencies
├── server.py            # Main FastAPI application, API endpoints, UI routes
├── utils.py             # Utility functions (audio encoding, saving, etc.)
│
├── dia/                 # Core Dia model implementation package
│   ├── __init__.py
│   ├── audio.py         # Audio processing helpers (delay, codebook conversion)
│   ├── config.py        # Pydantic models for Dia model architecture config
│   ├── layers.py        # Custom PyTorch layers for the Dia model
│   └── model.py         # Dia model class wrapper (loading, generation)
│
├── static/              # Static assets (e.g., favicon.ico)
│   └── favicon.ico
│
├── ui/                  # Web User Interface files
│   ├── index.html       # Main HTML template (Jinja2)
│   ├── presets.yaml     # Predefined UI examples
│   ├── script.js        # Frontend JavaScript logic
│   └── style.css        # Frontend CSS styling (Tailwind via CDN/build)
│
├── model_cache/         # Default directory for downloaded model files (configurable)
├── outputs/             # Default directory for saved audio output (configurable)
└── reference_audio/     # Default directory for voice cloning reference files (configurable)
```
### 2.2 Component Diagram

```
┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
│  User (Web UI /   │─────►│  FastAPI Server   │─────►│  TTS Engine       │─────►│ Dia Model Wrapper │
│  API Client)      │      │  (server.py)      │      │  (engine.py)      │      │  (dia/model.py)   │
└───────────────────┘      └─────────┬─────────┘      └─────────┬─────────┘      └─────────┬─────────┘
                                     │                          │                          │
                                     │ Uses                     │ Uses                     │ Uses
                                     ▼                          ▼                          ▼
                           ┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐
                           │  Configuration    │◄─────│  .env File        │      │ Dia Model Layers  │
                           │  (config.py)      │      └───────────────────┘      │  (dia/layers.py)  │
                           └─────────┬─────────┘                                 └─────────┬─────────┘
                                     │                                                     │ Uses
                                     │ Uses                                                ▼
                                     ▼                                           ┌───────────────────┐
                           ┌───────────────────┐                                 │  PyTorch / CUDA   │
                           │  Utilities        │                                 └─────────┬─────────┘
                           │  (utils.py)       │                                           │ Uses
                           └───────────────────┘                                           ▼
                                     ▲                                           ┌───────────────────┐
                                     │ Uses                                      │  DAC Model        │
┌───────────────────┐      ┌───────────────────┐                                 │ (descript-audio..)│
│  Web UI Files     │◄─────│  Jinja2 Templates │                                 └───────────────────┘
│  (ui/)            │      └───────────────────┘
└───────────────────┘              ▲
                                   │ Renders (server.py)
```

**Diagram Legend:**

* Boxes represent major components or file groups.
* Arrows (`─►`) indicate primary data flow or control flow.
* Lines labeled "Uses" indicate dependencies or function calls.

---
## 3. System Prerequisites

Before installing and running the Dia TTS Server, ensure your system meets the following requirements:

* **Operating System:**
    * Windows 10/11 (64-bit)
    * Linux (Debian/Ubuntu recommended; other distributions may require adjustments)
* **Python:** Python 3.10 or later (3.10.x is the tested version). Ensure Python and pip are on your system's PATH.
* **Version Control:** Git (for cloning the repository).
* **Internet Connection:** Required for downloading dependencies and model files.
* **(Optional but Highly Recommended for Performance):**
    * **NVIDIA GPU:** A CUDA-compatible NVIDIA GPU (Maxwell architecture or newer). Check compatibility [here](https://developer.nvidia.com/cuda-gpus). Sufficient VRAM is needed (the BF16 model requires ~5-6 GB, full precision ~10 GB).
    * **NVIDIA Drivers:** Latest appropriate drivers for your GPU and OS.
    * **CUDA Toolkit:** Version compatible with the chosen PyTorch build (e.g., 11.8, 12.1). See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration).
* **(Linux System Libraries):**
    * `libsndfile1`: Required by the `soundfile` Python library for audio I/O. Install it with your package manager (e.g., `sudo apt install libsndfile1` on Debian/Ubuntu).

---
## 4. Installation and Setup

Follow these steps to set up the project environment and install the necessary dependencies.

### 4.1 Cloning the Repository

Open your terminal or command prompt and navigate to the directory where you want to store the project. Then clone the repository:

```bash
git clone https://github.com/devnen/dia-tts-server.git # Replace with the actual repo URL if different
cd dia-tts-server
```

### 4.2 Setting up Python Virtual Environment

Using a virtual environment is strongly recommended to isolate project dependencies.

#### 4.2.1 Windows Setup

1. **Open PowerShell or Command Prompt** in the project directory (`dia-tts-server`).
2. **Create the virtual environment:**
    ```powershell
    python -m venv venv
    ```
3. **Activate the virtual environment:**
    ```powershell
    .\venv\Scripts\activate
    ```
    Your terminal prompt should now be prefixed with `(venv)`.
#### 4.2.2 Linux Setup (Debian/Ubuntu Example)

1. **Install prerequisites (if not already present):**
    ```bash
    sudo apt update
    sudo apt install python3 python3-venv python3-pip libsndfile1 -y
    ```
2. **Open your terminal** in the project directory (`dia-tts-server`).
3. **Create the virtual environment:**
    ```bash
    python3 -m venv venv
    ```
4. **Activate the virtual environment:**
    ```bash
    source venv/bin/activate
    ```
    Your terminal prompt should now be prefixed with `(venv)`.

### 4.3 Installing Dependencies

With your virtual environment activated (`(venv)` prefix visible), install the required Python packages:

```bash
# Upgrade pip first (optional but good practice)
pip install --upgrade pip

# Install all dependencies from requirements.txt
pip install -r requirements.txt
```

**Note:** This command installs the CPU-only version of PyTorch by default. If you have a compatible NVIDIA GPU and want acceleration, proceed to [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration) **before** running the server.
### 4.4 NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)

Follow these steps **only if you have a compatible NVIDIA GPU** and want faster inference.

#### 4.4.1 Step 1: Check/Install NVIDIA Drivers

1. **Check the existing driver:** Open Command Prompt (Windows) or a terminal (Linux) and run:
    ```bash
    nvidia-smi
    ```
2. **Interpret the output:**
    * If the command succeeds, note the **Driver Version** and the **CUDA Version** listed in the top right corner. This CUDA version is the *maximum* supported by your current driver.
    * If the command fails ("not recognized"), you need to install or update your NVIDIA drivers.
3. **Install/Update Drivers:** Go to the [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx) page. Select your GPU model and OS, then download and install the latest recommended driver (Game Ready or Studio). **Reboot your computer** after installation, then run `nvidia-smi` again to confirm it works.
#### 4.4.2 Step 2: Install PyTorch with CUDA Support

1. **Go to the PyTorch website:** Visit [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
2. **Configure:** Select:
    * **PyTorch Build:** Stable
    * **Your OS:** Windows or Linux
    * **Package:** Pip
    * **Language:** Python
    * **Compute Platform:** Choose the CUDA version **equal to or lower than** the version reported by `nvidia-smi`. For example, if `nvidia-smi` shows `CUDA Version: 12.4`, select `CUDA 12.1`; if it shows `11.8`, select `CUDA 11.8`. **Do not select a version higher than your driver supports.** (CUDA 12.1 and 11.8 are common stable choices.)
3. **Copy the command:** Copy the generated installation command. It will look similar to:
    ```bash
    # Example for CUDA 12.1 (Windows/Linux):
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

    # Example for CUDA 11.8 (Windows/Linux):
    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
    ```
    *(Use `pip` instead of `pip3` if that is your command.)*
4. **Install in the activated venv:**
    * Ensure your `(venv)` is active.
    * **Uninstall the CPU-only PyTorch first:**
        ```bash
        pip uninstall torch torchvision torchaudio -y
        ```
    * **Paste and run the copied command** from the PyTorch website.
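The selection rule in step 2 (pick the newest wheel whose CUDA requirement your driver satisfies) can be sketched as plain version arithmetic. Only the two wheel tags used as examples in this guide are considered; a real installer should consult the PyTorch website for current options:

```python
def choose_cuda_wheel(driver_cuda: str) -> str:
    """Map the CUDA version reported by nvidia-smi to a PyTorch wheel tag."""
    major, minor = (int(part) for part in driver_cuda.split("."))
    # wheel tag -> minimum driver CUDA version it requires (examples from this guide)
    wheels = {(12, 1): "cu121", (11, 8): "cu118"}
    # Pick the newest wheel whose CUDA requirement the driver satisfies.
    for requirement in sorted(wheels, reverse=True):
        if (major, minor) >= requirement:
            return wheels[requirement]
    raise ValueError(f"No listed wheel supports driver CUDA {driver_cuda}")

print(choose_cuda_wheel("12.4"))  # a 12.4 driver can run the cu121 wheel
print(choose_cuda_wheel("11.8"))
```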
#### 4.4.3 Step 3: Verify PyTorch CUDA Installation

1. With the `(venv)` still active, start a Python interpreter:
    ```bash
    python
    ```
2. Run the following Python code:
    ```python
    import torch

    print(f"PyTorch version: {torch.__version__}")
    cuda_available = torch.cuda.is_available()
    print(f"CUDA available: {cuda_available}")
    if cuda_available:
        print(f"CUDA version used by PyTorch: {torch.version.cuda}")
        print(f"Device count: {torch.cuda.device_count()}")
        print(f"Current device index: {torch.cuda.current_device()}")
        print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")
    else:
        print("CUDA not available to PyTorch. Ensure drivers and CUDA-enabled PyTorch are installed correctly.")
    exit()
    ```
3. If `CUDA available:` shows `True`, the setup was successful. If `False`, review the driver installation and the PyTorch installation command.
---

## 5. Configuration

The server's behavior, including model selection, paths, and default generation parameters, is controlled via configuration settings.

### 5.1 Configuration Files (`.env` and `config.py`)

* **`config.py`:** Defines the *default* values for all configuration parameters in the `DEFAULT_CONFIG` dictionary. It also contains the `ConfigManager` class and the getter functions used by the application.
* **`.env` File:** This file, located in the project root directory (`dia-tts-server/.env`), allows you to *override* the default values. Create this file if it doesn't exist. Settings are defined as `KEY=VALUE` pairs, one per line. The server reads this file on startup using `python-dotenv`.

**Priority:** Values set in the `.env` file take precedence over the defaults in `config.py`. Environment variables set directly in your system also override `.env` file values (though using `.env` is generally recommended for project-specific settings).
### 5.2 Configuration Parameters

The following parameters can be set in your `.env` file:

| Parameter Name (in `.env`) | Default Value (`config.py`) | Description | Example `.env` Value |
| :--- | :--- | :--- | :--- |
| **Server Settings** | | | |
| `HOST` | `0.0.0.0` | The network interface address the server listens on. `0.0.0.0` makes it accessible on your local network. | `127.0.0.1` (localhost only) |
| `PORT` | `8003` | The port number the server listens on. | `8080` |
| **Model Source Settings** | | | |
| `DIA_MODEL_REPO_ID` | `ttj/dia-1.6b-safetensors` | The Hugging Face repository ID containing the model files. | `nari-labs/Dia-1.6B` |
| `DIA_MODEL_CONFIG_FILENAME` | `config.json` | The filename of the model's configuration JSON within the repository. | `config.json` |
| `DIA_MODEL_WEIGHTS_FILENAME` | `dia-v0_1_bf16.safetensors` | The filename of the model weights file (`.safetensors` or `.pth`) within the repository to load. | `dia-v0_1.safetensors` or `dia-v0_1.pth` |
| **Path Settings** | | | |
| `DIA_MODEL_CACHE_PATH` | `./model_cache` | Local directory to store downloaded model files. Relative paths are based on the project root. | `/path/to/shared/cache` |
| `REFERENCE_AUDIO_PATH` | `./reference_audio` | Local directory to store reference audio files (`.wav`, `.mp3`) used for voice cloning. | `./voices` |
| `OUTPUT_PATH` | `./outputs` | Local directory where generated audio files from the Web UI are saved. | `./generated_speech` |
| **Default Generation Parameters** | | *(These set the initial UI values and can be saved via the UI)* | |
| `GEN_DEFAULT_SPEED_FACTOR` | `0.90` | Default playback speed factor applied *after* generation (UI slider initial value). | `1.0` |
| `GEN_DEFAULT_CFG_SCALE` | `3.0` | Default Classifier-Free Guidance scale (UI slider initial value). | `2.5` |
| `GEN_DEFAULT_TEMPERATURE` | `1.3` | Default sampling temperature (UI slider initial value). | `1.2` |
| `GEN_DEFAULT_TOP_P` | `0.95` | Default nucleus sampling probability (UI slider initial value). | `0.9` |
| `GEN_DEFAULT_CFG_FILTER_TOP_K` | `35` | Default Top-K value for CFG filtering (UI slider initial value). | `40` |
**Example `.env` File (Using the Original Nari Labs Model):**

```dotenv
# .env
# Example configuration to use the original Nari Labs model

HOST=0.0.0.0
PORT=8003

DIA_MODEL_REPO_ID=nari-labs/Dia-1.6B
DIA_MODEL_CONFIG_FILENAME=config.json
DIA_MODEL_WEIGHTS_FILENAME=dia-v0_1.pth

# Keep other paths as default or specify custom ones
# DIA_MODEL_CACHE_PATH=./model_cache
# REFERENCE_AUDIO_PATH=./reference_audio
# OUTPUT_PATH=./outputs

# Keep default generation parameters or override them
# GEN_DEFAULT_SPEED_FACTOR=0.90
# GEN_DEFAULT_CFG_SCALE=3.0
# GEN_DEFAULT_TEMPERATURE=1.3
# GEN_DEFAULT_TOP_P=0.95
# GEN_DEFAULT_CFG_FILTER_TOP_K=35
```

**Important:** You must **restart the server** after making changes to the `.env` file for them to take effect.
---

## 6. Running the Server

1. **Activate the virtual environment:** Ensure your virtual environment is activated (`(venv)` prefix).
    * Windows: `.\venv\Scripts\activate`
    * Linux: `source venv/bin/activate`
2. **Navigate to the project root:** Make sure your terminal is in the `dia-tts-server` directory.
3. **Run the server:**
    ```bash
    python server.py
    ```
4. **Server output:** You should see log messages indicating the server is starting, including:
    * The configuration being used (repo ID, filenames, paths).
    * The device being used (CPU or CUDA).
    * Model loading progress (downloading if necessary).
    * Confirmation that the server is running (e.g., `Uvicorn running on http://0.0.0.0:8003`).
    * URLs for accessing the Web UI and API docs.
5. **Accessing the server:**
    * **Web UI:** Open your web browser and go to `http://localhost:PORT` (e.g., `http://localhost:8003` with the default port). If the server runs on a different machine or VM, replace `localhost` with the server's IP address.
    * **API Docs:** Access the interactive API documentation (Swagger UI) at `http://localhost:PORT/docs`.
6. **Stopping the server:** Press `CTRL+C` in the terminal where the server is running.

**Auto-Reload:** The server is configured to run with `reload=True`, so Uvicorn automatically restarts it when it detects changes to `.py`, `.html`, `.css`, `.js`, `.env`, or `.yaml` files within the project or `ui` directory. This is useful for development but should generally be disabled in production.
---

## 7. Usage

The Dia TTS Server can be used via its Web UI or its API endpoints.

### 7.1 Web User Interface (Web UI)

Access the UI by navigating to the server's base URL (e.g., `http://localhost:8003`).

#### 7.1.1 Main Generation Form

* **Text to speak:** Enter the text you want to synthesize.
    * Use `[S1]` and `[S2]` tags to indicate speaker turns in dialogue.
    * Include non-verbal cues like `(laughs)`, `(sighs)`, `(clears throat)` directly in the text where desired.
    * For voice cloning, **prepend the exact transcript** of the selected reference audio before the text you want generated (e.g., `[S1] Reference transcript text. [S1] This is the new text to generate in the cloned voice.`).
* **Voice Mode:** Select the desired generation mode:
    * **Single / Dialogue (Use [S1]/[S2]):** Use this for single-speaker text (you can use `[S1]` or omit tags if the model handles it) or multi-speaker dialogue (using `[S1]` and `[S2]`).
    * **Voice Clone (from Reference):** Enables voice cloning based on a selected audio file. Requires selecting a file below and prepending its transcript to the text input.
* **Generate Speech Button:** Submits the text and settings to the server to start generation.
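The transcript-prepending convention above is easy to get wrong by hand. A small helper (purely illustrative, not part of the server; the server simply expects the concatenated string) can assemble the cloning input:

```python
def build_clone_text(reference_transcript: str, new_text: str, tag: str = "[S1]") -> str:
    """Prepend the reference transcript, with a speaker tag, to the text to generate."""
    def tagged(text: str) -> str:
        # Add the speaker tag only when the text doesn't already start with one.
        text = text.strip()
        return text if text.startswith("[S") else f"{tag} {text}"
    return f"{tagged(reference_transcript)} {tagged(new_text)}"

print(build_clone_text(
    "Reference transcript text.",
    "This is the new text to generate in the cloned voice.",
))
# -> [S1] Reference transcript text. [S1] This is the new text to generate in the cloned voice.
```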
#### 7.1.2 Presets

* Located below the Voice Mode selection.
* Clicking a preset button (e.g., "Standard Dialogue", "Expressive Narration") automatically populates the "Text to speak" area and the "Generation Parameters" sliders with predefined values, demonstrating different use cases.

#### 7.1.3 Voice Cloning

* This section appears only when "Voice Clone" mode is selected.
* **Reference Audio File Dropdown:** Lists the `.wav` and `.mp3` files found in the configured `REFERENCE_AUDIO_PATH`. Select the file whose voice you want to clone. Remember to prepend its transcript to the main text input.
* **Load Button:** Opens your system's file browser so you can select one or more `.wav` or `.mp3` files to upload. The selected files are copied to the server's `REFERENCE_AUDIO_PATH`, and the dropdown list refreshes automatically, with the first newly uploaded file selected.

#### 7.1.4 Generation Parameters

* Expand this section to fine-tune the generation process. These values correspond to the parameters used by the underlying Dia model.
* **Sliders:** Adjust Speed Factor, CFG Scale, Temperature, Top P, and CFG Filter Top K. The current value is displayed next to each label.
* **Save Generation Defaults Button:** Saves the *current* slider values to the `.env` file (as `GEN_DEFAULT_...` keys). These saved values become the defaults loaded into the UI the next time the server starts.
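Persisting slider values as `GEN_DEFAULT_...` keys amounts to rewriting a few `KEY=VALUE` lines in `.env`. A minimal stdlib sketch of that idea (not the server's actual implementation; the example filename is a placeholder so nothing clobbers a real `.env`):

```python
from pathlib import Path

def save_env_values(env_path: Path, updates: dict[str, str]) -> None:
    """Rewrite a .env-style file, replacing existing keys in place and appending new ones."""
    lines = env_path.read_text(encoding="utf-8").splitlines() if env_path.exists() else []
    remaining = dict(updates)
    for i, line in enumerate(lines):
        key = line.partition("=")[0].strip()
        if key in remaining:
            lines[i] = f"{key}={remaining.pop(key)}"  # overwrite existing entry
    lines.extend(f"{k}={v}" for k, v in remaining.items())  # append new keys
    env_path.write_text("\n".join(lines) + "\n", encoding="utf-8")

# Example: persist two of the generation sliders to a scratch file.
save_env_values(Path(".env.example-sketch"),
                {"GEN_DEFAULT_SPEED_FACTOR": "1.0", "GEN_DEFAULT_CFG_SCALE": "2.5"})
```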
#### 7.1.5 Server Configuration (UI)

* Expand this section to view and modify server-level settings stored in the `.env` file.
* **Fields:** Edit the Model Repo ID, Config/Weights Filenames, Cache/Reference/Output Paths, Host, and Port.
* **Save Server Configuration Button:** Saves the values currently shown in these fields to the `.env` file. **A server restart is required** for most of these changes (especially model source or paths) to take effect.
* **Restart Server Button:** (Appears after saving.) Attempts to trigger a server restart. This works best if the server was started with `reload=True` or is managed by a process manager such as systemd or Supervisor.

#### 7.1.6 Generated Audio Player

* Appears below the main form after a successful generation.
* **Waveform:** Visual representation of the generated audio.
* **Play/Pause Button:** Controls audio playback.
* **Download WAV Button:** Downloads the generated audio as a `.wav` file.
* **Info:** Displays the voice mode used, the generation time, and the audio duration.

#### 7.1.7 Theme Toggle

* Located in the top-right navigation bar.
* Click the Sun/Moon icon to switch between Light and Dark themes. Your preference is saved in your browser's `localStorage`.
### 7.2 API Endpoints

Access the interactive API documentation via the `/docs` path (e.g., `http://localhost:8003/docs`).

#### 7.2.1 POST `/v1/audio/speech` (OpenAI Compatible)

* **Purpose:** Provides an endpoint compatible with the basic OpenAI TTS API for easier integration with existing tools.
* **Request Body:** (`application/json`) - Uses the `OpenAITTSRequest` model.

| Field | Type | Required | Description | Example |
| :--- | :--- | :--- | :--- | :--- |
| `model` | string | No | Ignored by this server (it always uses Dia); included for compatibility. Defaults to `dia-1.6b`. | `"dia-1.6b"` |
| `input` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags for dialogue. For cloning, prepend the reference transcript. | `"Hello [S1] world."` |
| `voice` | string | No | Maps to Dia modes. Use `"S1"`, `"S2"`, `"dialogue"`, or the filename of a reference audio file (e.g., `"my_ref.wav"`) for cloning. Defaults to `S1`. | `"dialogue"` or `"ref.mp3"` |
| `response_format` | `"opus"` \| `"wav"` | No | Desired audio output format. Defaults to `opus`. | `"wav"` |
| `speed` | float | No | Playback speed factor (0.5-2.0), applied *after* generation. Defaults to `1.0`. | `0.9` |

* **Response:**
    * **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
    * **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
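A minimal client for this endpoint can be sketched with Python's standard library. The host, port, and text below are illustrative; the request shape follows the table above:

```python
import json
import urllib.request

def speech_request(base_url: str, text: str, voice: str = "dialogue",
                   response_format: str = "wav", speed: float = 1.0) -> urllib.request.Request:
    """Build a POST request matching the OpenAITTSRequest fields documented above."""
    payload = {
        "model": "dia-1.6b",  # ignored by the server, kept for compatibility
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = speech_request("http://localhost:8003", "[S1] Hello there. [S2] Hi!")
# To actually call a running server and save the audio:
# with urllib.request.urlopen(req) as resp, open("out.wav", "wb") as f:
#     f.write(resp.read())
```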
#### 7.2.2 POST `/tts` (Custom Parameters)

* **Purpose:** Allows generation using all of Dia's specific generation parameters.
* **Request Body:** (`application/json`) - Uses the `CustomTTSRequest` model.

| Field | Type | Required | Description | Default |
| :--- | :--- | :--- | :--- | :--- |
| `text` | string | Yes | The text to synthesize. Use `[S1]`/`[S2]` tags. Prepend the transcript for cloning. | |
| `voice_mode` | `"dialogue"` \| `"clone"` | No | Generation mode. Note: `single_s1`/`single_s2` are handled via `dialogue` mode with appropriate tags in the text. | `dialogue` |
| `clone_reference_filename` | string \| null | No | Filename of reference audio in `REFERENCE_AUDIO_PATH`. **Required if `voice_mode` is `clone`.** | `null` |
| `output_format` | `"opus"` \| `"wav"` | No | Desired audio output format. | `opus` |
| `max_tokens` | integer \| null | No | Maximum audio tokens to generate. `null` uses the model's default. | `null` |
| `cfg_scale` | float | No | Classifier-Free Guidance scale. | `3.0` |
| `temperature` | float | No | Sampling temperature. | `1.3` |
| `top_p` | float | No | Nucleus sampling probability. | `0.95` |
| `speed_factor` | float | No | Playback speed factor (0.5-2.0), applied *after* generation. | `0.90` |
| `cfg_filter_top_k` | integer | No | Top-K value for CFG filtering. | `35` |

* **Response:**
    * **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).
    * **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).
#### 7.2.3 Configuration & Helper Endpoints

* **GET `/get_config`:** Returns the current server configuration as JSON.
* **POST `/save_config`:** Saves server configuration settings provided in the JSON request body to the `.env` file. Requires a server restart.
* **POST `/save_generation_defaults`:** Saves default generation parameters provided in the JSON request body to the `.env` file. Affects the UI defaults on next load.
* **POST `/restart_server`:** Attempts to trigger a server restart (reliability depends on the execution environment).
* **POST `/upload_reference`:** Uploads one or more audio files (`.wav`, `.mp3`) as `multipart/form-data` to the reference audio directory. Returns JSON with the status and the updated file list.
* **GET `/health`:** Basic health check endpoint. Returns `{"status": "healthy", "model_loaded": true/false}`.
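A readiness probe can key off the documented `/health` response shape. In this sketch only the JSON shape comes from the endpoint description above; the probe logic itself is an illustration:

```python
import json

def is_ready(health_body: str) -> bool:
    """True only when the server reports healthy AND the model has finished loading."""
    data = json.loads(health_body)
    return data.get("status") == "healthy" and data.get("model_loaded") is True

print(is_ready('{"status": "healthy", "model_loaded": true}'))   # True
print(is_ready('{"status": "healthy", "model_loaded": false}'))  # False
```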
---

## 8. Troubleshooting

* **Error: `CUDA available: False` or slow performance:**
    * Verify the NVIDIA drivers are installed correctly (`nvidia-smi` command).
    * Ensure you installed the correct PyTorch version with CUDA support matching your driver (see [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)). Reinstall PyTorch using the command from the official website if unsure.
    * Check whether another process is using all the GPU VRAM.
* **Error: `ImportError: No module named 'dac'` (or `safetensors`, `yaml`, etc.):**
    * Make sure your virtual environment is activated.
    * Run `pip install -r requirements.txt` again to install missing dependencies.
    * Specifically for `dac`, ensure you installed `descript-audio-codec` and not a different package named `dac`: run `pip uninstall dac -y && pip install descript-audio-codec`.
* **Error: `libsndfile library not found` (or a similar `soundfile` error, mainly on Linux):**
    * Install the system library: `sudo apt update && sudo apt install libsndfile1` (Debian/Ubuntu) or the equivalent for your distribution.
* **Error: Model download fails (e.g., `HTTPError`, `ConnectionError`):**
    * Check your internet connection.
    * Verify that `DIA_MODEL_REPO_ID`, `DIA_MODEL_CONFIG_FILENAME`, and `DIA_MODEL_WEIGHTS_FILENAME` in your `.env` file (or the defaults in `config.py`) are correct and accessible on the Hugging Face Hub.
    * Check Hugging Face Hub status if multiple downloads fail.
    * Ensure the cache directory (`DIA_MODEL_CACHE_PATH`) is writable.
* **Error: `RuntimeError: Failed to load DAC model...`:**
    * This usually indicates an issue with the `descript-audio-codec` installation or a version incompatibility. Ensure it is installed correctly (see the `ImportError` entry above).
    * Check the logs for specific `AttributeError` messages (such as missing `utils` or `download`), which may indicate a version mismatch between what the Dia code expects and the installed library. The current code expects `dac.utils.download()`.
* **Error: `FileNotFoundError` during generation (reference audio):**
    * Ensure the filename selected or provided for voice cloning exists in the configured `REFERENCE_AUDIO_PATH`.
    * Check that the path in `config.py` or `.env` is correct and that the server has permission to read from it.
* **Error: Cannot save output/reference files (`PermissionError`, etc.):**
    * Ensure the directories specified by `OUTPUT_PATH` and `REFERENCE_AUDIO_PATH` exist and that the server process has write permission to them.
* **Web UI issues (buttons don't work, styles missing):**
    * Clear your browser cache.
    * Check the browser's developer console (usually F12) for JavaScript errors.
    * Ensure `ui/script.js` and `ui/style.css` are being loaded correctly (check the network tab in the developer tools).
* **Generation Cancel button doesn't stop the process:**
    * This is expected ("fake cancel"). The button currently only prevents the UI from processing the result when it eventually arrives; true cancellation is complex and not implemented. Clicking "Generate" again *will* cancel the *previous UI request's result processing* before starting the new one.
---

## 9. Project Architecture

* **`server.py`:** The main entry point, built on FastAPI. Defines the API routes, serves the Web UI via Jinja2, handles requests, and orchestrates calls to the engine.
* **`engine.py`:** Responsible for loading the Dia model (including downloading files via `huggingface_hub`), managing the model instance, preparing inputs for the model's `generate` method based on user requests (handling voice modes), and calling the model's generation function. Also handles post-processing such as speed adjustment.
* **`config.py`:** Manages all configuration settings using default values and overrides from a `.env` file. Provides getter functions for easy access to settings.
* **`dia/` package:** Contains the core implementation of the Dia model itself.
    * `model.py`: Defines the `Dia` class, which wraps the underlying PyTorch model (`DiaModel`). It handles loading weights (`.pth` or `.safetensors`), loading the required DAC model, preparing inputs for the `DiaModel` forward pass (including CFG logic), and running the autoregressive generation loop.
    * `config.py` (within `dia/`): Defines Pydantic models representing the *structure* and hyperparameters of the Dia model architecture (encoder, decoder, data parameters). This is loaded from the `config.json` file associated with the model weights.
    * `layers.py`: Contains the custom PyTorch `nn.Module` implementations used within `DiaModel` (e.g., attention blocks, MLP blocks, RoPE).
    * `audio.py`: Includes helper functions for audio processing specific to the model's tokenization and delay patterns (e.g., `audio_to_codebook`, `codebook_to_audio`, `apply_audio_delay`).
* **`ui/` directory:** Contains all files related to the Web UI.
    * `index.html`: The main Jinja2 template.
    * `script.js`: Frontend JavaScript for interactivity, API calls, theme switching, etc.
    * `presets.yaml`: Definitions for the UI preset examples.
* **`utils.py`:** General utility functions, such as audio encoding (`encode_audio`) and saving (`save_audio_to_file`) using the `soundfile` library.
* **Dependencies:** Relies heavily on `FastAPI`, `Uvicorn`, `PyTorch`, `torchaudio`, `huggingface_hub`, `safetensors`, `descript-audio-codec`, `soundfile`, `PyYAML`, `python-dotenv`, `pydantic`, and `Jinja2`.
---

## 10. License and Disclaimer

* **License:** This project is licensed under the MIT License.
* **Disclaimer:** This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:
    * **Identity Misuse:** Do not produce audio resembling real individuals without permission.
    * **Deceptive Content:** Do not use this model to generate misleading content (e.g., fake news).
    * **Illegal or Malicious Use:** Do not use this model for activities that are illegal or intended to cause harm.

By using this model, you agree to uphold relevant legal standards and ethical responsibilities. The creators **are not responsible** for any misuse and firmly oppose any unethical use of this technology.

---