Spaces:

DroolingPanda
/

dia-tts-server

Running

File size: 41,730 Bytes

ac5de5b

# Dia TTS Server - Technical Documentation

**Version:** 1.0.0
**Date:** 2025-04-22

**Table of Contents:**

1.  [Overview](#1-overview)
2.  [Visual Overview](#2-visual-overview)
    *   [Directory Structure](#21-directory-structure)
    *   [Component Diagram](#22-component-diagram)
3.  [System Prerequisites](#3-system-prerequisites)
4.  [Installation and Setup](#4-installation-and-setup)
    *   [Cloning the Repository](#41-cloning-the-repository)
    *   [Setting up Python Virtual Environment](#42-setting-up-python-virtual-environment)
        *   [Windows Setup](#421-windows-setup)
        *   [Linux Setup (Debian/Ubuntu Example)](#422-linux-setup-debianubuntu-example)
    *   [Installing Dependencies](#43-installing-dependencies)
    *   [NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)
        *   [Step 1: Check/Install NVIDIA Drivers](#441-step-1-checkinstall-nvidia-drivers)
        *   [Step 2: Install PyTorch with CUDA Support](#442-step-2-install-pytorch-with-cuda-support)
        *   [Step 3: Verify PyTorch CUDA Installation](#443-step-3-verify-pytorch-cuda-installation)
5.  [Configuration](#5-configuration)
    *   [Configuration Files (`.env` and `config.py`)](#51-configuration-files-env-and-configpy)
    *   [Configuration Parameters](#52-configuration-parameters)
6.  [Running the Server](#6-running-the-server)
7.  [Usage](#7-usage)
    *   [Web User Interface (Web UI)](#71-web-user-interface-web-ui)
        *   [Main Generation Form](#711-main-generation-form)
        *   [Presets](#712-presets)
        *   [Voice Cloning](#713-voice-cloning)
        *   [Generation Parameters](#714-generation-parameters)
        *   [Server Configuration (UI)](#715-server-configuration-ui)
        *   [Generated Audio Player](#716-generated-audio-player)
        *   [Theme Toggle](#717-theme-toggle)
    *   [API Endpoints](#72-api-endpoints)
        *   [POST /v1/audio/speech (OpenAI Compatible)](#721-post-v1audiospeech-openai-compatible)
        *   [POST /tts (Custom Parameters)](#722-post-tts-custom-parameters)
        *   [Configuration & Helper Endpoints](#723-configuration--helper-endpoints)
8.  [Troubleshooting](#8-troubleshooting)
9.  [Project Architecture](#9-project-architecture)
10. [License and Disclaimer](#10-license-and-disclaimer)

---

## 1. Overview

The Dia TTS Server provides a backend service and web interface for generating high-fidelity speech, including dialogue with multiple speakers and non-verbal sounds, using the Dia text-to-speech model family (originally from Nari Labs, with support for community conversions like SafeTensors).

This server is built using the FastAPI framework and offers both a RESTful API (including an OpenAI-compatible endpoint) and an interactive web UI powered by Jinja2, Tailwind CSS, and JavaScript. It supports voice cloning via audio prompts and allows configuration of various generation parameters.

**Key Features:**

*   **High-Quality TTS:** Leverages the Dia model for realistic speech synthesis.
*   **Dialogue Generation:** Supports `[S1]` and `[S2]` tags for multi-speaker dialogue.
*   **Non-Verbal Sounds:** Can generate sounds like `(laughs)`, `(sighs)`, etc., when included in the text.
*   **Voice Cloning:** Allows conditioning the output voice on a provided reference audio file.
*   **Flexible Model Loading:** Supports loading models from Hugging Face repositories, including both `.pth` and `.safetensors` formats (defaults to BF16 SafeTensors for efficiency).
*   **API Access:** Provides a custom API endpoint (`/tts`) and an OpenAI-compatible endpoint (`/v1/audio/speech`).
*   **Web Interface:** Offers an easy-to-use UI for text input, parameter adjustment, preset loading, reference audio management, and audio playback.
*   **Configuration:** Server settings, model sources, paths, and default generation parameters are configurable via an `.env` file.
*   **GPU Acceleration:** Utilizes NVIDIA GPUs via CUDA for significantly faster inference when available, falling back to CPU otherwise.

---

## 2. Visual Overview

### 2.1 Directory Structure

```

dia-tts-server/

│

├── .env                  # Local configuration overrides (user-created)

├── config.py             # Default configuration and management class

├── engine.py             # Core model loading and generation logic

├── models.py             # Pydantic models for API requests

├── requirements.txt      # Python dependencies

├── server.py             # Main FastAPI application, API endpoints, UI routes

├── utils.py              # Utility functions (audio encoding, saving, etc.)

│

├── dia/                  # Core Dia model implementation package

│   ├── __init__.py

│   ├── audio.py          # Audio processing helpers (delay, codebook conversion)

│   ├── config.py         # Pydantic models for Dia model architecture config

│   ├── layers.py         # Custom PyTorch layers for the Dia model

│   └── model.py          # Dia model class wrapper (loading, generation)

│

├── static/               # Static assets (e.g., favicon.ico)

│   └── favicon.ico

│

├── ui/                   # Web User Interface files

│   ├── index.html        # Main HTML template (Jinja2)

│   ├── presets.yaml      # Predefined UI examples

│   ├── script.js         # Frontend JavaScript logic

│   └── style.css         # Frontend CSS styling (Tailwind via CDN/build)

│

├── model_cache/          # Default directory for downloaded model files (configurable)

├── outputs/              # Default directory for saved audio output (configurable)

└── reference_audio/      # Default directory for voice cloning reference files (configurable)

```

### 2.2 Component Diagram

```

┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐

│ User (Web UI /    │────→ │ FastAPI Server    │────→ │ TTS Engine        │────→ │ Dia Model Wrapper │

│ API Client)       │      │ (server.py)       │      │ (engine.py)       │      │ (dia/model.py)    │

└───────────────────┘      └─────────┬─────────┘      └─────────┬─────────┘      └─────────┬─────────┘

                                     │                          │                          │

                                     │ Uses                     │ Uses                     │ Uses

                                     ▼                          ▼                          ▼

                           ┌───────────────────┐      ┌───────────────────┐      ┌───────────────────┐

                           │ Configuration     │ ←─── │ .env File         │      │ Dia Model Layers  │

                           │ (config.py)       │      └───────────────────┘      │ (dia/layers.py)   │

                           └───────────────────┘                                 └───────────────────┘

                                     │                                                   │ Uses

                                     │ Uses                                                   │

                                     ▼                                                   │

                           ┌───────────────────┐                                         │ Uses

                           │ Utilities         │                                         ▼

                           │ (utils.py)        │                               ┌───────────────────┐

                           └───────────────────┘                               │ PyTorch / CUDA    │

                                     ▲                                         └───────────────────┘

                                     │ Uses                                             │ Uses

                                     │                                                  ▼

┌───────────────────┐      ┌───────────────────┐                           ┌───────────────────┐

│ Web UI Files      │ ←─── │ Jinja2 Templates  │                           │ DAC Model         │

│ (ui/)             │      └───────────────────┘                           │ (descript-audio..)│

└───────────────────┘               ▲                                      └───────────────────┘

                                    │ Renders                                        ▲

                                    │                                                │ Uses

                                    └────────────────────────────────────────────────┘

```

**Diagram Legend:**

*   Boxes represent major components or file groups.
*   Arrows (`→`) indicate primary data flow or control flow.
*   Lines with "Uses" indicate dependencies or function calls.

---

## 3. System Prerequisites

Before installing and running the Dia TTS Server, ensure your system meets the following requirements:

*   **Operating System:**
    *   Windows 10/11 (64-bit)
    *   Linux (Debian/Ubuntu recommended, other distributions may require adjustments)
*   **Python:** Python 3.10 or later (Python 3.10.x recommended based on tracebacks). Ensure Python and Pip are added to your system's PATH.
*   **Version Control:** Git (for cloning the repository).
*   **Internet Connection:** Required for downloading dependencies and model files.
*   **(Optional but Highly Recommended for Performance):**
    *   **NVIDIA GPU:** A CUDA-compatible NVIDIA GPU (Maxwell architecture or newer). Check compatibility [here](https://developer.nvidia.com/cuda-gpus). Sufficient VRAM is needed (BF16 model requires ~5-6GB, full precision ~10GB).
    *   **NVIDIA Drivers:** Latest appropriate drivers for your GPU and OS.
    *   **CUDA Toolkit:** Version compatible with the chosen PyTorch build (e.g., 11.8, 12.1). See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration).
*   **(Linux System Libraries):**
    *   `libsndfile1`: Required by the `soundfile` Python library for audio I/O. Install using your package manager (e.g., `sudo apt install libsndfile1` on Debian/Ubuntu).

---

## 4. Installation and Setup

Follow these steps to set up the project environment and install necessary dependencies.

### 4.1 Cloning the Repository

Open your terminal or command prompt and navigate to the directory where you want to store the project. Then, clone the repository:

```bash

git clone https://github.com/devnen/dia-tts-server.git # Replace with the actual repo URL if different

cd dia-tts-server

```

### 4.2 Setting up Python Virtual Environment

Using a virtual environment is strongly recommended to isolate project dependencies.

#### 4.2.1 Windows Setup

1.  **Open PowerShell or Command Prompt** in the project directory (`dia-tts-server`).
2.  **Create the virtual environment:**
    ```powershell

    python -m venv venv

    ```

3.  **Activate the virtual environment:**

    ```powershell

    .\venv\Scripts\activate

    ```

    Your terminal prompt should now be prefixed with `(venv)`.


#### 4.2.2 Linux Setup (Debian/Ubuntu Example)

1.  **Install prerequisites (if not already present):**
    ```bash

    sudo apt update

    sudo apt install python3 python3-venv python3-pip libsndfile1 -y

    ```

2.  **Open your terminal** in the project directory (`dia-tts-server`).

3.  **Create the virtual environment:**

    ```bash

    python3 -m venv venv

    ```

4.  **Activate the virtual environment:**

    ```bash

    source venv/bin/activate

    ```

    Your terminal prompt should now be prefixed with `(venv)`.


### 4.3 Installing Dependencies

With your virtual environment activated (`(venv)` prefix visible), install the required Python packages:

```bash

# Upgrade pip first (optional but good practice)

pip install --upgrade pip



# Install all dependencies from requirements.txt

pip install -r requirements.txt

```

**Note:** This command installs the CPU-only version of PyTorch by default. If you have a compatible NVIDIA GPU and want acceleration, proceed to [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration) **before** running the server.

### 4.4 NVIDIA Driver and CUDA Setup (Required for GPU Acceleration)

Follow these steps **only if you have a compatible NVIDIA GPU** and want faster inference.

#### 4.4.1 Step 1: Check/Install NVIDIA Drivers

1.  **Check Existing Driver:** Open Command Prompt (Windows) or Terminal (Linux) and run:
    ```bash

    nvidia-smi

    ```

2.  **Interpret Output:**

    *   If the command runs successfully, note the **Driver Version** and the **CUDA Version** listed in the top right corner. This CUDA version is the *maximum* supported by your current driver.

    *   If the command fails ("not recognized"), you need to install or update your NVIDIA drivers.

3.  **Install/Update Drivers:** Go to the [NVIDIA Driver Downloads](https://www.nvidia.com/Download/index.aspx) page. Select your GPU model and OS, then download and install the latest recommended driver (Game Ready or Studio). **Reboot your computer** after installation. Run `nvidia-smi` again to confirm it works.


#### 4.4.2 Step 2: Install PyTorch with CUDA Support

1.  **Go to PyTorch Website:** Visit [https://pytorch.org/get-started/locally/](https://pytorch.org/get-started/locally/).
2.  **Configure:** Select:
    *   **PyTorch Build:** Stable
    *   **Your OS:** Windows or Linux
    *   **Package:** Pip
    *   **Language:** Python
    *   **Compute Platform:** Choose the CUDA version **equal to or lower than** the version reported by `nvidia-smi`. For example, if `nvidia-smi` shows `CUDA Version: 12.4`, select `CUDA 12.1`. If it shows `11.8`, select `CUDA 11.8`. **Do not select a version higher than your driver supports.** (CUDA 12.1 or 11.8 are common stable choices).
3.  **Copy Command:** Copy the generated installation command. It will look similar to:
    ```bash

    # Example for CUDA 12.1 (Windows/Linux):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

    # Example for CUDA 11.8 (Windows/Linux):

    pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

    ```

    *(Use `pip` instead of `pip3` if that's your command)*

4.  **Install in Activated venv:**

    *   Ensure your `(venv)` is active.

    *   **Uninstall CPU PyTorch first:**

        ```bash

        pip uninstall torch torchvision torchaudio -y

        ```

    *   **Paste and run the copied command** from the PyTorch website.


#### 4.4.3 Step 3: Verify PyTorch CUDA Installation

1.  With the `(venv)` still active, start a Python interpreter:
    ```bash

    python

    ```

2.  Run the following Python code:

    ```python

    import torch

    print(f"PyTorch version: {torch.__version__}")

    cuda_available = torch.cuda.is_available()

    print(f"CUDA available: {cuda_available}")

    if cuda_available:

        print(f"CUDA version used by PyTorch: {torch.version.cuda}")

        print(f"Device count: {torch.cuda.device_count()}")

        print(f"Current device index: {torch.cuda.current_device()}")

        print(f"Device name: {torch.cuda.get_device_name(torch.cuda.current_device())}")

    else:

        print("CUDA not available to PyTorch. Ensure drivers and CUDA-enabled PyTorch are installed correctly.")

    exit()

    ```

3.  If `CUDA available:` shows `True`, the setup was successful. If `False`, review driver installation and the PyTorch installation command.


---

## 5. Configuration

The server's behavior, including model selection, paths, and default generation parameters, is controlled via configuration settings.

### 5.1 Configuration Files (`.env` and `config.py`)

*   **`config.py`:** Defines the *default* values for all configuration parameters in the `DEFAULT_CONFIG` dictionary. It also contains the `ConfigManager` class and getter functions used by the application.
*   **`.env` File:** This file, located in the project root directory (`dia-tts-server/.env`), allows you to *override* the default values. Create this file if it doesn't exist. Settings are defined as `KEY=VALUE` pairs, one per line. The server reads this file on startup using `python-dotenv`.

**Priority:** Values set in the `.env` file take precedence over the defaults in `config.py`. Environment variables set directly in your system also override `.env` file values (though using `.env` is generally recommended for project-specific settings).

### 5.2 Configuration Parameters

The following parameters can be set in your `.env` file:

| Parameter Name (in `.env`)         | Default Value (`config.py`)        | Description                                                                                                | Example `.env` Value                 |
| :--------------------------------- | :--------------------------------- | :--------------------------------------------------------------------------------------------------------- | :----------------------------------- |
| **Server Settings**                |                                    |                                                                                                            |                                      |
| `HOST`                             | `0.0.0.0`                          | The network interface address the server listens on. `0.0.0.0` makes it accessible on your local network. | `127.0.0.1` (localhost only)         |
| `PORT`                             | `8003`                             | The port number the server listens on.                                                                     | `8080`                               |
| **Model Source Settings**          |                                    |                                                                                                            |                                      |
| `DIA_MODEL_REPO_ID`                | `ttj/dia-1.6b-safetensors`         | The Hugging Face repository ID containing the model files.                                                 | `nari-labs/Dia-1.6B`                 |
| `DIA_MODEL_CONFIG_FILENAME`        | `config.json`                      | The filename of the model's configuration JSON within the repository.                                      | `config.json`                        |
| `DIA_MODEL_WEIGHTS_FILENAME`       | `dia-v0_1_bf16.safetensors`        | The filename of the model weights file (`.safetensors` or `.pth`) within the repository to load.           | `dia-v0_1.safetensors` or `dia-v0_1.pth` |
| **Path Settings**                  |                                    |                                                                                                            |                                      |
| `DIA_MODEL_CACHE_PATH`             | `./model_cache`                    | Local directory to store downloaded model files. Relative paths are based on the project root.             | `/path/to/shared/cache`              |
| `REFERENCE_AUDIO_PATH`             | `./reference_audio`                | Local directory to store reference audio files (`.wav`, `.mp3`) used for voice cloning.                      | `./voices`                           |
| `OUTPUT_PATH`                      | `./outputs`                        | Local directory where generated audio files from the Web UI are saved.                                     | `./generated_speech`                 |
| **Default Generation Parameters**  |                                    | *(These set the initial UI values and can be saved via the UI)*                                            |                                      |
| `GEN_DEFAULT_SPEED_FACTOR`         | `0.90`                             | Default playback speed factor applied *after* generation (UI slider initial value).                        | `1.0`                                |
| `GEN_DEFAULT_CFG_SCALE`            | `3.0`                              | Default Classifier-Free Guidance scale (UI slider initial value).                                          | `2.5`                                |
| `GEN_DEFAULT_TEMPERATURE`          | `1.3`                              | Default sampling temperature (UI slider initial value).                                                    | `1.2`                                |
| `GEN_DEFAULT_TOP_P`                | `0.95`                             | Default nucleus sampling probability (UI slider initial value).                                            | `0.9`                                |
| `GEN_DEFAULT_CFG_FILTER_TOP_K`     | `35`                               | Default Top-K value for CFG filtering (UI slider initial value).                                           | `40`                                 |

**Example `.env` File (Using Original Nari Labs Model):**

```dotenv

# .env

# Example configuration to use the original Nari Labs model



HOST=0.0.0.0

PORT=8003



DIA_MODEL_REPO_ID=nari-labs/Dia-1.6B

DIA_MODEL_CONFIG_FILENAME=config.json

DIA_MODEL_WEIGHTS_FILENAME=dia-v0_1.pth



# Keep other paths as default or specify custom ones

# DIA_MODEL_CACHE_PATH=./model_cache

# REFERENCE_AUDIO_PATH=./reference_audio

# OUTPUT_PATH=./outputs



# Keep default generation parameters or override them

# GEN_DEFAULT_SPEED_FACTOR=0.90

# GEN_DEFAULT_CFG_SCALE=3.0

# GEN_DEFAULT_TEMPERATURE=1.3

# GEN_DEFAULT_TOP_P=0.95

# GEN_DEFAULT_CFG_FILTER_TOP_K=35

```

**Important:** You must **restart the server** after making changes to the `.env` file for them to take effect.

---

## 6. Running the Server

1.  **Activate Virtual Environment:** Ensure your virtual environment is activated (`(venv)` prefix).
    *   Windows: `.\venv\Scripts\activate`
    *   Linux: `source venv/bin/activate`
2.  **Navigate to Project Root:** Make sure your terminal is in the `dia-tts-server` directory.
3.  **Run the Server:**
    ```bash

    python server.py

    ```

4.  **Server Output:** You should see log messages indicating the server is starting, including:

    *   The configuration being used (repo ID, filenames, paths).

    *   The device being used (CPU or CUDA).

    *   Model loading progress (downloading if necessary).

    *   Confirmation that the server is running (e.g., `Uvicorn running on http://0.0.0.0:8003`).

    *   URLs for accessing the Web UI and API Docs.


5.  **Accessing the Server:**
    *   **Web UI:** Open your web browser and go to `http://localhost:PORT` (e.g., `http://localhost:8003` if using the default port). If running on a different machine or VM, replace `localhost` with the server's IP address.
    *   **API Docs:** Access the interactive API documentation (Swagger UI) at `http://localhost:PORT/docs`.
6.  **Stopping the Server:** Press `CTRL+C` in the terminal where the server is running.

**Auto-Reload:** The server is configured to run with `reload=True`. This means Uvicorn will automatically restart the server if it detects changes in `.py`, `.html`, `.css`, `.js`, `.env`, or `.yaml` files within the project or `ui` directory. This is useful for development but should generally be disabled in production.

---

## 7. Usage

The Dia TTS Server can be used via its Web UI or its API endpoints.

### 7.1 Web User Interface (Web UI)

Access the UI by navigating to the server's base URL (e.g., `http://localhost:8003`).

#### 7.1.1 Main Generation Form

*   **Text to speak:** Enter the text you want to synthesize.
    *   Use `[S1]` and `[S2]` tags to indicate speaker turns for dialogue.
    *   Include non-verbal cues like `(laughs)`, `(sighs)`, `(clears throat)` directly in the text where desired.
    *   For voice cloning, **prepend the exact transcript** of the selected reference audio before the text you want generated (e.g., `[S1] Reference transcript text. [S1] This is the new text to generate in the cloned voice.`).
*   **Voice Mode:** Select the desired generation mode:
    *   **Single / Dialogue (Use [S1]/[S2]):** Use this for single-speaker text (you can use `[S1]` or omit tags if the model handles it) or multi-speaker dialogue (using `[S1]` and `[S2]`).
    *   **Voice Clone (from Reference):** Enables voice cloning based on a selected audio file. Requires selecting a file below and prepending its transcript to the text input.
*   **Generate Speech Button:** Submits the text and settings to the server to start generation.

#### 7.1.2 Presets

*   Located below the Voice Mode selection.
*   Clicking a preset button (e.g., "Standard Dialogue", "Expressive Narration") will automatically populate the "Text to speak" area and the "Generation Parameters" sliders with predefined values, demonstrating different use cases.

#### 7.1.3 Voice Cloning

*   This section appears only when "Voice Clone" mode is selected.
*   **Reference Audio File Dropdown:** Lists available `.wav` and `.mp3` files found in the configured `REFERENCE_AUDIO_PATH`. Select the file whose voice you want to clone. Remember to prepend its transcript to the main text input.
*   **Load Button:** Click this to open your system's file browser. You can select one or more `.wav` or `.mp3` files to upload. The selected files will be copied to the server's `REFERENCE_AUDIO_PATH`, and the dropdown list will refresh automatically. The first newly uploaded file will be selected in the dropdown.

#### 7.1.4 Generation Parameters

*   Expand this section to fine-tune the generation process. These values correspond to the parameters used by the underlying Dia model.
*   **Sliders:** Adjust Speed Factor, CFG Scale, Temperature, Top P, and CFG Filter Top K. The current value is displayed next to the label.
*   **Save Generation Defaults Button:** Saves the *current* values of these sliders to the `.env` file (as `GEN_DEFAULT_...` keys). These saved values will become the default settings loaded into the UI the next time the server starts.

#### 7.1.5 Server Configuration (UI)

*   Expand this section to view and modify server-level settings stored in the `.env` file.
*   **Fields:** Edit Model Repo ID, Config/Weights Filenames, Cache/Reference/Output Paths, Host, and Port.
*   **Save Server Configuration Button:** Saves the values currently shown in these fields to the `.env` file. **A server restart is required** for most of these changes (especially model source or paths) to take effect.
*   **Restart Server Button:** (Appears after saving) Attempts to trigger a server restart. This works best if the server was started with `reload=True` or is managed by a process manager like systemd or Supervisor.

#### 7.1.6 Generated Audio Player

*   Appears below the main form after a successful generation.
*   **Waveform:** Visual representation of the generated audio.
*   **Play/Pause Button:** Controls audio playback.
*   **Download WAV Button:** Downloads the generated audio as a `.wav` file.
*   **Info:** Displays the voice mode used, generation time, and audio duration.

#### 7.1.7 Theme Toggle

*   Located in the top-right navigation bar.
*   Click the Sun/Moon icon to switch between Light and Dark themes. Your preference is saved in your browser's `localStorage`.

### 7.2 API Endpoints

Access the interactive API documentation via the `/docs` path (e.g., `http://localhost:8003/docs`).

#### 7.2.1 POST `/v1/audio/speech` (OpenAI Compatible)

*   **Purpose:** Provides an endpoint compatible with the basic OpenAI TTS API for easier integration with existing tools.
*   **Request Body:** (`application/json`) - Uses the `OpenAITTSRequest` model.
    | Field             | Type                     | Required | Description                                                                                                                               | Example                     |

    | :---------------- | :----------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------- | :-------------------------- |

    | `model`           | string                   | No       | Ignored by this server (always uses Dia). Included for compatibility. Defaults to `dia-1.6b`.                                               | `"dia-1.6b"`                |

    | `input`           | string                   | Yes      | The text to synthesize. Use `[S1]`/`[S2]` tags for dialogue. For cloning, prepend reference transcript.                                    | `"Hello [S1] world."`       |

    | `voice`           | string                   | No       | Maps to Dia modes. Use `"S1"`, `"S2"`, `"dialogue"`, or the filename of a reference audio (e.g., `"my_ref.wav"`) for cloning. Defaults to `S1`. | `"dialogue"` or `"ref.mp3"` |

    | `response_format` | `"opus"` \| `"wav"`      | No       | Desired audio output format. Defaults to `opus`.                                                                                          | `"wav"`                     |

    | `speed`           | float                    | No       | Playback speed factor (0.5-2.0). Applied *after* generation. Defaults to `1.0`.                                                           | `0.9`                       |

*   **Response:**

    *   **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).

    *   **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).


#### 7.2.2 POST `/tts` (Custom Parameters)

*   **Purpose:** Allows generation using all specific Dia generation parameters.
*   **Request Body:** (`application/json`) - Uses the `CustomTTSRequest` model.
    | Field                      | Type                                   | Required | Description                                                                                                                               | Default     |

    | :------------------------- | :------------------------------------- | :------- | :---------------------------------------------------------------------------------------------------------------------------------------- | :---------- |

    | `text`                     | string                                 | Yes      | The text to synthesize. Use `[S1]`/`[S2]` tags. Prepend transcript for cloning.                                                           |             |

    | `voice_mode`               | `"dialogue"` \| `"clone"`              | No       | Generation mode. Note: `single_s1`/`single_s2` are handled via `dialogue` mode with appropriate tags in the text.                         | `dialogue`  |

    | `clone_reference_filename` | string \| null                         | No       | Filename of reference audio in `REFERENCE_AUDIO_PATH`. **Required if `voice_mode` is `clone`**.                                           | `null`      |

    | `output_format`            | `"opus"` \| `"wav"`                    | No       | Desired audio output format.                                                                                                              | `opus`      |

    | `max_tokens`               | integer \| null                        | No       | Maximum audio tokens to generate. `null` uses the model's default.                                                                        | `null`      |

    | `cfg_scale`                | float                                  | No       | Classifier-Free Guidance scale.                                                                                                           | `3.0`       |

    | `temperature`              | float                                  | No       | Sampling temperature.                                                                                                                     | `1.3`       |

    | `top_p`                    | float                                  | No       | Nucleus sampling probability.                                                                                                             | `0.95`      |

    | `speed_factor`             | float                                  | No       | Playback speed factor (0.5-2.0). Applied *after* generation.                                                                              | `0.90`      |

    | `cfg_filter_top_k`         | integer                                | No       | Top-K value for CFG filtering.                                                                                                            | `35`        |

*   **Response:**

    *   **Success (200 OK):** `StreamingResponse` containing the binary audio data (`audio/opus` or `audio/wav`).

    *   **Error:** Standard FastAPI JSON error response (e.g., 400, 404, 500).


#### 7.2.3 Configuration & Helper Endpoints

*   **GET `/get_config`:** Returns the current server configuration as JSON.

*   **POST `/save_config`:** Saves server configuration settings provided in the JSON request body to the `.env` file. Requires server restart.
*   **POST `/save_generation_defaults`:** Saves default generation parameters provided in the JSON request body to the `.env` file. Affects UI defaults on next load.
*   **POST `/restart_server`:** Attempts to trigger a server restart (reliability depends on execution environment).

*   **POST `/upload_reference`:** Uploads one or more audio files (`.wav`, `.mp3`) as `multipart/form-data` to the reference audio directory. Returns JSON with status and updated file list.
*   **GET `/health`:** Basic health check endpoint. Returns `{"status": "healthy", "model_loaded": true/false}`.

---

## 8. Troubleshooting

*   **Error: `CUDA available: False` or Slow Performance:**
    *   Verify NVIDIA drivers are installed correctly (`nvidia-smi` command).
    *   Ensure you installed the correct PyTorch version with CUDA support matching your driver (See [Section 4.4](#44-nvidia-driver-and-cuda-setup-required-for-gpu-acceleration)). Reinstall PyTorch using the command from the official website if unsure.
    *   Check if another process is using all GPU VRAM.
*   **Error: `ImportError: No module named 'dac'` (or `safetensors`, `yaml`, etc.):**
    *   Make sure your virtual environment is activated.
    *   Run `pip install -r requirements.txt` again to install missing dependencies.
    *   Specifically for `dac`, ensure you installed `descript-audio-codec` and not a different package named `dac`. Run `pip uninstall dac -y && pip install descript-audio-codec`.
*   **Error: `libsndfile library not found` (or similar `soundfile` error, mainly on Linux):**
    *   Install the system library: `sudo apt update && sudo apt install libsndfile1` (Debian/Ubuntu) or the equivalent for your distribution.
*   **Error: Model Download Fails (e.g., `HTTPError`, `ConnectionError`):**
    *   Check your internet connection.
    *   Verify the `DIA_MODEL_REPO_ID`, `DIA_MODEL_CONFIG_FILENAME`, and `DIA_MODEL_WEIGHTS_FILENAME` in your `.env` file (or defaults in `config.py`) are correct and accessible on Hugging Face Hub.
    *   Check Hugging Face Hub status if multiple downloads fail.
    *   Ensure the cache directory (`DIA_MODEL_CACHE_PATH`) is writable.
*   **Error: `RuntimeError: Failed to load DAC model...`:**
    *   This usually indicates an issue with the `descript-audio-codec` installation or version incompatibility. Ensure it's installed correctly (see `ImportError` above).
    *   Check logs for specific `AttributeError` messages (like missing `utils` or `download`) which might indicate version mismatches between the Dia code's expectation and the installed library. The current code expects `dac.utils.download()`.
*   **Error: `FileNotFoundError` during generation (Reference Audio):**
    *   Ensure the filename selected/provided for voice cloning exists in the configured `REFERENCE_AUDIO_PATH`.
    *   Check that the path in `config.py` or `.env` is correct and the server has permission to read from it.
*   **Error: Cannot Save Output/Reference Files (`PermissionError`, etc.):**
    *   Ensure the directories specified by `OUTPUT_PATH` and `REFERENCE_AUDIO_PATH` exist and the server process has write permissions to them.
*   **Web UI Issues (Buttons don't work, styles missing):**
    *   Clear your browser cache.
    *   Check the browser's developer console (usually F12) for JavaScript errors.
    *   Ensure `ui/script.js` and `ui/style.css` are being loaded correctly (check network tab in developer tools).
*   **Generation Cancel Button Doesn't Stop Process:**
    *   This is expected ("Fake Cancel"). The button currently only prevents the UI from processing the result when it eventually arrives. True cancellation is complex and not implemented. Clicking "Generate" again *will* cancel the *previous UI request's result processing* before starting the new one.

---

## 9. Project Architecture

*   **`server.py`:** The main entry point using FastAPI. Defines API routes, serves the Web UI using Jinja2, handles requests, and orchestrates calls to the engine.
*   **`engine.py`:** Responsible for loading the Dia model (including downloading files via `huggingface_hub`), managing the model instance, preparing inputs for the model's `generate` method based on user requests (handling voice modes), and calling the model's generation function. Also handles post-processing like speed adjustment.
*   **`config.py`:** Manages all configuration settings using default values and overrides from a `.env` file. Provides getter functions for easy access to settings.
*   **`dia/` package:** Contains the core implementation of the Dia model itself.
    *   `model.py`: Defines the `Dia` class, which wraps the underlying PyTorch model (`DiaModel`). It handles loading weights (`.pth` or `.safetensors`), loading the required DAC model, preparing inputs specifically for the `DiaModel` forward pass (including CFG logic), and running the autoregressive generation loop.
    *   `config.py` (within `dia/`): Defines Pydantic models representing the *structure* and hyperparameters of the Dia model architecture (encoder, decoder, data parameters). This is loaded from the `config.json` file associated with the model weights.
    *   `layers.py`: Contains custom PyTorch `nn.Module` implementations used within the `DiaModel` (e.g., Attention blocks, MLP blocks, RoPE).
    *   `audio.py`: Includes helper functions for audio processing specific to the model's tokenization and delay patterns (e.g., `audio_to_codebook`, `codebook_to_audio`, `apply_audio_delay`).
*   **`ui/` directory:** Contains all files related to the Web UI.
    *   `index.html`: The main Jinja2 template.
    *   `script.js`: Frontend JavaScript for interactivity, API calls, theme switching, etc.
    *   `presets.yaml`: Definitions for the UI preset examples.
*   **`utils.py`:** General utility functions, such as audio encoding (`encode_audio`) and saving (`save_audio_to_file`) using the `soundfile` library.
*   **Dependencies:** Relies heavily on `FastAPI`, `Uvicorn`, `PyTorch`, `torchaudio`, `huggingface_hub`, `safetensors`, `descript-audio-codec`, `soundfile`, `PyYAML`, `python-dotenv`, `pydantic`, and `Jinja2`.

---

## 10. License and Disclaimer

*   **License:** This project is licensed under the MIT License.
*   **Disclaimer:** This project offers a high-fidelity speech generation model intended solely for research and educational use. The following uses are **strictly forbidden**:
    *   **Identity Misuse**: Do not produce audio resembling real individuals without permission.
    *   **Deceptive Content**: Do not use this model to generate misleading content (e.g. fake news)
    *   **Illegal or Malicious Use**: Do not use this model for activities that are illegal or intended to cause harm.

    By using this model, you agree to uphold relevant legal standards and ethical responsibilities. The creators **are not responsible** for any misuse and firmly oppose any unethical usage of this technology.


---