Nvidia Triton Inference Serving Best Practice for Spark TTS
Quick Start
Directly launch the service using docker compose.
docker compose up
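Once the containers are up, you can optionally verify that the Triton server is ready. A minimal sanity check, assuming Triton's default HTTP port 8000 is reachable on the host:
curl http://localhost:8000/v2/health/ready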
Build Image
Build the docker image from scratch.
docker build . -f Dockerfile.server -t soar97/triton-spark-tts:25.02
Create Docker Container
your_mount_dir=/mnt:/mnt
docker run -it --name "spark-tts-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-spark-tts:25.02
Understanding run.sh
The run.sh script automates various steps using stages. You can run specific stages using:
bash run.sh <start_stage> <stop_stage> [service_type]
- <start_stage>: The stage to begin execution from (0-5).
- <stop_stage>: The stage to end execution at (0-5).
- [service_type]: Optional; specifies the service type ('streaming' or 'offline'). Defaults may apply based on script logic, but it is required for stages 4 and 5.
Stages:
- Stage 0: Download Spark-TTS-0.5B model from HuggingFace.
- Stage 1: Convert HuggingFace checkpoint to TensorRT-LLM format and build TensorRT engines.
- Stage 2: Create the Triton model repository structure and configure model files (adjusts for streaming/offline).
- Stage 3: Launch the Triton Inference Server.
- Stage 4: Run the gRPC benchmark client.
- Stage 5: Run the single utterance client (gRPC for streaming, HTTP for offline).
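Stages can be combined as needed. For example, to only download the model and build the TensorRT-LLM engines (stages 0 and 1) without preparing the model repository or starting the server, you could run:
bash run.sh 0 1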
Export Models to TensorRT-LLM and Launch Server
Inside the docker container, you can prepare the models and launch the Triton server by running stages 0 through 3. This involves downloading the original model, converting it to TensorRT-LLM format, building the optimized TensorRT engines, creating the necessary model repository structure for Triton, and finally starting the server.
# This runs stages 0, 1, 2, and 3
bash run.sh 0 3
Note: Stage 2 prepares the model repository differently based on whether you intend to run streaming or offline inference later. You might need to re-run stage 2 if switching service types.
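For example, if the server was originally prepared for streaming and you want to switch to offline inference, re-running stages 2 and 3 should rebuild the model repository and relaunch the server. This sketch assumes run.sh forwards the optional service type argument to stage 2; check the script if it does not:
bash run.sh 2 3 offline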
Single Utterance Client
Run a single inference request. Specify streaming or offline as the third argument.
Streaming Mode (gRPC):
bash run.sh 5 5 streaming
This executes the client_grpc.py script with predefined example text and prompt audio in streaming mode.
Offline Mode (HTTP):
bash run.sh 5 5 offline
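This runs the single-utterance client in offline mode over HTTP. If you want to confirm the model is loaded before sending requests, Triton's standard HTTP endpoints can be queried directly. A minimal check, assuming the default port 8000 and the spark_tts model name used elsewhere in this guide:
curl http://localhost:8000/v2/models/spark_tts/ready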
Benchmark using Dataset
Run the benchmark client against the running Triton server. Specify streaming or offline as the third argument.
# Run benchmark in streaming mode
bash run.sh 4 4 streaming
# Run benchmark in offline mode
bash run.sh 4 4 offline
# You can also customize parameters such as num_tasks directly in client_grpc.py or via command-line arguments if supported
# Example from run.sh (streaming):
# python3 client_grpc.py \
# --server-addr localhost \
# --model-name spark_tts \
# --num-tasks 2 \
# --mode streaming \
# --log-dir ./log_concurrent_tasks_2_streaming_new
# Example customizing dataset (requires modifying client_grpc.py or adding args):
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts --split-name wenetspeech4tts --mode [streaming|offline]
Benchmark Results
Decoding on a single L20 GPU, using 26 different prompt_audio/target_text pairs (total audio duration: 169 seconds). RTF is the real-time factor, i.e. compute time divided by the duration of the generated audio; lower is better.
| Mode | Note | Concurrency | Avg Latency | First Chunk Latency (P50) | RTF |
|---|---|---|---|---|---|
| Offline | Code Commit | 1 | 876.24 ms | - | 0.1362 |
| Offline | Code Commit | 2 | 920.97 ms | - | 0.0737 |
| Offline | Code Commit | 4 | 1611.51 ms | - | 0.0704 |
| Streaming | Code Commit | 1 | 913.28 ms | 210.42 ms | 0.1501 |
| Streaming | Code Commit | 2 | 1009.23 ms | 226.08 ms | 0.0862 |
| Streaming | Code Commit | 4 | 1793.86 ms | 1017.70 ms | 0.0824 |