# WhisperLive-TensorRT
We have only tested the TensorRT backend in Docker, so we recommend Docker for a smooth TensorRT backend setup.
**Note**: We use `tensorrt_llm==0.18.2`.
## Installation

- Install docker
- Install nvidia-container-toolkit
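Optionally, verify that Docker can see the GPU before building the image. A quick smoke test, assuming any CUDA-enabled base image (the exact tag below is just an example):

```bash
# Should print the GPU table if nvidia-container-toolkit is configured correctly
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
```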
## Run WhisperLive TensorRT in docker

```bash
docker build . -f docker/Dockerfile.tensorrt -t whisperlive-tensorrt
docker run -p 9090:9090 --runtime=nvidia --gpus all --entrypoint /bin/bash -it whisperlive-tensorrt
```
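The `docker run` command drops you into a shell inside the container. From there you can confirm that the GPU is visible and that the pinned `tensorrt_llm` build is installed (assuming the package exposes the usual `__version__` attribute):

```bash
# Inside the container shell
nvidia-smi
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"  # expect 0.18.2
```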
## Whisper TensorRT Engine

We build `small.en` and `small` multilingual TensorRT engines in the examples below. The script logs the path of the directory containing the Whisper TensorRT engine; we need that `model_path` to run the server.
```bash
# convert small.en
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en       # float16
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int8  # int8 weight only quantization
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small.en int4  # int4 weight only quantization

# convert small multilingual model
bash build_whisper_tensorrt.sh /app/TensorRT-LLM-examples small
```
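The engine directory name encodes the model and precision; for example, the float16 `small.en` build lands in `whisper_small_en_float16`, which is the path used below. As a sanity check that the build produced engine files:

```bash
# List the generated engine directory (the exact path is logged by build_whisper_tensorrt.sh)
ls /app/TensorRT-LLM-examples/whisper/whisper_small_en_float16
```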
## Run WhisperLive Server with TensorRT Backend

```bash
# Run English only model
python3 run_server.py --port 9090 \
                      --backend tensorrt \
                      --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_en_float16"

# Run Multilingual model
python3 run_server.py --port 9090 \
                      --backend tensorrt \
                      --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_float16" \
                      --trt_multilingual
```
By default, the TensorRT backend uses a C++ session. To use a Python session instead, pass `--trt_py_session` to `run_server.py`:
```bash
python3 run_server.py --port 9090 \
                      --backend tensorrt \
                      --trt_model_path "/app/TensorRT-LLM-examples/whisper/whisper_small_float16" \
                      --trt_py_session
```
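Once the server is up, audio can be streamed to it with the WhisperLive Python client. A minimal sketch, assuming the `whisper_live` package is installed on the client side and that `TranscriptionClient` accepts the arguments shown (check the upstream client for the exact signature); the audio path is a placeholder:

```python
from whisper_live.client import TranscriptionClient

# Connect to the WhisperLive server started above; host and port must match.
client = TranscriptionClient(
    "localhost",
    9090,
    lang="en",        # use the multilingual engine for non-English audio
    translate=False,  # set True to translate to English
    model="small",
)

# Transcribe a local audio file; calling client() with no argument
# streams from the microphone instead.
client("path/to/audio.wav")
```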