Blazingly fast whisper transcriptions with Inference Endpoints

Published May 13, 2025

Today we are happy to introduce a new blazing fast OpenAI Whisper deployment option on Inference Endpoints. It provides up to 8x performance improvements compared to the previous version, and makes everyone one click away from deploying dedicated, powerful transcription models in a cost-effective way, leveraging the amazing work done by the AI community.

Through this release, we would like to make Inference Endpoints more community-centric and allow anyone to contribute and build incredible inference deployments on the Hugging Face platform. Together with the community, we would like to propose optimized deployments for a wide range of tasks, built on awesome, openly available open-source technologies.

The unique position of Hugging Face, at the heart of the Open-Source AI Community, working hand-in-hand with individuals, institutions and industrial partners, makes it the most heterogeneous platform when it comes to deploying AI models for inference on a wide variety of hardware and software.

Inference Stack

The new Whisper endpoint leverages amazing open-source community projects. Inference is powered by the vLLM project, which provides efficient ways of running AI models on various hardware families – especially, but not limited to, NVIDIA GPUs. We use the vLLM implementation of OpenAI's Whisper model, allowing us to enable further, lower-level optimizations down the software stack.

In this initial release, we are targeting NVIDIA GPUs with compute capability 8.9 or higher (Ada Lovelace), such as the L4 and L40S, which unlocks a wide range of software optimizations:

  • PyTorch compilation (torch.compile)
  • CUDA graphs
  • float8 KV cache

Compilation with torch.compile generates optimized kernels in a Just-In-Time (JIT) fashion, which can modify the computational graph, reorder operations, call specialized methods, and more.
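
As a purely illustrative sketch (not the actual vLLM integration, and requiring a CUDA GPU), here is what compiling a small function with torch.compile looks like; the first call traces the function and generates fused kernels for the observed shapes, and subsequent calls reuse them:

import torch

def gelu_mlp(x, w1, w2):
    # Two matmuls with a GELU in between: a typical fusion candidate.
    return torch.nn.functional.gelu(x @ w1) @ w2

# JIT-compile the function; optimized kernels are generated on the first call.
compiled_mlp = torch.compile(gelu_mlp)

x  = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
w1 = torch.randn(1024, 4096, device="cuda", dtype=torch.bfloat16)
w2 = torch.randn(4096, 1024, device="cuda", dtype=torch.bfloat16)

out = compiled_mlp(x, w1, w2)   # first call compiles, later calls reuse the kernels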

CUDA graphs record the flow of sequential operations, or kernels, happening on the GPU, and attempt to group them into bigger chunks of work to execute on the GPU. This grouping reduces data movement, synchronization, and GPU scheduling overhead by executing a single, much larger work unit rather than multiple smaller ones.
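
Here is a minimal PyTorch sketch of that idea, capturing a model's forward pass into a CUDA graph and replaying it as a single unit of work (illustrative only; vLLM manages its own capture logic):

import torch

model = torch.nn.Linear(1024, 1024).cuda().eval()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream so one-off initializations are not captured.
stream = torch.cuda.Stream()
stream.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(stream), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(stream)

# Capture the kernel sequence once...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# ...then refill the static input buffer and replay the whole graph at once.
static_input.copy_(torch.randn(8, 1024, device="cuda"))
graph.replay()
print(static_output.shape)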

Last but not least, we dynamically quantize activations to reduce the memory requirement incurred by the KV cache(s). The computations are done in half precision, bfloat16 in this case, and the outputs are stored in reduced precision (1 byte for float8 vs. 2 bytes for bfloat16), which allows us to store more elements in the KV cache, increasing the cache hit rate.
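
vLLM typically exposes this through its kv_cache_dtype engine option (e.g. "fp8"). Conceptually, it comes down to keeping the cached keys and values in a 1-byte float format, as in this tiny, purely illustrative PyTorch sketch (not vLLM's actual cache code; requires a recent PyTorch with float8 dtypes):

import torch

# A mock KV cache block: [num_heads, seq_len, head_dim], computed in bfloat16.
kv_bf16 = torch.randn(16, 4096, 128, dtype=torch.bfloat16)

# Store it as float8 (e4m3): same number of elements, half the bytes.
kv_fp8 = kv_bf16.to(torch.float8_e4m3fn)
print(kv_bf16.element_size(), "byte(s)/element ->", kv_fp8.element_size(), "byte(s)/element")

# Values are upcast back to bfloat16 right before the attention matmul.
kv_restored = kv_fp8.to(torch.bfloat16)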

There are many ways to continue pushing this and we are gearing up to work hand in hand with the community to improve it!

Benchmarks

Whisper Large V3 shows a nearly 8x improvement in RTFx, enabling much faster inference with no loss in transcription quality.

We evaluated the transcription quality and runtime efficiency of several Whisper-based models (Whisper Large V3, Whisper Large V3-Turbo, and Distil-Whisper Large V3.5) and compared them against their implementations in the Transformers library to assess both accuracy and decoding speed under identical conditions.

We computed Word Error Rate (WER) across 8 standard datasets from the Open ASR Leaderboard: AMI, GigaSpeech, LibriSpeech (Clean and Other), SPGISpeech, Tedlium, VoxPopuli, and Earnings22. These datasets span diverse domains and recording conditions, ensuring a robust evaluation of generalization and real-world transcription quality. WER measures transcription accuracy by calculating the percentage of words that are incorrectly predicted (via insertions, deletions, or substitutions); lower WER indicates better performance. All three Whisper variants maintain WER performance comparable to their Transformers baselines.

Figure: Word Error Rate comparison
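
As a quick illustration of the metric itself, here is one common way to compute WER in Python (using the jiwer package; not the leaderboard's exact evaluation harness):

import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# 2 substitutions out of 9 reference words -> WER of roughly 22.2%
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")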

To assess inference efficiency, we sampled from the rev16 long-form dataset, which contains audio segments over 45 minutes in length—representative of real transcription workloads such as meetings, podcasts, or interviews. We measured the Real-Time Factor (RTFx), defined as the ratio of audio duration to transcription time, and averaged it across samples. All models were evaluated in bfloat16 on a single L4 GPU, using consistent decoding settings (language, beam size, and batch size).
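
In code, the metric is simply the audio duration divided by the measured wall-clock time; a minimal sketch, where transcribe stands for whatever transcription call you are timing:

import time

def rtfx(audio_duration_s: float, transcribe) -> float:
    """Real-Time Factor: seconds of audio processed per second of wall-clock time."""
    start = time.perf_counter()
    transcribe()  # the transcription call being measured (placeholder)
    elapsed = time.perf_counter() - start
    return audio_duration_s / elapsed

# Example: a 45-minute recording transcribed in 20 seconds -> RTFx = 2700 / 20 = 135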

Figure: Real-Time Factor comparison

How to deploy

You can deploy your own ASR inference pipeline via Hugging Face Inference Endpoints. Endpoints lets anyone deploy AI models into production-ready environments by filling in just a few parameters. It also features the most complete fleet of AI hardware available on the market to suit your cost and performance needs, all directly from where the AI community is being built. To get started, simply choose the model you want to deploy:
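
If you prefer scripting over the UI, a deployment can also be created with the huggingface_hub client. A rough sketch is below; the accelerator, instance, and region values are illustrative, so check the Inference Endpoints catalog for the exact names available to your account:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "whisper-large-v3-fast",                 # endpoint name (illustrative)
    repository="openai/whisper-large-v3",
    framework="pytorch",
    task="automatic-speech-recognition",
    accelerator="gpu",
    vendor="aws",                            # cloud vendor (illustrative)
    region="us-east-1",                      # region (illustrative)
    instance_size="x1",                      # instance size (illustrative)
    instance_type="nvidia-l4",               # L4 GPU (illustrative)
    type="protected",
)

endpoint.wait()      # block until the endpoint is up
print(endpoint.url)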

Inference

Running inference against the deployed endpoint takes just a few lines of Python; you can also use the same structure in JavaScript or any other language you're comfortable with.

Here's a small snippet to test the deployed checkpoint quickly.

import requests

ENDPOINT_URL = "https://<your-hf-endpoint>.cloud/api/v1/audio/transcriptions"  # 🌐 replace with your URL endpoint
HF_TOKEN     = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx"                              # 🔑 replace with your HF token
AUDIO_FILE   = "sample.wav"                                                    # 🔊 path to your local audio file

headers = {"Authorization": f"Bearer {HF_TOKEN}"}

with open(AUDIO_FILE, "rb") as f:
    files = {"file": f.read()}

response = requests.post(ENDPOINT_URL, headers=headers, files=files)
response.raise_for_status()

print("Transcript:", response.json()["text"])

FastRTC Demo

With this blazing fast endpoint, it’s possible to build real-time transcription apps. Try out this example built with FastRTC. Simply speak into your microphone and see your speech transcribed in real time!

Spaces can easily be duplicated, so please feel free to duplicate away. All of the above is made available for community use on the Hugging Face Hub in our dedicated HF Endpoints organization. Open issues, suggest use cases, and contribute here: hfendpoints-images (Inference Endpoints Images) 🚀
