
Hi, I'm Freddy and I want to give a tour of FastRTC - the real-time communication library for Python.

Why is this important? In the last few months, we've seen many advances in real-time speech and vision models coming from closed-source models, open-source models, and API providers.

Despite these innovations, it's still difficult to build real-time AI applications that stream audio and video, especially in Python. This is because:

  • ML engineers may not have experience with the technologies needed to build real-time applications, such as WebRTC or WebSockets.
  • Implementing algorithms for voice detection and turn-taking is tricky!
  • Best practices are scattered across various sources and even code assistant tools like Cursor and Copilot struggle to write Python code that supports real-time audio/video applications. I learned that the hard way!

All this means that if you want to take advantage of the latest advances in AI, you have to spend a lot of time figuring out how to do real-time streaming. FastRTC solves this problem by automatically turning any Python function into a real-time audio and video stream over WebRTC or WebSockets with little additional code or overhead. Let's see how it works.

Let's start with the basics - echoing audio.

In FastRTC, you can wrap any generator function with ReplyOnPause and pass it to the Stream class.

This will create a WebRTC-powered web server that handles voice detection and turn-taking - you just worry about the logic for generating the response.

Each stream comes with a built-in WebRTC-powered Gradio UI that you can use for testing.

Simply call ui.launch(). Let's see it in action.

We can level up our application by having an LLM generate the response.

We'll import the SambaNova API as well as some FastRTC utils for doing speech-to-text and text-to-speech and then pipe them all together.

Importantly, you can use any LLM, speech-to-text, or text-to-speech model. Even an audio-to-audio model. Bring the tools you love and we'll just handle the real-time communication.

You can also place a phone call into the stream for FREE if you have a Hugging Face Token.

Finally, deployment is really easy too. You can stick with Gradio or mount the stream in a FastAPI app and build any application you want. By the way, video is supported too!

Thanks for watching! Please visit fastrtc.org to see the cookbook for all the demos shown here as well as the docs.