FastRTC: The Real-Time Communication Library for Python



Freddy Boulton, Abubakar Abid


In the past few months, many new real-time speech models have been released, and entire companies have been founded around both open and closed source models. To name a few milestones:

  • OpenAI and Google released their live multimodal APIs for ChatGPT and Gemini. OpenAI even went so far as to release a 1-800-ChatGPT phone number!
  • Kyutai released Moshi, a completely open-source audio-to-audio LLM. Alibaba released Qwen2-Audio and Fixie.ai released Ultravox – two open-source LLMs that natively understand audio.
  • ElevenLabs raised $180m in their Series C.

Despite the explosion on the model and funding side, it’s still difficult to build real-time AI applications that stream audio and video, especially in Python.

  • ML engineers may not have experience with the technologies needed to build real-time applications, such as WebRTC.
  • Even code assistant tools like Cursor and Copilot struggle to write Python code that supports real-time audio/video applications. I know from experience!

That is why we’re excited to announce FastRTC, the real-time communication library for Python. The library is designed to make it super easy to build real-time audio and video AI applications entirely in Python!

In this blog post, we’ll walk through the basics of FastRTC by building real-time audio applications. By the end, you’ll understand the core features of FastRTC:

  • 🗣️ Automatic Voice Detection and Turn Taking built-in, so you only need to worry about the logic for responding to the user.
  • 💻 Automatic UI – Built-in WebRTC-enabled Gradio UI for testing (or deploying to production!).
  • 📞 Call via Phone – Use fastphone() to get a FREE phone number to call into your audio stream (HF Token required. Increased limits for PRO accounts).
  • ⚡️ WebRTC and Websocket support.
  • 💪 Customizable – You can mount the stream on any FastAPI app so you can serve a custom UI or deploy beyond Gradio.
  • 🧰 Lots of utilities for text-to-speech, speech-to-text, and stop word detection to get you started.

Let’s dive in.



Getting Started

We’ll start by building the “hello world” of real-time audio: echoing back what the user says. In FastRTC, this is as simple as:

from fastrtc import Stream, ReplyOnPause
import numpy as np

def echo(audio: tuple[int, np.ndarray]) -> tuple[int, np.ndarray]:
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

Let’s break it down:

  • ReplyOnPause will handle the voice detection and turn taking for you. You only need to worry about the logic for responding to the user. Any generator that returns a tuple of audio (represented as (sample_rate, audio_data)) will work.
  • The Stream class will build a Gradio UI for you to quickly test out your stream. Once you have finished prototyping, you can deploy your Stream as a production-ready FastAPI app in a single line of code – stream.mount(app), where app is a FastAPI app (see the sketch after this list).
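
To make that mounting step concrete, here is a minimal sketch, assuming the echo handler and stream object from the snippet above (the module name in the run command is just a placeholder):

from fastapi import FastAPI

from fastrtc import ReplyOnPause, Stream

def echo(audio):
    yield audio

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")

# Serve the stream's real-time endpoints from your own FastAPI app
# instead of launching the Gradio UI.
app = FastAPI()
stream.mount(app)

# Run with: uvicorn my_app:app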

Here it is in action:



Leveling-Up: LLM Voice Chat

The next level is to use an LLM to respond to the user. FastRTC comes with built-in speech-to-text and text-to-speech capabilities, so working with LLMs is really easy. Let’s update our echo function accordingly:

import os

from fastrtc import (ReplyOnPause, Stream, get_stt_model, get_tts_model)
from openai import OpenAI

sambanova_client = OpenAI(
    api_key=os.getenv("SAMBANOVA_API_KEY"), base_url="https://api.sambanova.ai/v1"
)
stt_model = get_stt_model()
tts_model = get_tts_model()

def echo(audio):
    prompt = stt_model.stt(audio)
    response = sambanova_client.chat.completions.create(
        model="Meta-Llama-3.2-3B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    prompt = response.choices[0].message.content
    for audio_chunk in tts_model.stream_tts_sync(prompt):
        yield audio_chunk

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.ui.launch()

We’re using the SambaNova API because it’s fast. get_stt_model() will fetch Moonshine Base and get_tts_model() will fetch Kokoro from the Hub, both of which have been further optimized for on-device CPU inference. But you can use any LLM/text-to-speech/speech-to-text API, or even a speech-to-speech model. Bring the tools you love – FastRTC just handles the real-time communication layer.
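
The built-in utilities can also be used on their own. Here is a minimal sketch, assuming a mono numpy array and its sample rate as input (the silence placeholder and variable names are just for illustration):

import numpy as np
from fastrtc import get_stt_model, get_tts_model

stt_model = get_stt_model()  # Moonshine Base
tts_model = get_tts_model()  # Kokoro

# Placeholder input: one second of silence at 16 kHz.
sample_rate = 16000
audio_array = np.zeros(sample_rate, dtype=np.float32)

# Speech-to-text takes the same (sample_rate, audio_data) tuple the handler receives.
text = stt_model.stt((sample_rate, audio_array))

# Text-to-speech streams back audio chunks, just like in the echo handler above.
for audio_chunk in tts_model.stream_tts_sync("Hello from FastRTC"):
    pass  # each chunk could be yielded from a handler or written to a file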



Bonus: Call via Phone

If, instead of stream.ui.launch(), you call stream.fastphone(), you’ll get a free phone number to call into your stream. Note: a Hugging Face token is required, and PRO accounts get increased limits.
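
Concretely, the final lines of the previous example become (a minimal sketch; it assumes your Hugging Face token is available in your environment):

stream = Stream(ReplyOnPause(echo), modality="audio", mode="send-receive")
stream.fastphone()  # requests a free phone number instead of launching the Gradio UI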

You’ll see something like this in your terminal:

INFO:	  Your FastPhone is now live! Call +1 877-713-4471 and use code 530574 to connect to your stream.
INFO:	  You have 30:00 minutes remaining in your quota (Resetting on 2025-03-23)

You can then call the number and it will connect you to your stream!



Next Steps

  • Read the docs to learn more about the basics of FastRTC.
  • The best way to start building is by checking out the cookbook. Learn how to integrate with popular LLM providers (including OpenAI and Gemini’s real-time APIs), integrate your stream with a FastAPI app and do a custom deployment, return additional data from your handler, do video processing, and more!
  • ⭐️ Star the repo and file bug reports and feature requests!
  • Follow the FastRTC Org on HuggingFace for updates and check out deployed examples!

Thanks for trying out FastRTC!


