Speech-to-Speech (S2S) is an exciting new project from Hugging Face that combines several advanced models to create a seamless, almost magical experience: you speak, and the system responds with a synthesized voice.
The project implements a cascaded pipeline leveraging models available through the Transformers library on the Hugging Face Hub. The pipeline consists of the following components:
- Voice Activity Detection (VAD)
- Speech to Text (STT)
- Language Model (LM)
- Text to Speech (TTS)
What’s more, S2S has multi-language support! It currently supports English, French, Spanish, Chinese, Japanese, and Korean. You can run the pipeline in single-language mode or use the auto flag for automatic language detection. Check out the repo here for more details.
> 👩🏽‍💻: That’s all amazing, but how do I run S2S?
> 🤗: Great question!
Running Speech-to-Speech requires significant computational resources. Even on a high-end laptop you may run into latency issues, particularly when using the most advanced models. While a powerful GPU can mitigate these problems, not everyone has the means (or desire!) to set up their own hardware.
This is where Hugging Face’s Inference Endpoints (IE) come into play. Inference Endpoints let you rent a virtual machine equipped with a GPU (or other hardware you may need) and pay only for the time your system is running, providing an ideal solution for deploying performance-heavy applications like Speech-to-Speech.
In this blog post, we’ll guide you step by step through deploying Speech-to-Speech to a Hugging Face Inference Endpoint. Here’s what we’ll cover:
- Understanding Inference Endpoints, with a quick overview of the different ways to set up an IE, including a custom container image (which is what we’ll need for S2S)
- Building a custom Docker image for S2S
- Deploying the custom image to IE and having some fun with S2S!
Inference Endpoints
Inference Endpoints provide a scalable and efficient way to deploy machine learning models. They let you serve models with minimal setup, leveraging a variety of powerful hardware, and are ideal for deploying applications that require high performance and reliability without the need to manage the underlying infrastructure.
Here are a few key features; be sure to check out the documentation for more:
- Simplicity – You can be up and running in minutes thanks to IE’s direct support for models on the Hugging Face Hub.
- Scalability – You don’t have to worry about scale, since IE scales automatically, including down to zero, to handle varying loads and save costs.
- Customization – You can customize the setup of your IE to handle new tasks. More on this below.
Inference Endpoints support all of the Transformers and Sentence-Transformers tasks, but can also support custom tasks. These are the IE setup options:
- Pre-built Models: Quickly deploy models directly from the Hugging Face hub.
- Custom Handlers: Define custom inference logic for more complex pipelines.
- Custom Docker Images: Use your own Docker images to encapsulate all dependencies and custom code.
For simpler models, options 1 and 2 are ideal and make deploying with Inference Endpoints super straightforward. However, for a complex pipeline like S2S, you’ll need the flexibility of option 3: deploying our IE with a custom Docker image.
This approach not only provides more flexibility but also improves performance, by optimizing the build process and gathering the necessary data in advance. If you’re dealing with complex model pipelines or want to optimize your application deployment, this guide will offer useful insights.
Deploying Speech-to-Speech on Inference Endpoints
Let’s get into it!
Building the custom Docker image
To create a custom Docker image, we started by cloning Hugging Face’s default Docker image repository. This serves as a great starting point for deploying machine learning models for inference tasks.
git clone https://github.com/huggingface/huggingface-inference-toolkit
Why Clone the Default Repository?
- Solid Foundation: The repository provides a pre-optimized base image designed specifically for inference workloads, which gives a reliable starting point.
- Compatibility: Because the image is built to align with Hugging Face’s deployment environment, this ensures smooth integration when you deploy your own custom image.
- Ease of Customization: The repository offers a clean and structured environment, making it easy to customize the image for the specific requirements of your application.
You can check out all of our changes here.
Customizing the Docker Image for the Speech-to-Speech Application
With the repository cloned, the next step was tailoring the image to support our Speech-to-Speech pipeline.
- Adding the Speech-to-Speech Project
To integrate the project easily, we added the speech-to-speech codebase and any required datasets as submodules. This approach offers better version control, ensuring the exact version of the code and data is always available when the Docker image is built.
By including the data directly within the Docker container, we avoid having to download it each time the endpoint is instantiated, which significantly reduces startup time and makes the system reproducible. The data is stored in a Hugging Face repository, which provides easy tracking and versioning.
git submodule add https://github.com/huggingface/speech-to-speech.git
git submodule add https://huggingface.co/andito/fast-unidic
- Optimizing the Docker Image
Next, we modified the Dockerfile to suit our needs:
- Streamlining the Image: We removed packages and dependencies that weren’t relevant to our use case. This reduces the image size and cuts down on unnecessary overhead during inference.
- Installing Requirements: We moved the installation of `requirements.txt` from the entry point to the Dockerfile itself. This way, the dependencies are installed when the Docker image is built, speeding up deployment since these packages won’t need to be installed at runtime (a rough sketch is shown below).
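For illustration, the relevant Dockerfile changes could look roughly like this. The paths below are assumptions for the sketch, not the exact contents of our Dockerfile; see the repository linked above for the real one:

```dockerfile
# Sketch only: copy the speech-to-speech submodule into the image so nothing
# needs to be downloaded when the endpoint starts up.
COPY speech-to-speech /app/speech-to-speech

# Install the pipeline's dependencies at build time rather than in the
# entrypoint, so containers come up without re-installing packages.
RUN pip install --no-cache-dir -r /app/speech-to-speech/requirements.txt
```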
- Deploying the Custom Image
Once the modifications were in place, we built and pushed the custom image to Docker Hub:
DOCKER_DEFAULT_PLATFORM="linux/amd64" docker build -t speech-to-speech -f dockerfiles/pytorch/Dockerfile .
docker tag speech-to-speech andito/speech-to-speech:latest
docker push andito/speech-to-speech:latest
With the Docker image built and pushed, it’s ready to be used in a Hugging Face Inference Endpoint. By using this pre-built image, the endpoint can launch faster and run more efficiently, as all dependencies and data are pre-packaged within the image.
Setting up an Inference Endpoint
Using a custom Docker image just requires a slightly different configuration; feel free to check out the documentation. We’ll walk through how to do this in both the GUI and the API.
Pre-Steps
- Login: https://huggingface.co/login
- Request access to meta-llama/Meta-Llama-3.1-8B-Instruct
- Create a Fine-Grained Token: https://huggingface.co/settings/tokens/new?tokenType=fineGrained
- Select access to gated repos
- If you are using the API, make sure to select the permission to Manage Inference Endpoints
Inference Endpoints GUI
1. Navigate to https://ui.endpoints.huggingface.co/new
2. Fill in the relevant information
   - Model Repository – `andito/s2s`
   - Model Name – Feel free to rename it if you don’t like the generated name
     - e.g. `speech-to-speech-demo`
     - Keep it lower-case and short
   - Select your preferred Cloud and Hardware – We used `AWS` `GPU` `L4`
     - It’s only $0.80 an hour and is large enough to handle the models
   - Advanced Configuration (click the expansion arrow ➤)
     - Container Type – `Custom`
     - Container Port – `80`
     - Container URL – `andito/speech-to-speech:latest`
     - Secrets – `HF_TOKEN` | <your token>
3. Click `Create Endpoint`
The Model Repository doesn’t actually matter, since the models are specified and downloaded during container creation; but Inference Endpoints requires a model, so feel free to pick a slim one of your choice.
You need to specify `HF_TOKEN` because we need to download gated models during the container creation stage. This won’t be necessary if you use models that aren’t gated or private.
The current huggingface-inference-toolkit entrypoint uses port 5000 by default, but the Inference Endpoint expects port 80. You should match this in the Container Port. We already set it in our Dockerfile, but beware if you’re building your own from scratch!
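For example, pinning the port in the Dockerfile could look roughly like this. This is a sketch: the toolkit’s actual entrypoint may be structured differently, but the idea is simply to start the Starlette app’s server on port 80:

```dockerfile
# Sketch: expose port 80 and start the webservice on it, since Inference
# Endpoints routes traffic to port 80 rather than the toolkit's default 5000.
EXPOSE 80
ENTRYPOINT ["uvicorn", "webservice_starlette:app", "--host", "0.0.0.0", "--port", "80"]
```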
Inference Endpoints API
Here we’ll walk through the steps for creating the endpoint with the API. Just use this code in your Python environment of choice.
Make sure to use huggingface_hub version 0.25.1 or greater.
pip install "huggingface_hub>=0.25.1"
Use a token that can write an endpoint (Write or Fine-Grained):
from huggingface_hub import login
login()
from huggingface_hub import create_inference_endpoint, get_token
endpoint = create_inference_endpoint(
    "speech-to-speech-demo",
    repository="andito/s2s",
    framework="custom",
    task="custom",
    type="protected",
    vendor="aws",
    accelerator="gpu",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-l4",
    custom_image={
        "health_route": "/health",
        "url": "andito/speech-to-speech:latest",
        "port": 80,
    },
    secrets={"HF_TOKEN": get_token()},
)
endpoint.wait()
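Once `endpoint.wait()` returns, the endpoint is up and you can inspect or manage it from the same `huggingface_hub` client. For example (a small sketch; the endpoint name must match the one you created above):

```python
from huggingface_hub import get_inference_endpoint

# Look the endpoint up by name and check its status and URL.
endpoint = get_inference_endpoint("speech-to-speech-demo")
print(endpoint.status)  # e.g. "running"
print(endpoint.url)     # the base URL the client will connect to

# Pause it when you're done experimenting to stop paying for the hardware;
# resume() brings it back, delete() removes it entirely.
endpoint.pause()
```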
Overview
Main Components
- Speech To Speech – This is a Hugging Face library; we put some inference-endpoint-specific files in the `inference-endpoint` branch, which will be merged to `main` soon.
- andito/s2s, or any other repository. This isn’t really needed for us, since we get the models in the container creation stage, but Inference Endpoints requires a model, so we pass a slim repository.
- andimarafioti/speech-to-speech-toolkit.
Building the webserver
To use the endpoint, we need to build a small webservice. The code for it lives in s2s_handler.py in the speech_to_speech library, which we use for the client, and webservice_starlette.py in the speech_to_speech_inference_toolkit, which we used to build the Docker image. Normally, you would only need a custom handler for an endpoint, but since we want really low latency, we also built the webservice to support WebSocket connections instead of regular requests. This sounds intimidating at first, but the webservice is just 32 lines of code!
This code runs prepare_handler on startup, which initializes all the models and warms them up. Then, each message is processed by inference_handler.process_streaming_data.
This method simply receives the audio data from the client, chunks it into small parts for the VAD, and submits it to a queue for processing. It then checks the output processing queue (the spoken response from the model!) and returns it if there’s something there. All of the internal processing is handled by Hugging Face’s speech_to_speech library.
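To make that concrete, here is a minimal sketch of what such a WebSocket webservice could look like. This is not the actual contents of webservice_starlette.py; the route paths, the pipeline loader, and the handler wiring are assumptions for illustration:

```python
# Minimal Starlette WebSocket webservice (illustrative sketch, not the real file).
from starlette.applications import Starlette
from starlette.responses import PlainTextResponse
from starlette.routing import Route, WebSocketRoute
from starlette.websockets import WebSocket

inference_handler = None  # initialized once at startup


async def prepare_handler():
    # Load and warm up the VAD / STT / LM / TTS models once, so the first
    # request doesn't pay the initialization cost.
    global inference_handler
    inference_handler = load_speech_to_speech_pipeline()  # assumed helper


async def health(request):
    # Matches the "health_route" we configured for the custom image.
    return PlainTextResponse("OK")


async def ws_endpoint(websocket: WebSocket):
    await websocket.accept()
    while True:
        # Receive raw audio bytes from the client, hand them to the pipeline,
        # and stream back any synthesized audio that is ready.
        data = await websocket.receive_bytes()
        output = inference_handler.process_streaming_data(data)
        if output is not None:
            await websocket.send_bytes(output)


app = Starlette(
    on_startup=[prepare_handler],
    routes=[Route("/health", health), WebSocketRoute("/ws", ws_endpoint)],
)
```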
Custom handler, custom client
The webservice receives and returns audio. But there is still a big missing piece: how do we record and play back the audio? For that, we created a client that connects to the service. The easiest way to analyze it is to split it into the connection to the webservice and the recording/playback of audio.
Initializing the webservice client requires setting a header with our Hugging Face token for all messages. When initializing the client, we also set what we want to do on the common events (open, close, error, message). This determines what our client does when the server sends it messages.
We can see that the reactions to the messages are uncomplicated, with on_message being the only method with some complexity. This method detects when the server has finished responding and goes back to ‘listening’ to the user; otherwise, it puts the data from the server into the playback queue.
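As a rough sketch, such a client could be wired up with the websocket-client package roughly as follows; the endpoint path, the end-of-response convention, and the helper names are assumptions, not the actual client code:

```python
# Illustrative client skeleton; the real client lives in s2s_handler.py.
import queue

import websocket
from huggingface_hub import get_token

playback_queue = queue.Queue()


def start_listening():
    # Placeholder: in the real client this re-enables recording from the user.
    pass


def on_open(ws):
    print("Connected to the endpoint")


def on_message(ws, message):
    # Assumed convention: an empty message marks the end of the server's
    # response, so we go back to listening; otherwise queue the audio bytes.
    if not message:
        start_listening()
    else:
        playback_queue.put(message)


def on_error(ws, error):
    print(f"WebSocket error: {error}")


def on_close(ws, close_status_code, close_msg):
    print("Connection closed")


ws = websocket.WebSocketApp(
    "wss://<your-endpoint-url>/ws",                     # use endpoint.url from earlier
    header={"Authorization": f"Bearer {get_token()}"},  # authenticate every message
    on_open=on_open,
    on_message=on_message,
    on_error=on_error,
    on_close=on_close,
)
# ws.run_forever() blocks; a client would typically run it in a background thread.
```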
The client’s audio section has 4 tasks:
- Record the audio
- Submit the audio recordings
- Receive the audio responses from the server
- Playback the audio responses
The audio is recorded in the audio_input_callback method, which simply submits all chunks to a queue. It is then sent to the server with the send_audio method; here, if there is no audio to send, we still submit an empty array in order to receive a response from the server. The responses from the server are handled by the on_message method we saw earlier in the blog. The playback of the audio responses is then handled by the audio_output_callback method. There, we only need to make sure the audio is within the range we expect (we don’t want to destroy someone’s eardrums because of a faulty packet!) and that the size of the output array is what the playback library expects.
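A condensed sketch of that audio loop is shown below, using sounddevice and numpy; the sample rate, chunk size, and the way the WebSocket connection is passed in are assumptions rather than the client’s actual values:

```python
# Illustrative audio I/O loop; the real logic lives in the client's
# audio_input_callback, send_audio, and audio_output_callback methods.
import queue

import numpy as np
import sounddevice as sd
import websocket

SAMPLE_RATE = 16000  # assumed sample rate
CHUNK = 512          # assumed chunk size (frames per block)

send_queue = queue.Queue()
playback_queue = queue.Queue()


def audio_input_callback(indata, frames, time_info, status):
    # Task 1: record audio and queue every chunk for sending.
    send_queue.put(indata.copy().tobytes())


def send_audio(ws):
    # Task 2: ship recorded chunks to the server over the WebSocket connection;
    # if there is nothing to send, still send an empty payload so the server
    # keeps responding.
    while True:
        try:
            data = send_queue.get(timeout=0.05)
        except queue.Empty:
            data = b""
        ws.send(data, websocket.ABNF.OPCODE_BINARY)


def audio_output_callback(outdata, frames, time_info, status):
    # Tasks 3 & 4: play back whatever the server sent (queued by on_message),
    # or silence if nothing is queued yet.
    try:
        chunk = np.frombuffer(playback_queue.get_nowait(), dtype=np.int16)
    except queue.Empty:
        chunk = np.zeros(frames, dtype=np.int16)
    # Keep the audio in the expected range and match the block size the
    # playback library expects.
    chunk = np.clip(chunk, -32768, 32767)
    chunk = np.resize(chunk, frames)
    outdata[:] = chunk.reshape(-1, 1)


input_stream = sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                              blocksize=CHUNK, callback=audio_input_callback)
output_stream = sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16",
                                blocksize=CHUNK, callback=audio_output_callback)
```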
Conclusion
In this post, we walked through the steps of deploying the Speech-to-Speech (S2S) pipeline on Hugging Face Inference Endpoints using a custom Docker image. We built a custom container to handle the complexities of the S2S pipeline and showed how to configure it for scalable, efficient deployment. Hugging Face Inference Endpoints make it easier to bring performance-heavy applications like Speech-to-Speech to life, without the hassle of managing hardware or infrastructure.
If you’re interested in trying it out or have any questions, feel free to explore the following resources:
Have issues or questions? Open a discussion on the relevant GitHub repository, and we’ll be happy to help!

