MusicGen is a powerful music generation model that takes in a text prompt and an optional melody to output music. This blog post will guide you through generating music with MusicGen using Inference Endpoints.
Inference Endpoints allow us to write custom inference functions called custom handlers. These are particularly useful when a model is not supported out-of-the-box by the transformers high-level abstraction pipeline.
transformers pipelines offer powerful abstractions to run inference with transformers-based models. Inference Endpoints leverage the pipeline API to easily deploy models with only a few clicks. However, Inference Endpoints can also be used to deploy models that don't have a pipeline, or even non-transformer models! This is achieved using a custom inference function that we call a custom handler.
Let’s demonstrate this process using MusicGen as an example. To implement a custom handler function for MusicGen and deploy it, we will need to:
- Duplicate the MusicGen repository we want to serve,
- Write a custom handler in handler.py and any dependencies in requirements.txt and add them to the duplicated repository,
- Create an Inference Endpoint for that repository.
Or simply use the end result and deploy our custom MusicGen model repo, where we have already followed the steps above 🙂
Let’s go!
First, we will duplicate the facebook/musicgen-large repository to our own profile using the repository duplicator.
Then, we will add handler.py and requirements.txt to the duplicated repository.
First, let’s take a look at how to run inference with MusicGen.
from transformers import AutoProcessor, MusicgenForConditionalGeneration
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
inputs = processor(
text=["80s pop track with bassy drums and synth"],
padding=True,
return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
Let’s hear what it sounds like:
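If you are following along locally and don't have an embedded player, one option is to write the waveform to a .wav file. This is a minimal sketch, assuming the standard MusicGen output shape of (batch, channels, samples) and reading the sampling rate from the model config; the filename is arbitrary:

import scipy.io.wavfile

# MusicGen's audio encoder reports its sampling rate (32 kHz) in the model config
sampling_rate = model.config.audio_encoder.sampling_rate

# audio_values has shape (batch_size, num_channels, sequence_length)
scipy.io.wavfile.write("musicgen_sample.wav", rate=sampling_rate, data=audio_values[0, 0].cpu().numpy())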
Optionally, you can also condition the output with an audio snippet, i.e. generate a complementary snippet that combines the text-generated audio with an input audio.
from transformers import AutoProcessor, MusicgenForConditionalGeneration
from datasets import load_dataset
processor = AutoProcessor.from_pretrained("facebook/musicgen-large")
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-large")
dataset = load_dataset("sanchit-gandhi/gtzan", split="train", streaming=True)
sample = next(iter(dataset))["audio"]
sample["array"] = sample["array"][: len(sample["array"]) // 2]
inputs = processor(
audio=sample["array"],
sampling_rate=sample["sampling_rate"],
text=["80s blues track with groovy saxophone"],
padding=True,
return_tensors="pt",
)
audio_values = model.generate(**inputs, do_sample=True, guidance_scale=3, max_new_tokens=256)
Let’s give it a listen:
In both cases, the model.generate method produces the audio and follows the same principles as text generation. You can read more about it in our how to generate blog post.
Alright! With the basic usage outlined above, let’s deploy MusicGen for fun and profit!
First, we’ll define a custom handler in handler.py. We can use the Inference Endpoints template and override the __init__ and __call__ methods with our custom inference code. __init__ will initialize the model and the processor, and __call__ will take the data and return the generated music. You can find the modified EndpointHandler class below. 👇
from typing import Dict, List, Any
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torch
class EndpointHandler:
    def __init__(self, path=""):
        # Load the processor and the model from the repository the endpoint serves
        self.processor = AutoProcessor.from_pretrained(path)
        self.model = MusicgenForConditionalGeneration.from_pretrained(
            path, torch_dtype=torch.float16
        ).to("cuda")

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        """
        Args:
            data (:dict:):
                The payload with the text prompt and generation parameters.
        """
        # The text prompt arrives under "inputs", optional generation kwargs under "parameters"
        inputs = data.pop("inputs", data)
        parameters = data.pop("parameters", None)

        # Tokenize the prompt and move the tensors to the GPU
        inputs = self.processor(
            text=[inputs],
            padding=True,
            return_tensors="pt",
        ).to("cuda")

        # Generate audio, forwarding any user-provided generation parameters
        if parameters is not None:
            with torch.autocast("cuda"):
                outputs = self.model.generate(**inputs, **parameters)
        else:
            with torch.autocast("cuda"):
                outputs = self.model.generate(**inputs)

        # Return the waveform as a nested list so it can be serialized to JSON
        prediction = outputs[0].cpu().numpy().tolist()
        return [{"generated_audio": prediction}]
To keep things simple, in this example we are only generating audio from text, and not conditioning it with a melody.
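Before deploying, we can sanity-check the handler locally. This is a minimal sketch, assuming you have cloned your duplicated repository to a local folder (the path below is a placeholder) and have a CUDA GPU available:

# Hypothetical local smoke test: run the handler directly before deploying
from handler import EndpointHandler

handler = EndpointHandler(path="./musicgen-large")  # placeholder: local clone of your duplicated repo
payload = {
    "inputs": "80s pop track with bassy drums and synth",
    "parameters": {"do_sample": True, "guidance_scale": 3, "max_new_tokens": 256},
}
output = handler(payload)
print(len(output[0]["generated_audio"][0]))  # number of audio samples in the first channel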
Next, we will create a requirements.txt file containing all the dependencies we need to run our inference code:
transformers==4.31.0
accelerate>=0.20.3
Uploading these two files to our repository will suffice to serve the model.
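If you prefer to upload the files programmatically instead of through the web interface, the huggingface_hub library can do it. A minimal sketch, assuming the two files exist locally and using your-username/musicgen-large as a placeholder for your duplicated repository identifier:

from huggingface_hub import HfApi

api = HfApi()  # uses the token stored by `huggingface-cli login` by default
for filename in ["handler.py", "requirements.txt"]:
    api.upload_file(
        path_or_fileobj=filename,
        path_in_repo=filename,
        repo_id="your-username/musicgen-large",  # placeholder: your duplicated repository
        repo_type="model",
    )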
We can now create the Inference Endpoint. Head to the Inference Endpoints page and click Deploy your first model. In the “Model repository” field, enter the identifier of your duplicated repository. Then select the hardware you need and create the endpoint. Any instance with a minimum of 16 GB of RAM should work for musicgen-large.
After creating the endpoint, it will be automatically launched and ready to receive requests.
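As an alternative to the web UI, recent versions of huggingface_hub also expose a create_inference_endpoint helper. The sketch below is an assumption-heavy illustration: the instance, region, vendor, and task values are hypothetical and vary by account and availability, so check what the UI offers you before copying them.

from huggingface_hub import create_inference_endpoint

# Hypothetical values: pick the vendor/region/instance combination your account offers
endpoint = create_inference_endpoint(
    "musicgen-large-demo",
    repository="your-username/musicgen-large",  # placeholder: your duplicated repository
    framework="pytorch",
    task="custom",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    instance_size="x1",
    instance_type="nvidia-a10g",
    type="protected",
)
endpoint.wait()  # block until the endpoint is up and running
print(endpoint.url)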
We can query the endpoint with the snippet below.
curl URL_OF_ENDPOINT \
-X POST \
-d '{"inputs":"happy folk song, cheerful and lively"}' \
-H "Authorization: {YOUR_TOKEN_HERE}" \
-H "Content-Type: application/json"
We can see the following waveform sequence as output.
[{"generated_audio":[[-0.024490159,-0.03154691,-0.0079551935,-0.003828604, ...]]}]
Here’s how it sounds:
You can also hit the endpoint with the huggingface-hub Python library’s InferenceClient class.
from huggingface_hub import InferenceClient
import json

client = InferenceClient(model=URL_OF_ENDPOINT)
response = client.post(json={"inputs": "an alt rock song"})
output = json.loads(response)[0]["generated_audio"]
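Since our handler forwards anything under "parameters" to model.generate, we can also control generation from the client side. For example (the parameter values here are simply the ones used earlier in this post):

response = client.post(
    json={
        "inputs": "an alt rock song",
        "parameters": {"do_sample": True, "guidance_scale": 3, "max_new_tokens": 256},
    }
)
output = json.loads(response)[0]["generated_audio"]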
You can convert the generated sequence to audio however you like. You can use scipy in Python to write it to a .wav file.
import scipy.io.wavfile
import numpy as np

# MusicGen outputs audio at a 32 kHz sampling rate
scipy.io.wavfile.write("musicgen_out.wav", rate=32000, data=np.array(output[0]))
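If you are working in a notebook, you can also listen to it directly; a small sketch using IPython:

from IPython.display import Audio

# output[0] is the first (and only) channel of the generated waveform
Audio(np.array(output[0]), rate=32000)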
And voila!
Play with the demo below to try the endpoint out.
Conclusion
In this blog post, we have shown how to deploy MusicGen using Inference Endpoints with a custom inference handler. The same technique can be used for any other model in the Hub that doesn't have an associated pipeline. All you have to do is override the EndpointHandler class in handler.py, and add requirements.txt to reflect your project's dependencies.



