Hugging Face provides a Hub platform that allows you to easily upload, share, and deploy your models. It saves developers the time and computational resources required to train models from scratch. However, deploying models in a real-world production environment or in a cloud-native way can still present challenges.
This is where BentoML comes into the picture. BentoML is an open-source platform for machine learning model serving and deployment. It is a unified framework for building, shipping, and scaling production-ready AI applications incorporating traditional, pre-trained, and generative models as well as Large Language Models. Here is how you use the BentoML framework from a high-level perspective (a minimal sketch follows the list):
- Define a model: Before you can use BentoML, you need a machine learning model (or multiple models). This model can be trained using a machine learning library such as TensorFlow or PyTorch.
- Save the model: Once you have a trained model, save it to the BentoML local Model Store, which is used for managing all your trained models locally as well as accessing them for serving.
- Create a BentoML Service: You create a `service.py` file to wrap the model and define the serving logic. It specifies Runners for models to run model inference at scale and exposes APIs to define how to process inputs and outputs.
- Build a Bento: By creating a configuration YAML file, you package all the models and the Service into a Bento, a deployable artifact containing all the code and dependencies.
- Deploy the Bento: Once the Bento is ready, you can containerize it to create a Docker image and run it on Kubernetes. Alternatively, deploy the Bento directly to Yatai, an open-source, end-to-end solution for automating and running machine learning deployments on Kubernetes at scale.
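To make the first three steps concrete, here is a minimal sketch using a hypothetical scikit-learn classifier (the `iris_clf` model is an illustrative name, not part of this project):

```python
# save_model.py -- steps 1 and 2: train a model and save it to the local Model Store
import bentoml
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Registers the model in the local BentoML Model Store under the name "iris_clf"
bentoml.sklearn.save_model("iris_clf", model)
```

```python
# service.py -- step 3: wrap the saved model in a Runner and expose an API
import bentoml
from bentoml.io import NumpyNdarray

iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()
svc = bentoml.Service("iris_classifier", runners=[iris_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_array):
    # Inference runs inside the Runner, which can be scaled independently of the API server
    return iris_runner.predict.run(input_array)
```

With these two files in place, `bentoml build` (driven by a `bentofile.yaml`) produces the Bento, and `bentoml containerize` turns it into a Docker image.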
In this blog post, we will demonstrate how to integrate DeepFloyd IF with BentoML by following the above workflow.
Table of contents
- A brief introduction to DeepFloyd IF
- Preparing the environment
- Downloading the model to the BentoML Model Store
- Starting a BentoML Service
- Testing the server
- Building and serving a Bento
- What's next?
A brief introduction to DeepFloyd IF
DeepFloyd IF is a state-of-the-art, open-source text-to-image model. It stands apart from latent diffusion models like Stable Diffusion due to its distinct operational strategy and architecture.
DeepFloyd IF delivers a high degree of photorealism and sophisticated language understanding. Unlike Stable Diffusion, DeepFloyd IF works directly in pixel space, leveraging a modular structure that encompasses a frozen text encoder and three cascaded pixel diffusion modules. Each module plays a unique role in the process: Stage 1 is responsible for the creation of a base 64×64 px image, which is then progressively upscaled to 1024×1024 px across Stage 2 and Stage 3. Another critical aspect of DeepFloyd IF's uniqueness is its integration of a Large Language Model (T5-XXL-1.1) to encode prompts, which offers superior understanding of complex prompts. For more information, see this Stability AI blog post about DeepFloyd IF.
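To make the cascade concrete, the sketch below runs the three stages directly with the diffusers library. It roughly follows the diffusers documentation; the model IDs and arguments here are assumptions and may differ from the ones this project downloads:

```python
# Standalone sketch of the DeepFloyd IF cascade using diffusers (not this project's code)
import torch
from diffusers import DiffusionPipeline

# Stage 1: generate a base 64x64 image directly in pixel space
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
# Stage 2: upscale to 256x256, reusing Stage 1's text encoder outputs
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
# Stage 3: a Stable Diffusion x4 upscaler produces the final 1024x1024 image
stage_3 = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
)
for pipe in (stage_1, stage_2, stage_3):
    pipe.enable_model_cpu_offload()

prompt = "a photo of a corgi wearing sunglasses"
# The frozen T5-XXL text encoder turns the prompt into embeddings shared by Stages 1 and 2
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image,
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    output_type="pt",
).images
stage_3(prompt=prompt, image=image).images[0].save("corgi.png")
```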
To make sure your DeepFloyd IF application runs with high performance in production, you may want to allocate and manage your resources properly. In this respect, BentoML allows you to scale the Runners independently for each Stage. For example, you can use more Pods for your Stage 1 Runners or allocate more powerful GPU servers to them.
Preparing the environment
This GitHub repository stores all necessary files for this project. To run this project locally, make sure you have the following:
- Python 3.8+
- `pip` installed
- At least 2x16GB VRAM GPUs or 1x40GB VRAM GPU. For this project, we used a machine of type `n1-standard-16` from Google Cloud plus 64 GB of RAM and 2 NVIDIA T4 GPUs. Note that while it is possible to run IF on a single T4, it is not recommended for production-grade serving.
Once the prerequisites are met, clone the project repository to your local machine and navigate to the target directory.
git clone https://github.com/bentoml/IF-multi-GPUs-demo.git
cd IF-multi-GPUs-demo
Before building the application, let's briefly explore the key files in this directory:
- `import_models.py`: Defines the models for each stage of the `IFPipeline`. You use this file to download all the models to your local machine so that you can package them into a single Bento.
- `requirements.txt`: Defines all the packages and dependencies required for this project.
- `service.py`: Defines a BentoML Service, which contains three Runners created using the `to_runner` method and exposes an API for generating images. The API takes a JSON object as input (i.e. prompts and negative prompts) and returns an image as output by using a sequence of models.
- `start-server.py`: Starts a BentoML HTTP server through the Service defined in `service.py` and creates a Gradio web interface for users to enter prompts to generate images.
- `bentofile.yaml`: Defines the metadata of the Bento to be built, including the Service, Python packages, and models.
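The sketch below gives a rough idea of how a `service.py` can wire the three Runners together. It is a simplified illustration, not the repository's actual implementation, and the `generate` API name and JSON fields are assumptions:

```python
# Simplified sketch of a service.py for the three-stage pipeline (illustrative only)
import bentoml
from bentoml.io import Image, JSON

# One Runner per model in the local Model Store
stage1_runner = bentoml.diffusers.get("if-stage1:v1.0").to_runner()
stage2_runner = bentoml.diffusers.get("if-stage2:v1.0").to_runner()
upscaler_runner = bentoml.diffusers.get("sd-upscaler:latest").to_runner()

svc = bentoml.Service(
    "deepfloyd-if", runners=[stage1_runner, stage2_runner, upscaler_runner]
)

@svc.api(input=JSON(), output=Image())
def generate(input_data):
    prompt = input_data["prompt"]
    negative_prompt = input_data.get("negative_prompt")
    # Only Stage 1 is shown here; the real Service feeds its output through
    # Stage 2 and the upscaler before returning the final image.
    images = stage1_runner.run(prompt=prompt, negative_prompt=negative_prompt)[0]
    return images[0]
```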
We recommend you create a virtual environment for dependency isolation. For example, run the following commands to create and activate one:
python -m venv venv
source venv/bin/activate
Install the required dependencies:
pip install -r requirements.txt
If you haven't previously downloaded models from Hugging Face using the command line, you may need to log in first:
pip install -U huggingface_hub
huggingface-cli login
Downloading the model to the BentoML Model Store
As mentioned above, you need to download all the models used by each DeepFloyd IF stage. Once you have set up the environment, run the following command to download the models to your local Model Store. The process may take a while.
python import_models.py
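Under the hood, the script essentially imports each Hugging Face pipeline into the Model Store. Here is a hedged sketch of what that can look like with BentoML's diffusers integration; the exact model IDs, names, and options in the repository may differ:

```python
# Rough sketch of import_models.py (illustrative; see the repository for the real script)
import bentoml

# Each call downloads the pretrained pipeline from the Hugging Face Hub
# and registers it in the local BentoML Model Store.
bentoml.diffusers.import_model("if-stage1", "DeepFloyd/IF-I-XL-v1.0")
bentoml.diffusers.import_model("if-stage2", "DeepFloyd/IF-II-L-v1.0")
bentoml.diffusers.import_model("sd-upscaler", "stabilityai/stable-diffusion-x4-upscaler")
```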
Once the downloads are complete, view the models in the Model Store.
$ bentoml models list
Tag Module Size Creation Time
sd-upscaler:bb2ckpa3uoypynry bentoml.diffusers 16.29 GiB 2023-07-06 10:15:53
if-stage2:v1.0 bentoml.diffusers 13.63 GiB 2023-07-06 09:55:49
if-stage1:v1.0 bentoml.diffusers 19.33 GiB 2023-07-06 09:37:59
Starting a BentoML Service
You can directly run the BentoML HTTP server with a web UI powered by Gradio using the `start-server.py` file, which is the entry point of this application. It provides various options for customizing the execution and managing GPU allocation among the different Stages. You may use different commands depending on your GPU setup:
- For a GPU with over 40GB VRAM, run all models on the same GPU.
  python start-server.py
- For two Tesla T4s with 15GB VRAM each, assign the Stage 1 model to the first GPU, and the Stage 2 and Stage 3 models to the second GPU.
  python start-server.py --stage1-gpu=0 --stage2-gpu=1 --stage3-gpu=1
- For one Tesla T4 with 15GB VRAM and two additional GPUs with smaller VRAM, assign the Stage 1 model to the T4, and the Stage 2 and Stage 3 models to the second and third GPUs respectively.
  python start-server.py --stage1-gpu=0 --stage2-gpu=1 --stage3-gpu=2
To see all customizable options (like the server's port), run:
python start-server.py --help
Testing the server
Once the server starts, you can visit the web UI at http://localhost:7860. The BentoML API endpoint is also accessible at http://localhost:3000, so you can query the model programmatically as well.
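A minimal client sketch is shown below; the `generate` route and the JSON field names are assumptions for illustration, so check `service.py` for the actual route and schema:

```python
# Hypothetical client for the BentoML HTTP endpoint (route and fields are assumptions)
import requests

payload = {
    "prompt": "head shot of a woman standing under street lights, ultra realistic, 8k",
    "negative_prompt": "blurry, bad anatomy, watermark",
}

response = requests.post("http://localhost:3000/generate", json=payload)
response.raise_for_status()

# The endpoint returns the generated image bytes
with open("output.png", "wb") as f:
    f.write(response.content)
```

Here is an example of a prompt and a negative prompt.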
Prompt:
orange and black, head shot of a girl standing under street lights, dark theme, Frank Miller, cinema, ultra realistic, ambiance, insanely detailed and complex, hyper realistic, 8k resolution, photorealistic, highly textured, intricate details
Negative prompt:
tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy
Result:
Building and serving a Bento
Now that you have successfully run DeepFloyd IF locally, you can package it into a Bento by running the following command in the project directory.
$ bentoml build
Converting 'IF-stage1' to lowercase: 'if-stage1'.
Converting 'IF-stage2' to lowercase: 'if-stage2'.
Converting DeepFloyd-IF to lowercase: deepfloyd-if.
Building BentoML service "deepfloyd-if:6ufnybq3vwszgnry" from build context "/Users/xxx/Documents/github/IF-multi-GPUs-demo".
Packing model "sd-upscaler:bb2ckpa3uoypynry"
Packing model "if-stage1:v1.0"
Packing model "if-stage2:v1.0"
Locking PyPI package versions.
██████╗░███████╗███╗░░██╗████████╗░█████╗░███╗░░░███╗██╗░░░░░
██╔══██╗██╔════╝████╗░██║╚══██╔══╝██╔══██╗████╗░████║██║░░░░░
██████╦╝█████╗░░██╔██╗██║░░░██║░░░██║░░██║██╔████╔██║██║░░░░░
██╔══██╗██╔══╝░░██║╚████║░░░██║░░░██║░░██║██║╚██╔╝██║██║░░░░░
██████╦╝███████╗██║░╚███║░░░██║░░░╚█████╔╝██║░╚═╝░██║███████╗
╚═════╝░╚══════╝╚═╝░░╚══╝░░░╚═╝░░░░╚════╝░╚═╝░░░░░╚═╝╚══════╝
Successfully built Bento(tag="deepfloyd-if:6ufnybq3vwszgnry").
View the Bento in the local Bento Store.
$ bentoml list
Tag Size Creation Time
deepfloyd-if:6ufnybq3vwszgnry 49.25 GiB 2023-07-06 11:34:52
The Bento is now ready for serving in production.
bentoml serve deepfloyd-if:6ufnybq3vwszgnry
To deploy the Bento in a more cloud-native way, generate a Docker image by running the following command:
bentoml containerize deepfloyd-if:6ufnybq3vwszgnry
You can then deploy the model on Kubernetes.
What’s next?
BentoML provides a powerful and straightforward way to deploy Hugging Face models for production. With its support for a wide range of ML frameworks and easy-to-use APIs, you can ship your model to production in no time. Whether you are working with the DeepFloyd IF model or any other model on the Hugging Face Model Hub, BentoML can help you bring your models to life.
Check out the following resources to see what you can build with BentoML and its ecosystem tools, and stay tuned for more information about BentoML.

