ZeroGPU lets anyone spin up powerful Nvidia H200 hardware in Hugging Face Spaces without keeping a GPU locked for idle traffic.
It’s efficient, flexible, and great for demos, but it doesn’t always make full use of everything the GPU and CUDA stack can offer.
Generating images or videos can take a significant amount of time, so squeezing more performance out of the H200 hardware really matters in this case.
This is where PyTorch ahead-of-time (AoT) compilation comes in. Instead of compiling models on the fly (which doesn’t play nicely with ZeroGPU’s short-lived processes), AoT lets you optimize once and reload instantly.
The result: snappier demos and a smoother experience, with speedups ranging from 1.3× to 1.8× on models like Flux, Wan, and LTX 🔥
In this post, we’ll show how to wire up Ahead-of-Time (AoT) compilation in ZeroGPU Spaces. We’ll explore advanced tricks like FP8 quantization and dynamic shapes, and share working demos you can try right away. If you can’t wait, we invite you to check out some ZeroGPU-powered demos in the zerogpu-aoti organization.
Pro users and Team / Enterprise org members can create ZeroGPU Spaces, while anyone can freely use them (Pro, Team and Enterprise users get 8x more ZeroGPU quota)
Table of Contents
What’s ZeroGPU
Spaces is a platform from Hugging Face that lets ML practitioners easily publish demo apps.
A typical demo app on Spaces looks like this:
import gradio as gr
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(...).to('cuda')
def generate(prompt):
return pipe(prompt).images
gr.Interface(generate, "text", "gallery").launch()
This works great, but it ends up reserving a GPU for the Space during its entire lifetime, even when it has no user activity.
When executing .to('cuda') on this line:
pipe = DiffusionPipeline.from_pretrained(...).to('cuda')
PyTorch initializes the NVIDIA driver, which sets up the process on CUDA for good. This is not very resource-efficient, given that app traffic is not smooth and steady but rather extremely sparse and spiky.
ZeroGPU takes a just-in-time approach to GPU initialization. Instead of setting up the main process on CUDA, it automatically forks the process, sets the fork up on CUDA, runs the GPU tasks, and finally kills the fork when the GPU needs to be released.
This means that:
- When the app doesn’t receive traffic, it doesn’t use any GPU
- When it is actually performing a task, it uses one GPU
- It can use multiple GPUs to perform tasks concurrently when needed
Thanks to the Python spaces package, the only code change needed to get this behaviour is the following:
import gradio as gr
+ import spaces
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(...).to('cuda')
+ @spaces.GPU
def generate(prompt):
return pipe(prompt).images
gr.Interface(generate, "text", "gallery").launch()
By importing spaces and adding the @spaces.GPU decorator, we:
- Intercept PyTorch API calls to postpone CUDA operations
- Make the decorated function run in a fork when later called
- (Call an internal API to make the right device visible to the fork, but this is beyond the scope of this blog post)
ZeroGPU currently allocates a MIG slice of an H200 (the 3g.71gb profile). Additional MIG sizes, including the full slice (7g.141gb profile), will arrive in late 2025.
PyTorch compilation
Modern ML frameworks like PyTorch and JAX offer compilation that can be used to optimize model latency or inference time. Behind the scenes, compilation applies a series of (often hardware-dependent) optimization steps such as operator fusion, constant folding, etc.
PyTorch (from 2.0 onwards) currently has two major interfaces for compilation:
- Just-in-time with torch.compile
- Ahead-of-time with torch.export + AOTInductor
torch.compile works great in standard environments: it compiles your model the first time it runs and reuses the optimized version for subsequent calls.
On ZeroGPU, however, since the process is freshly spawned for (almost) every GPU task, torch.compile cannot efficiently reuse its compilation work and is forced to rely on its filesystem cache to restore compiled models. Depending on the model being compiled, this process takes anywhere from a few dozen seconds to a few minutes, which is far too long for practical GPU tasks in Spaces.
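To make the contrast concrete, here is a minimal sketch of the just-in-time interface in a standard (non-ZeroGPU) GPU environment; the compilation cost is paid on the first call:

import torch

model = torch.nn.Linear(8, 8).cuda()
compiled_model = torch.compile(model)  # no compilation work happens yet

x = torch.randn(2, 8, device="cuda")
_ = compiled_model(x)  # first call triggers (slow) compilation for this input shape
_ = compiled_model(x)  # subsequent calls in the same process reuse the compiled code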
This is where ahead-of-time (AoT) compilation shines.
With AoT, we can export a compiled model once, save it, and later reload it instantly in any process, which is exactly what we need for ZeroGPU. This reduces framework overhead and also eliminates the cold-start latency typically incurred by just-in-time compilation.
But how do we do ahead-of-time compilation on ZeroGPU? Let’s dive in.
Ahead-of-time compilation on ZeroGPU
Let’s go back to our ZeroGPU base example and unpack what we need to enable AoT compilation. For the purpose of this demo, we’ll use the black-forest-labs/FLUX.1-dev model:
import gradio as gr
import spaces
import torch
from diffusers import DiffusionPipeline
MODEL_ID = 'black-forest-labs/FLUX.1-dev'
pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
pipe.to('cuda')
@spaces.GPU
def generate(prompt):
return pipe(prompt).images
gr.Interface(generate, "text", "gallery").launch()
In the discussion below, we only compile the transformer component of pipe since, in these generative models, the transformer (or more generally, the denoiser) is the most computationally heavy component.
Compiling a model ahead-of-time with PyTorch involves multiple steps:
1. Getting example inputs
Recall that we are compiling the model ahead of time, so we need to derive example inputs for it. These are the same kinds of inputs we expect to see during actual runs. To capture them, we’ll leverage the spaces.aoti_capture helper from the spaces package:
with spaces.aoti_capture(pipe.transformer) as call:
pipe("arbitrary example prompt")
When used as a context manager, aoti_capture intercepts the call to any callable (pipe.transformer in our case), prevents it from executing, captures the input arguments that would have been passed to it, and stores their values in call.args and call.kwargs.
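As a quick sanity check, you can inspect what was captured (a small hedged sketch; the exact argument names depend on the model):

print(f"{len(call.args)} positional args, {len(call.kwargs)} keyword args")
for name, value in call.kwargs.items():
    if torch.is_tensor(value):
        print(name, tuple(value.shape), value.dtype)
    else:
        print(name, type(value).__name__)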
2. Exporting the model
Now that we have example args and kwargs for our transformer component, we can export it to a PyTorch ExportedProgram using the torch.export.export utility:
exported_transformer = torch.export.export(
pipe.transformer,
args=call.args,
kwargs=call.kwargs,
)
An exported PyTorch program is a computation graph that represents the tensor computations along with the original model parameter values.
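If you are curious, a small optional sketch for inspecting the exported program:

print(exported_transformer)                   # prints the captured FX graph and its signature
eager_module = exported_transformer.module()  # a callable nn.Module, handy for sanity checks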
3. Compiling the exported model
Once the model is exported, compiling it is pretty straightforward.
Traditional AoT compilation in PyTorch usually requires saving the model to disk so it can be reloaded later. In our case, we’ll leverage a helper function from the spaces package: spaces.aoti_compile. It is a tiny wrapper around torch._inductor.aot_compile that manages saving and lazy-loading the model as needed. It is meant to be used like this:
compiled_transformer = spaces.aoti_compile(exported_transformer)
This compiled_transformer is now an AoT-compiled binary ready to be used for inference.
4. Using the compiled model in the pipeline
Now we need to bind our compiled transformer to our original pipeline, i.e., pipe.
A naive and almost-working approach is to simply patch the pipeline with pipe.transformer = compiled_transformer. Unfortunately, this doesn’t work because it discards important attributes like dtype, config, etc. Patching only the forward method doesn’t work well either, because the original model parameters would then stay in memory, often leading to OOM errors at runtime.
The spaces package provides a utility for this, too: spaces.aoti_apply:
spaces.aoti_apply(compiled_transformer, pipe.transformer)
Et voilà! It handles patching pipe.transformer.forward with our compiled model, as well as clearing the old model parameters out of memory.
5. Wrapping it all together
To perform the first three steps (capturing example inputs, exporting the model, and compiling it with PyTorch Inductor), we need an actual GPU. The CUDA emulation you get outside of a @spaces.GPU function is not enough, because compilation is inherently hardware-dependent, for instance relying on micro-benchmark runs to tune the generated code. This is why we need to wrap it all inside a @spaces.GPU function and then pass the compiled model back to the root of our app. Starting from our original demo code, this gives:
import gradio as gr
import spaces
import torch
from diffusers import DiffusionPipeline
MODEL_ID = 'black-forest-labs/FLUX.1-dev'
pipe = DiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
pipe.to('cuda')
+ @spaces.GPU(duration=1500) # maximum duration allowed during startup
+ def compile_transformer():
+ with spaces.aoti_capture(pipe.transformer) as call:
+ pipe("arbitrary example prompt")
+
+ exported = torch.export.export(
+ pipe.transformer,
+ args=call.args,
+ kwargs=call.kwargs,
+ )
+ return spaces.aoti_compile(exported)
+
+ compiled_transformer = compile_transformer()
+ spaces.aoti_apply(compiled_transformer, pipe.transformer)
@spaces.GPU
def generate(prompt):
return pipe(prompt).images
gr.Interface(generate, "text", "gallery").launch()
With only a dozen lines of additional code, we’ve made our demo significantly faster (1.7× in the case of FLUX.1-dev).
If you want to learn more about AoT compilation, you can read PyTorch’s AOTInductor tutorial.
Gotchas
Now that we have demonstrated the speedups that can be achieved under ZeroGPU’s constraints, let’s discuss a few gotchas that came up while working with this setup.
Quantization
AoT can be combined with quantization to deliver even greater speedups.
For image and video generation, FP8 post-training dynamic quantization schemes offer a good speed-quality trade-off.
However, FP8 requires a CUDA compute capability of at least 9.0.
Thankfully, since ZeroGPU is based on H200s, we can already take advantage of FP8 quantization schemes.
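If you want to double-check the hardware before quantizing, a quick capability check looks like this (a sketch; run it inside a @spaces.GPU function so a real device is visible):

import torch

major, minor = torch.cuda.get_device_capability()
assert (major, minor) >= (9, 0), "FP8 dynamic quantization needs compute capability >= 9.0"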
To enable FP8 quantization within our AoT compilation workflow, we can leverage the APIs provided by torchao like so:
+ from torchao.quantization import quantize_, Float8DynamicActivationFloat8WeightConfig
+ # Quantize the transformer just before the export step.
+ quantize_(pipe.transformer, Float8DynamicActivationFloat8WeightConfig())
exported_transformer = torch.export.export(
pipe.transformer,
args=call.args,
kwargs=call.kwargs,
)
(You can find more details about TorchAO here.)
We can then proceed with the rest of the steps outlined above. Quantization provides another 1.2× speedup.
Dynamic shapes
Images and videos can come in different shapes and sizes, so it is important to also account for shape dynamism when performing AoT compilation. The primitives provided by torch.export.export make it easy to configure which inputs should be treated as dynamic and along which dimensions, as shown below.
In the case of the Flux.1-Dev transformer, changing the image resolution affects two of its forward arguments:
- hidden_states: the noisy input latents that the transformer is supposed to denoise. It is a 3D tensor of shape (batch_size, flattened_latent_dim, embed_dim). When the batch size is fixed, it is flattened_latent_dim that changes with the image resolution.
- img_ids: a 2D array of encoded pixel coordinates with shape (height * width, 3). In this case, we want to make height * width dynamic.
We start by defining a range in which we want to let the (latent) image resolutions vary.
To derive these value ranges, we inspected the shapes of hidden_states in the pipeline for various image resolutions. The exact values are model-dependent and require manual inspection and some intuition. For Flux.1-Dev, we ended up with:
transformer_hidden_dim = torch.export.Dim('hidden', min=4096, max=8212)
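As a hedged illustration of how such bounds can be derived, you can capture the transformer inputs at a few resolutions and look at the flattened latent dimension (run inside a @spaces.GPU function; the prompt and resolutions are only examples):

for height, width in [(1024, 1024), (768, 1360), (1440, 1440)]:
    with spaces.aoti_capture(pipe.transformer) as call:
        pipe("arbitrary example prompt", height=height, width=width)
    print(height, width, call.kwargs["hidden_states"].shape)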
We then define a mapping from argument names to the dimensions of their input values that we expect to be dynamic:
transformer_dynamic_shapes = {
"hidden_states": {1: transformer_hidden_dim},
"img_ids": {0: transformer_hidden_dim},
}
Then we need to make our dynamic shapes object replicate the structure of our example inputs. Inputs that don’t need dynamic shapes must be set to None. This can be done very easily with PyTorch’s tree_map utility:
from torch.utils._pytree import tree_map
dynamic_shapes = tree_map(lambda v: None, call.kwargs)
dynamic_shapes |= transformer_dynamic_shapes
Now, when performing the export step, we simply pass dynamic_shapes to torch.export.export:
exported_transformer = torch.export.export(
pipe.transformer,
args=call.args,
kwargs=call.kwargs,
dynamic_shapes=dynamic_shapes,
)
Check out this Space, which shows how to use both quantization and dynamic shapes during the export step.
Multi-compile / shared weights
Dynamic shapes are sometimes not enough when the variation in shapes is too large.
This is, for instance, the case with the Wan family of video generation models if you want your compiled model to generate videos at different resolutions.
One thing we can do in this case is compile one model per resolution while keeping the model parameters shared, and dispatch the right one at runtime.
Here’s a minimal example of this approach: zerogpu-aoti-multi.py. You can also see a fully working implementation of this paradigm in the Wan 2.2 Space.
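As a rough illustration of the idea, here is a hedged sketch; it compiles one variant per resolution but glosses over the weight sharing that the linked examples implement, and the resolutions are only examples:

RESOLUTIONS = [(480, 832), (720, 1280)]

@spaces.GPU(duration=1500)
def compile_per_resolution():
    compiled = {}
    for height, width in RESOLUTIONS:
        with spaces.aoti_capture(pipe.transformer) as call:
            pipe("arbitrary example prompt", height=height, width=width)
        exported = torch.export.export(
            pipe.transformer,
            args=call.args,
            kwargs=call.kwargs,
        )
        compiled[(height, width)] = spaces.aoti_compile(exported)
    return compiled

compiled_variants = compile_per_resolution()
# At inference time, dispatch to compiled_variants[(height, width)] for the requested size.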
FlashAttention-3
Since the ZeroGPU hardware and CUDA drivers are fully compatible with FlashAttention-3 (FA3), we can use it in our ZeroGPU Spaces to speed things up even further. FA3 works with ahead-of-time compilation, so it is ideal for our case.
Compiling and building FA3 from source can take several minutes, and the process is hardware-dependent. As users, we wouldn’t want to burn precious ZeroGPU compute hours on it. This is where the Hugging Face kernels library comes to the rescue. It provides access to pre-built kernels that are compatible with a given piece of hardware. For instance, when we run:
from kernels import get_kernel
vllm_flash_attn3 = get_kernel("kernels-community/vllm-flash-attn3")
it tries to load a kernel from the kernels-community/vllm-flash-attn3 repository that is compatible with the current setup; otherwise, it errors out due to incompatibility. Luckily for us, this works seamlessly on ZeroGPU Spaces, which means we can leverage the power of FA3 on ZeroGPU through the kernels library.
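Once loaded, you can give the kernel a quick smoke test before wiring it into an attention processor (a hedged sketch; the flash_attn_func name and its (batch, seq_len, num_heads, head_dim) layout are assumptions based on the usual FlashAttention interface, and this must run inside a @spaces.GPU function):

import torch

q = torch.randn(1, 4096, 24, 128, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 4096, 24, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 4096, 24, 128, device="cuda", dtype=torch.bfloat16)

out = vllm_flash_attn3.flash_attn_func(q, k, v)
out = out[0] if isinstance(out, tuple) else out  # some builds also return the log-sum-exp
print(out.shape)  # expected: (1, 4096, 24, 128)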
Here’s a fully working example of an FA3 attention processor for the Qwen-Image model.
Regional compilation
So far, we have been compiling the full model. Depending on the model, full-model compilation can lead to significantly long cold-start times, which makes the development experience unpleasant.
We can instead choose to compile regions within a model, significantly reducing cold-start times while retaining almost all the benefits of full-model compilation. Regional compilation becomes appealing when a model has repeated blocks of computation. A standard language model, for example, has a number of identically structured Transformer blocks.
In our example, we can compile the repeated blocks of the Flux transformer ahead of time and propagate the compiled graph to the remaining repeated blocks. The Flux transformer has two kinds of repeated blocks: FluxTransformerBlock and FluxSingleTransformerBlock.
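As a starting point, here is a hedged sketch of exporting a single repeated block ahead of time; propagating the compiled graph to the remaining blocks while keeping their own weights needs the extra plumbing shown in the Space linked below (attribute names follow diffusers’ FluxTransformer2DModel):

@spaces.GPU(duration=1500)
def compile_one_block():
    block = pipe.transformer.transformer_blocks[0]
    # Capture example inputs for one representative block, then export and compile it.
    with spaces.aoti_capture(block) as call:
        pipe("arbitrary example prompt")
    exported_block = torch.export.export(block, args=call.args, kwargs=call.kwargs)
    return spaces.aoti_compile(exported_block)

compiled_block = compile_one_block()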
You can take a look at this Space for a complete example.
💡 For Flux.1-Dev, switching to regional compilation reduces the compilation time from 6 minutes to just 30 seconds while delivering similar speedups.
Use a compiled graph from the Hub
Once a model (or even a model block) is compiled ahead of time, we can serialize the compiled graph module as an artifact and reuse it later. In the context of a ZeroGPU-powered demo on Spaces, this significantly cuts down the demo startup time by skipping compilation entirely.
To keep the storage light, we can save just the compiled model graph without including any model parameters in the artifact.
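As a rough illustration with standard APIs (a hedged sketch using PyTorch’s own AOTInductor packaging, available in recent PyTorch versions, and huggingface_hub rather than the spaces helpers; the repo and file names are hypothetical, and unlike the linked collection this sketch keeps the parameters inside the artifact):

import torch
from huggingface_hub import hf_hub_download, upload_file

# On the machine that compiled the model: package the exported program and push it to the Hub.
torch._inductor.aoti_compile_and_package(
    exported_transformer, package_path="transformer.pt2"
)
upload_file(
    path_or_fileobj="transformer.pt2",
    path_in_repo="transformer.pt2",
    repo_id="your-username/flux-aoti-artifacts",  # hypothetical repo
)

# In the demo Space: download the artifact and load the compiled graph, skipping compilation.
local_path = hf_hub_download("your-username/flux-aoti-artifacts", "transformer.pt2")
compiled_transformer = torch._inductor.aoti_load_package(local_path)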
Check out this collection, which shows the full workflow of obtaining a compiled model graph, pushing it to the Hub, and then using it to build a demo.
AoT compiled ZeroGPU Spaces demos
Speedup comparison
Featured AoTI Spaces
Regional compilation
Conclusion
ZeroGPU in Hugging Face Spaces is a powerful feature that empowers AI builders by providing access to serious compute. In this post, we showed how users can benefit from PyTorch’s ahead-of-time compilation techniques to speed up their applications that leverage ZeroGPU.
We demonstrated speedups with Flux.1-Dev, but these techniques are not limited to that model. We encourage you to give them a try and share your feedback in this community discussion.
Resources
- Visit our ZeroGPU-AOTI org on the Hub for a collection of demos that leverage the techniques discussed in this post
- Browse the spaces.aoti_* API source code to learn more about the interface
- Check out the Kernels Community org on the Hub
- Learn more about regional compilation here
- Upgrade to Pro on Hugging Face to create your own ZeroGPU Spaces (and get 25 minutes of H200 usage every day)
Acknowledgements: Thanks to ChunTe Lee for creating a great thumbnail for this post. Thanks to Pedro and Vaibhav for providing feedback on the post. Thanks to Angela Yi from the PyTorch team for helping us with AoT guidance.
