The Hugging Face transformers library is the go-to choice for working with state-of-the-art models, from experimenting with cutting-edge research to fine-tuning on custom data. Its simplicity, flexibility, and expansive model zoo make it a powerful tool for rapid development.
But when you’re ready to move from notebooks to production, inference performance becomes mission-critical. That’s where SGLang comes in.
Designed for high-throughput, low-latency inference, SGLang now offers seamless integration with transformers as a backend. This means you can pair the flexibility of transformers with the raw performance of SGLang.
Let’s dive into what this integration enables and how you can use it.
SGLang now supports Hugging Face transformers as a backend, letting you run any transformers-compatible model with high-performance inference out of the box.
import sglang as sgl
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
No native support needed: SGLang automatically falls back to the transformers implementation when necessary, or you can set impl="transformers" explicitly.
Let’s walk through a simple text generation example with meta-llama/Llama-3.2-1B-Instruct to compare both approaches.
Transformers
The transformers library is great for experimentation, small-scale tasks, and training, but it’s not optimized for high-volume or low-latency scenarios.
from transformers import pipeline

# Build a text-generation pipeline from the Hub checkpoint
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
generate_kwargs = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8,
    "max_new_tokens": 256
}
result = pipe("The future of AI is", **generate_kwargs)
print(result[0]["generated_text"])
SGLang
SGLang takes a different approach, prioritizing efficiency with features like RadixAttention (a memory-efficient attention mechanism). Inference with SGLang is noticeably faster and more resource-efficient, especially under load. Here’s the same task in SGLang using an offline engine:
import sglang as sgl

if __name__ == '__main__':
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
    prompts = ["The future of AI is"]
    sampling_params = {
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8,
        "max_new_tokens": 256
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs[0])
Or you can spin up a server and send requests:
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 \
    --port 30000
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())
Note that SGLang also offers an OpenAI-compatible API, making it a drop-in replacement for external services.
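For example, once the server above is running, you can point the official OpenAI Python client at it. Here is a minimal sketch, assuming the server is listening on localhost:30000 and no API key has been configured:

from openai import OpenAI

# Point the client at the local SGLang server's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "The future of AI is"}],
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)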
With the new transformers backend integration, SGLang can now automatically fall back to transformers for models it doesn’t natively support. In practice, this means:
- Easy access to recent models added to transformers
- Support for custom models from the Hugging Face Hub
- Less engineering overhead
This unlocks faster inference and optimized deployment (e.g. enabling RadixAttention) without sacrificing the simplicity and flexibility of the transformers ecosystem.
Usage
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
Note that specifying the impl parameter is optional. If the model isn’t natively supported by SGLang, it falls back to the transformers implementation on its own.
Any model on the Hugging Face Hub that works with transformers using trust_remote_code=True and properly implements attention is compatible with SGLang. You can find the exact requirements in the official documentation. If your custom model meets these criteria, all you need to do is set trust_remote_code=True when loading it.
llm = sgl.Engine(model_path="new-custom-transformers-model", impl="transformers", trust_remote_code=True)
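If you want to sanity-check a custom checkpoint before serving it, you can first confirm it loads through the transformers auto classes. A minimal sketch, using the placeholder model name from above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name; replace with your custom model on the Hub.
model_id = "new-custom-transformers-model"

# If both loads succeed with trust_remote_code=True, the checkpoint works with
# transformers; see the SGLang documentation for the additional attention
# implementation requirements.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)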
Example
Kyutai Team’s Helium isn’t yet natively supported by SGLang. This is where the transformers backend shines, enabling optimized inference without waiting for native support.
python3 -m sglang.launch_server \
    --model-path kyutai/helium-1-preview-2b \
    --impl transformers \
    --host 0.0.0.0 \
    --port 30000
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())
There are several key areas we’re actively working on to improve this integration:
- Performance improvements: transformers models currently lag behind the native integration in terms of performance. Our primary objective is to optimize and narrow this gap.
- LoRA support
- VLM integration: we’re also working toward adding support for Vision-Language Models (VLMs) to broaden the range of capabilities and use cases.
