The Hugging Face transformers library is the go-to choice for working with state-of-the-art models, from experimenting with cutting-edge research to fine-tuning on custom data. Its simplicity, flexibility, and expansive model zoo make it a powerful tool for rapid development.
But when you’re ready to move from notebooks to production, inference performance becomes mission-critical. That’s where SGLang comes in.
Designed for high-throughput, low-latency inference, SGLang now offers seamless integration with transformers as a backend. This means you can pair the flexibility of transformers with the raw performance of SGLang.
Let’s dive into what this integration enables and how you can use it.
SGLang now supports Hugging Face transformers as a backend, letting you run any transformers-compatible model with high-performance inference out of the box.
import sglang as sgl
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
No native support needed: SGLang automatically falls back to the transformers implementation when necessary, or you can set impl="transformers" explicitly.
Let’s walk through a simple text generation example with meta-llama/Llama-3.2-1B-Instruct to compare both approaches.
Transformers
The transformers library is great for experimentation, small-scale tasks, and training, but it’s not optimized for high-volume or low-latency scenarios.
from transformers import pipeline

# Build a text-generation pipeline from the Hub checkpoint
pipe = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
generate_kwargs = {
    "top_p": 0.95,
    "top_k": 20,
    "temperature": 0.8,
    "max_new_tokens": 256
}
result = pipe("The future of AI is", **generate_kwargs)
print(result[0]["generated_text"])
SGLang
SGLang takes a different approach, prioritizing efficiency with features like RadixAttention (a memory-efficient attention mechanism). Inference with SGLang is noticeably faster and more resource-efficient, especially under load. Here’s the same task in SGLang using an offline engine:
import sglang as sgl

if __name__ == '__main__':
    llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct")
    prompts = ["The future of AI is"]
    sampling_params = {
        "top_p": 0.95,
        "top_k": 20,
        "temperature": 0.8,
        "max_new_tokens": 256
    }
    outputs = llm.generate(prompts, sampling_params)
    print(outputs[0])
Or you can spin up a server and send requests:
python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --host 0.0.0.0 \
    --port 30000
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The future of AI is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())
Note that SGLang also offers an OpenAI-compatible API, making it a drop-in replacement for external services.
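For example, once the server above is running, you can point the official OpenAI Python client at it. Here is a minimal sketch, assuming the server is listening on localhost:30000 and no API key has been configured:

from openai import OpenAI

# Point the client at the local SGLang server's OpenAI-compatible endpoint
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "The future of AI is"}],
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)
print(response.choices[0].message.content)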
With the new transformers backend integration, SGLang can now automatically fall back to transformers for models it doesn’t natively support. In practice, this means:
- Easy access to recent models added to transformers
- Support for custom models from the Hugging Face Hub
- Less engineering overhead
This unlocks faster inference and optimized deployment (e.g. enabling RadixAttention) without sacrificing the simplicity and flexibility of the transformers ecosystem.
Usage
llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
Note that specifying the impl parameter is optional. If the model isn’t natively supported by SGLang, it falls back to the transformers implementation on its own.
Any model on the Hugging Face Hub that works with transformers using trust_remote_code=True and properly implements attention is compatible with SGLang. You can find the exact requirements in the official documentation. If your custom model meets these criteria, all you need to do is set trust_remote_code=True when loading it.
llm = sgl.Engine(model_path="new-custom-transformers-model", impl="transformers", trust_remote_code=True)
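If you want to sanity-check a custom checkpoint before serving it, you can first confirm it loads through the transformers auto classes. A minimal sketch, using the placeholder model name from above:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder name; replace with your custom model on the Hub.
model_id = "new-custom-transformers-model"

# If both loads succeed with trust_remote_code=True, the checkpoint works with
# transformers; see the SGLang documentation for the additional attention
# implementation requirements.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)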
Example
Kyutai Team’s Helium isn’t yet natively supported by SGLang. This is where the transformers backend shines, enabling optimized inference without waiting for native support.
python3 -m sglang.launch_server \
    --model-path kyutai/helium-1-preview-2b \
    --impl transformers \
    --host 0.0.0.0 \
    --port 30000
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "top_p": 0.95,
            "top_k": 20,
            "temperature": 0.8,
            "max_new_tokens": 256
        },
    },
)
print(response.json())
There are several key areas we’re actively working on to improve this integration:
- Performance improvements: transformers models currently lag behind the native integration in terms of performance. Our primary objective is to optimize and narrow this gap.
- LoRA support
- VLM integration: we’re also working toward adding support for Vision-Language Models (VLMs) to broaden the range of capabilities and use cases.
