GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is ideal for consumer hardware and on-device applications.
To make it even better and more impactful for the community, the models are licensed under the Apache 2.0 license, along with a minimal usage policy:
We aim for our tools to be used safely, responsibly, and democratically, while maximizing your control over how you use them. By using gpt-oss, you agree to comply with all applicable law.
According to OpenAI, this release is a meaningful step in their commitment to the open-source ecosystem, in line with their stated mission to make the benefits of AI broadly accessible. Many use cases depend on private and/or local deployments, and we at Hugging Face are super excited to welcome OpenAI to the community. We believe these will be long-lived, inspiring and impactful models.
Contents
Overview of Capabilities and Architecture
- 21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
- 4-bit quantization scheme using mxfp4 format. Only applied to the MoE weights. As stated, the 120B fits on a single 80 GB GPU and the 20B fits on a single 16GB GPU.
- Reasoning, text-only models; with chain-of-thought and adjustable reasoning effort levels.
- Instruction following and tool use support.
- Inference implementations using transformers, vLLM, llama.cpp, and ollama.
- The Responses API is recommended for inference.
- License: Apache 2.0, with a small complementary use policy.
Architecture
- Token-choice MoE with SwiGLU activations.
- When calculating the MoE weights, a softmax is taken over the selected experts (softmax-after-topk); see the sketch after this list.
- Each attention layer uses RoPE with 128K context.
- Alternate attention layers: full-context, and sliding 128-token window.
- Attention layers use a learned attention sink per head, where the denominator of the softmax has an additional additive value.
- It uses the same tokenizer as GPT-4o and other OpenAI API models.
- Some new tokens have been added to enable compatibility with the Responses API.
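As a quick illustration of the routing described above, here is a minimal sketch of token-choice top-k routing with the softmax taken over the selected experts (plain PyTorch, illustrative only; expert counts and shapes are arbitrary, not the actual model code):

```python
# Illustrative sketch of token-choice routing with softmax-after-topk.
import torch

def route_tokens(router_logits: torch.Tensor, top_k: int = 4):
    """router_logits: (num_tokens, num_experts). Returns (indices, weights) per token."""
    # First pick the top-k experts for each token...
    top_logits, top_indices = torch.topk(router_logits, k=top_k, dim=-1)
    # ...then normalize with a softmax over the selected experts only.
    top_weights = torch.softmax(top_logits, dim=-1)
    return top_indices, top_weights

logits = torch.randn(2, 32)                 # 2 tokens, 32 experts
indices, weights = route_tokens(logits)
print(indices.shape, weights.sum(dim=-1))   # weights sum to 1 for each token
```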

o3 and o4-mini (Source: OpenAI).
API access through Inference Providers
OpenAI GPT OSS models are accessible through Hugging Face’s Inference Providers service, allowing you to send requests to any supported provider using the same JavaScript or Python code. This is the same infrastructure that powers OpenAI’s official demo on gpt-oss.com, and you can use it for your own projects.
Below is an example that uses Python and the super-fast Cerebras provider. For more information and additional snippets, check the inference providers section in the model cards and the dedicated guide we crafted for these models.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],
)

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b:cerebras",
    messages=[
        {
            "role": "user",
            "content": "How many rs are in the word 'strawberry'?",
        }
    ],
)

print(completion.choices[0].message)
Inference Providers also implements an OpenAI-compatible Responses API, the most advanced OpenAI interface for chat models, designed for more flexible and intuitive interactions.
Below is an example using the Responses API with the Fireworks AI provider. For more details, check out the open-source responses.js project.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/v1",
    api_key=os.getenv("HF_TOKEN"),
)

response = client.responses.create(
    model="openai/gpt-oss-20b:fireworks-ai",
    input="How many rs are in the word 'strawberry'?",
)

print(response)
Local Inference
Using Transformers
You need to install the latest transformers release (v4.55.1 or later), as well as accelerate and kernels. We also recommend installing triton 3.4 or newer, because it unlocks support for mxfp4 quantization on CUDA hardware:
pip install --upgrade transformers kernels accelerate "triton>=3.4"
The model weights are quantized in mxfp4 format, which was originally only available on GPUs of the Hopper or Blackwell families, but now works on previous CUDA architectures (including Ada, Ampere, and Tesla). Installing triton 3.4, along with the kernels library, makes it possible to download optimized mxfp4 kernels on first use, achieving large memory savings. With these components in place, you can run the 20B model on GPUs with 16 GB of RAM. This includes many consumer cards (3090, 4090, 5080) as well as Colab and Kaggle!
If the previous libraries aren’t installed (or you don’t have a compatible GPU), loading the model will fall back to bfloat16, unpacked from the quantized weights.
The following snippet shows simple inference with the 20B model. As explained, it runs on 16 GB GPUs when using mxfp4, or needs ~48 GB in bfloat16.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
Flash Attention 3
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to pip install --upgrade kernels and add the following line to your snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
This snippet will download the optimized, pre-compiled kernel code from kernels-community, as explained in our previous blog post. The transformers team has built, packaged, and tested the code, so it’s completely safe for you to use.
Other optimizations
We recommend you use mxfp4 if your GPU supports it. If you can additionally use Flash Attention 3, then by all means enable it!
If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+   use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.
AMD ROCm support
OpenAI GPT OSS has been verified on AMD Instinct hardware, and we’re happy to announce initial support for AMD’s ROCm platform in our kernels library, setting the stage for upcoming optimized ROCm kernels in Transformers. MegaBlocks MoE kernel acceleration is already available for OpenAI GPT OSS on AMD Instinct (e.g., MI300-series), enabling better training and inference performance. You can test it with the same inference code shown above.
AMD also prepared a Hugging Face Space for users to try the model on AMD hardware.
Summary of Available Optimizations
At the time of writing, this table summarizes our recommendations based on GPU compatibility and our tests. We expect Flash Attention 3 (with sink attention) to become compatible with additional GPUs.
| | mxfp4 | Flash Attention 3 (w/ sink attention) | MegaBlocks MoE kernels |
|---|---|---|---|
| Hopper GPUs (H100, H200) | ✅ | ✅ | ❌ |
| CUDA GPUs with 16+ GB of RAM | ✅ | ❌ | ❌ |
| Other CUDA GPUs | ❌ | ❌ | ✅ |
| AMD Instinct (MI3XX) | ❌ | ❌ | ✅ |
| How to enable | triton 3.4 + kernels library | Use vllm-flash-attn3 from kernels-community | use_kernels=True |
Though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
The OpenAI GPT OSS models have been trained extensively to leverage tool use as part of their reasoning efforts. The chat template we crafted for transformers provides a lot of flexibility; please check our dedicated section later in this post.
Llama.cpp
Llama.cpp offers native MXFP4 support with Flash Attention, delivering optimal performance across various backends such as Metal, CUDA, and Vulkan, right from the day-0 release.
To install it, follow the guide in the llama.cpp GitHub repository.
# MacOS
brew install llama.cpp
# Windows
winget install llama.cpp
The recommended way is to use it via llama-server:
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none
# Then, access http://localhost:8080
We support both the 120B and 20B models. For more detailed information, visit this PR or the GGUF model collection.
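Once llama-server is running, you can query it through its OpenAI-compatible Chat Completions endpoint. A minimal sketch (assuming the default port 8080 shown above):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hello in one sentence."}]}'
```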
vLLM
As mentioned, vLLM developed optimized Flash Attention 3 kernels that support sink attention, so you’ll get the best results on Hopper cards. Both the Chat Completion and the Responses APIs are supported. You can start a server with the following command, which assumes 2 H100 GPUs are used:
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
Or, use it in Python directly like:
from vllm import LLM
llm = LLM("openai/gpt-oss-120b", tensor_parallel_size=2)
output = llm.generate("San Francisco is a")
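If you start the server instead, it exposes an OpenAI-compatible API on port 8000 by default, so you can reuse the same OpenAI client shown earlier. A minimal sketch (the api_key value is a placeholder; vLLM does not require one by default):

```python
from openai import OpenAI

# Points at the local `vllm serve` instance started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "How many rs are in the word 'strawberry'?"}],
)
print(completion.choices[0].message)
```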
transformers serve
You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just:
transformers serve
You can then send requests using the Responses API.
# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
You can also send requests using the standard Completions API:
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
Fine-Tuning
GPT OSS models are fully integrated with trl. We have developed a few fine-tuning examples using SFTTrainer to get you started.
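As an illustration, here is a minimal LoRA fine-tuning sketch with SFTTrainer; the dataset and hyperparameters are assumptions for demonstration, not the official recipe:

```python
# Minimal LoRA SFT sketch for gpt-oss-20b with TRL; dataset and hyperparameters are illustrative.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("HuggingFaceH4/Multilingual-Thinking", split="train")  # example dataset

trainer = SFTTrainer(
    model="openai/gpt-oss-20b",
    args=SFTConfig(
        output_dir="gpt-oss-20b-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=2e-4,
        bf16=True,
    ),
    train_dataset=dataset,
    peft_config=LoraConfig(r=8, lora_alpha=16, target_modules="all-linear"),
)
trainer.train()
```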
Deploy on Hugging Face Partners
Azure
Hugging Face collaborates with Azure on their Azure AI Model Catalog to bring the most popular open-source models —spanning text, vision, speech, and multimodal tasks— directly into customer environments for secure deployments to managed online endpoints, leveraging Azure’s enterprise-grade infrastructure, autoscaling, and monitoring.
The GPT OSS models are now available on the Azure AI Model Catalog (GPT OSS 20B, GPT OSS 120B), ready to be deployed to an online endpoint for real-time inference.

Dell
The Dell Enterprise Hub is a secure online portal that simplifies training and deploying the latest open AI models on-premises using Dell platforms. Developed in collaboration with Dell, it offers optimized containers, native support for Dell hardware, and enterprise-grade security features.
The GPT OSS models are now available on Dell Enterprise Hub, ready to be deployed on-prem using Dell platforms.

Evaluating the Model
GPT OSS models are reasoning models: they therefore require a very large generation size (maximum number of new tokens) for evaluations, as their generation will first contain reasoning, then the actual answer. Using too small a generation size risks interrupting the prediction in the middle of reasoning, which will cause false negatives. The reasoning trace should then be removed from the model answer before computing metrics, to avoid parsing errors, especially with math or instruction-following evaluations.
Here’s an example of how to evaluate the models with lighteval (you need to install from source).
git clone https://github.com/huggingface/lighteval
pip install -e .[dev] # make sure you have the correct transformers version installed!
lighteval accelerate \
    "model_name=openai/gpt-oss-20b,max_length=16384,skip_special_tokens=False,generation_parameters={temperature:1,top_p:1,top_k:40,min_p:0,max_new_tokens:16384}" \
    "extended|ifeval|0|0,lighteval|aime25|0|0" \
    --save-details --output-dir "openai_scores" \
    --remove-reasoning-tags --reasoning-tags="[('<|channel|>analysis<|message|>','<|end|><|start|>assistant<|channel|>final<|message|>')]"
For the 20B model, this should give you 69.5 (+/-1.9) for IFEval (strict prompt), and 63.3 (+/-8.9) for AIME25 (in pass@1), scores within the expected range for a reasoning model of this size.
If you want to write your own custom evaluation script, note that to filter out the reasoning tags properly, you will need to use skip_special_tokens=False in the tokenizer, in order to get the full trace in the model output (and filter reasoning using the same string pairs as in the example above) – you can find out why below.
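For reference, a minimal sketch of such filtering (using the channel markers explained in the next section, mirroring the multi-GPU snippet above) could look like this:

```python
# Minimal sketch: keep only the "final" channel text, dropping the reasoning trace.
FINAL_MARKER = "<|channel|>final<|message|>"

def strip_reasoning(decoded_output: str) -> str:
    # Everything before the final-channel marker is the chain of thought.
    return decoded_output.split(FINAL_MARKER)[-1].strip()
```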
Chats and Chat Templates
OpenAI GPT OSS uses the concept of “channels” in its outputs. Most of the time, you will see an “analysis” channel that contains things that aren’t intended to be sent to the end user, like chains of thought, and a “final” channel containing messages that are actually intended to be displayed to the user.
Assuming no tools are being used, the structure of the model output looks like this:
<|start|>assistant<|channel|>analysis<|message|>CHAIN_OF_THOUGHT<|end|><|start|>assistant<|channel|>final<|message|>ACTUAL_MESSAGE
Most of the time, you should ignore everything except the text after <|channel|>final<|message|>. Only this text should be appended to the chat as the assistant message, or displayed to the user. There are two exceptions to this rule, though: you may need to include analysis messages in the history during training or if the model is calling external tools.
When training:
If you’re formatting examples for training, you generally want to include the chain of thought in the final message. The right place to do this is in the thinking key.
chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "Can you think about this one?"},
    {"role": "assistant", "thinking": "Thinking real hard...", "content": "Okay!"}
]

inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=False)
You can feel free to include thinking keys in previous turns, or when you’re doing inference rather than training, but they will generally be ignored. The chat template will only ever include the most recent chain of thought, and only in training (when add_generation_prompt=False and the final turn is an assistant turn).
The reason we do it this way is subtle: the OpenAI gpt-oss models were trained on multi-turn data where all but the final chain of thought was dropped. This means that when you want to fine-tune an OpenAI gpt-oss model, you should do the same:
- Let the chat template drop all chains of thought except the final one.
- Mask the labels on all turns except the final assistant turn, or else you will be training it on the previous turns without chains of thought, which will teach it to emit responses without CoTs. This means you can’t train on a whole multi-turn conversation as a single sample; instead, you have to break it into one sample per assistant turn, with only the final assistant turn unmasked each time, so that the model can learn from each turn while still correctly only seeing a chain of thought on the final message each time (see the sketch below).
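As an illustration of the last point, here is a minimal sketch (not from the official examples) of splitting a conversation into one sample per assistant turn:

```python
# Minimal sketch: one training sample per assistant turn, each ending on that turn.
def split_into_samples(conversation: list) -> list:
    samples = []
    for i, message in enumerate(conversation):
        if message["role"] == "assistant":
            # History up to and including this assistant turn; the chat template
            # will keep only this final chain of thought when formatting it.
            samples.append(conversation[: i + 1])
    return samples

chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "thinking": "Greeting the user...", "content": "Hello!"},
    {"role": "user", "content": "Can you think about this one?"},
    {"role": "assistant", "thinking": "Thinking real hard...", "content": "Okay!"},
]
print(len(split_into_samples(chat)))  # 2 samples, one per assistant turn
```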
System and Developer Messages
OpenAI GPT OSS is unusual because it distinguishes between a “system” message and a “developer” message at the start of the chat, while most other models only use “system”. In GPT OSS, the system message follows a strict format and contains information like the current date, the model identity and the level of reasoning effort to use, while the “developer” message is more freeform, which makes it (very confusingly) similar to the “system” messages of most other models.
To make GPT OSS easier to use with the standard API, the chat template will treat a message with the “system” or “developer” role as the developer message. If you want to modify the actual system message, you can pass the specific arguments model_identity or reasoning_effort to the chat template:
chat = [
    {"role": "system", "content": "This will actually become a developer message!"}
]

tokenizer.apply_chat_template(
    chat,
    model_identity="You are OpenAI GPT OSS.",
    reasoning_effort="high"
)
Tool Use With transformers
GPT OSS supports two kinds of tools: the “builtin” tools browser and python, and custom tools supplied by the user. To enable builtin tools, pass their names in a list to the builtin_tools argument of the chat template, as shown below. To pass custom tools, you can pass them either as JSON schema or as Python functions with type hints and docstrings using the tools argument. See the chat template tools documentation for more details, or you can just modify the example below:
def get_current_weather(location: str):
    """
    Returns the current weather status at a given location as a string.

    Args:
        location: The location to get the weather for.
    """
    return "Terrestrial."

chat = [
    {"role": "user", "content": "What's the weather in Paris right now?"}
]

inputs = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_weather],
    builtin_tools=["browser", "python"],
    add_generation_prompt=True,
    return_tensors="pt"
)
If the model chooses to call a tool (indicated by a message ending in <|call|>), then you should add the tool call to the chat, call the tool, then add the tool result to the chat and generate again:
tool_call_message = {
    "role": "assistant",
    "tool_calls": [
        {
            "type": "function",
            "function": {
                "name": "get_current_weather",
                "arguments": {"location": "Paris, France"}
            }
        }
    ]
}
chat.append(tool_call_message)

tool_output = get_current_weather("Paris, France")

tool_result_message = {
    "role": "tool",
    "content": tool_output
}
chat.append(tool_result_message)
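Continuing the example, you can then re-render the chat and generate the model’s reply to the tool result. This is a minimal sketch that mirrors the earlier generation snippets; the generation length is an arbitrary choice:

```python
# Re-render the chat, now including the tool call and the tool result, and generate again.
inputs = tokenizer.apply_chat_template(
    chat,
    tools=[get_current_weather],
    builtin_tools=["browser", "python"],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```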
Acknowledgements
This is a very important release for the community, and it took a momentous effort across teams and companies to comprehensively support the new models in the ecosystem.
The authors of this blog post were chosen from among those who contributed content to the post itself, and the list does not represent commitment to the project. In addition to the author list, others contributed significant content reviews, including Merve and Sergio. Thank you!
The integration and enablement work involved dozens of people. In no particular order, we want to highlight Cyril, Lysandre, Arthur, Marc, Mohammed, Nouamane, Harry, Benjamin, and Matt from the open source team. From the TRL team, Ed, Lewis, and Quentin were all involved. We would also like to thank Clémentine from Evaluations, and David and Daniel from the Kernels team. On the commercial partnerships side we got significant contributions from Simon, Alvaro, Jeff, Akos, Alvaro, and Ivar. The Hub and Product teams contributed Inference Providers support, llama.cpp support, and many other improvements, all thanks to Simon, Célina, Pierric, Lucain, Xuan-Son, Chunte, and Julien. Magda and Anna were involved from the legal team.
Hugging Face’s role is to enable the community to use these models effectively. We’re indebted to projects such as vLLM for advancing the field, and cherish our continued collaboration with inference providers to provide ever simpler ways to build on top of them.
And of course, we deeply appreciate OpenAI’s decision to release these models for the community at large. Here’s to many more!
