Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original ULMFiT (~30M parameters) or GPT-2 (1.5B parameters, which at the time was considered “too dangerous to release” 🧌), to today’s hundred-billion–parameter systems, the recipe was simple:
More data + more parameters yields better performance.
Scaling laws reinforced this trend, but dense scaling has practical limits:
- Training becomes increasingly expensive.
- Inference latency grows.
- Deployment requires significant memory and hardware.
That is where Mixture of Experts (MoEs) enter the picture.
If you’re already familiar with MoEs and want to jump straight into the engineering work done in transformers, you can head directly to Transformers and MoEs.
From Dense to Sparse: What Are MoEs?
A Mixture of Experts model keeps the Transformer backbone, but replaces certain dense feed-forward layers with a set of experts. An “expert” is not a topic-specialized module (e.g., “math expert”, “code expert”). It is simply a learnable sub-network. For each token, a router selects a small subset of experts to process it.
Different tokens activate different experts, based on their hidden representations.
Model capacity depends on total parameters, but inference speed depends on active parameters.
This is the key idea.
For example, take gpt-oss-20b. It has 21B total parameters, but uses 4 active experts per token, out of a total of 32 experts. Counting the shared components plus the active experts, this model uses ~3.6B active parameters per token. Running this model on an M3 Ultra Mac, which has a memory bandwidth of about 800 GB/s, we can estimate generation speed as ~800 / (3.6 * 2) in bfloat16, where each parameter takes 2 bytes. This yields about 111 tokens per second. The actual performance number we get is ~115 tok/s, which is very close to the back-of-the-envelope calculation.
This fast speed confirms the model runs roughly like a 3.6B parameter one, but it has the same capacity (or quality) as a 21B parameter model.
(Note: speed would be even faster if we used kernels for the native mxfp4 quantization the model uses.)
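The back-of-the-envelope estimate above can be reproduced in a few lines. The numbers are the ones from the text and should be treated as rough assumptions, not measurements:

```python
# Roofline-style estimate: token-by-token decoding is memory-bandwidth bound,
# so generating one token requires streaming every active parameter once.
bandwidth_gb_s = 800      # M3 Ultra memory bandwidth, ~800 GB/s (assumed)
active_params_b = 3.6     # active parameters per token, in billions
bytes_per_param = 2       # bfloat16

bytes_per_token_gb = active_params_b * bytes_per_param   # GB read per token
tokens_per_second = bandwidth_gb_s / bytes_per_token_gb

print(f"{tokens_per_second:.0f} tok/s")  # ~111 tok/s, close to the measured ~115
```

The same arithmetic applied to the full 21B parameters would predict ~19 tok/s, which is why sparsity matters so much for decoding speed.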
MoEs are attractive for these reasons:
- Higher Compute Efficiency
Given a fixed training FLOP budget, MoEs often outperform dense counterparts. This means faster iteration and better scaling efficiency.
- A Natural Parallelization Axis
Experts provide a structural boundary in the computation graph. Since different tokens engage different experts, we can parallelize across experts (we discuss this later in Expert Parallelism).
- Industry Adoption
Recent major open MoE releases from the past few weeks include Qwen 3.5, MiniMax M2, GLM-5, and Kimi K2.5. The trend accelerated after the success of DeepSeek R1 in January 2025, building on earlier systems like DeepSeek V2. Another early MoE was Mixtral-8x7B, released in December 2023.

Figure 3: Two-year timeline of MoE model additions to the transformers library. DeepSeek R1 marks a clear inflection point.
Closed labs use MoEs too. ChatGPT has long been rumored to use a sparse architecture, and the open gpt-oss models certainly do.
If you want to learn more about MoEs in general, we strongly suggest reading this blog and watching our recent YouTube video on routing.
Transformers and MoEs
Most tooling in the ecosystem, including model loading, device placement, quantization, and backend execution, was originally designed for dense models. MoEs challenge these assumptions.
Making MoEs first-class citizens in transformers means redesigning parts of the loading pipeline, execution model, and distributed abstractions, not just adding new model classes. We’ll focus on how the transformers library has evolved to support sparse architectures across:
Weight Loading Refactor
AutoModelForCausalLM.from_pretrained("model_id") downloads and loads model weights into a PyTorch model. For dense models, loading is relatively straightforward: each tensor in the checkpoint maps one-to-one to a parameter in the runtime module.
For MoEs, it’s more complicated. In most MoE checkpoints, each expert is serialized independently. If you peek inside the DeepSeek-V3 checkpoint index, you’ll see keys like:
model.layers.3.mlp.experts.0.gate_proj.weight
...
model.layers.3.mlp.experts.255.gate_proj.weight
Each expert has its own set of weight matrices, essentially 256 (numbered 0 to 255, taking DeepSeek-V3 as an example) small feed-forward networks saved side by side. At runtime, however, GPUs execute optimized kernels. Modern MoE kernels such as grouped GEMMs and fused MoE implementations are designed to process all experts in a single operation, not by looping over them one by one.
To do this efficiently, they require expert weights to be packed into a single contiguous tensor.
So we have a mismatch:
- Checkpoint: 256 separate tensors
- Runtime: 1 packed tensor
Bridging this gap systematically is what the weight loading refactor enables.
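Conceptually, the conversion is just a stack: collect the per-expert tensors in index order and pack them into one contiguous array. Here is a minimal sketch, with numpy standing in for the real tensor ops; shapes and the tiny expert count are illustrative, only the key pattern follows the DeepSeek-V3 index:

```python
import numpy as np

num_experts, hidden, intermediate = 4, 8, 16  # tiny illustrative shapes

# Checkpoint view: one tensor per expert, keyed like the checkpoint index.
checkpoint = {
    f"model.layers.3.mlp.experts.{i}.gate_proj.weight": np.random.randn(intermediate, hidden)
    for i in range(num_experts)
}

# Runtime view: a single contiguous [num_experts, intermediate, hidden] tensor
# that a grouped GEMM kernel can consume in one operation.
packed = np.stack(
    [checkpoint[f"model.layers.3.mlp.experts.{i}.gate_proj.weight"] for i in range(num_experts)],
    axis=0,
)

assert packed.shape == (num_experts, intermediate, hidden)
assert packed.flags["C_CONTIGUOUS"]  # one contiguous buffer, no per-expert loop
```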
With the introduction of a generic WeightConverter, the mental model shifted from:
A checkpoint already matches my runtime layout; loading is mostly a key-by-key copy.
to:
A checkpoint is just a serialized source of tensors. Loading is a conversion pipeline that transforms them into the runtime layout we want.
Dynamic Weight Loading with WeightConverter
The central abstraction introduced by this refactor is dynamic weight loading via a WeightConverter.
WeightConverter lets us define:
source key patterns → target key(s) + operations
Primitive operations (chunk, concatenate, etc.) are composable. Two that are particularly useful for MoEs:
- MergeModulelist merges a list of tensors into a single tensor. For example, you can compose MergeModulelist with Concatenate to stack the experts in a MoE and pack them into one tensor.
WeightConverter(
    [
        "block_sparse_moe.experts.*.w1.weight",
        "block_sparse_moe.experts.*.w3.weight",
    ],
    "mlp.experts.gate_up_proj",
    operations=[
        MergeModulelist(dim=0),
        Concatenate(dim=1),
    ],
)
- SplitModulelist splits a tensor back into a list of tensors. For example, you can split a stack of experts back into individual experts.
WeightConverter(
    "mlp.experts.down_proj",
    "block_sparse_moe.experts.*.w2.weight",
    operations=[SplitModulelist(dim=0)],
)
Lazy Materialization of Tensors
The refactor improves not only what conversions exist, but how they’re scheduled.
The loader scans checkpoint keys once, matches them against converter patterns, and groups tensors per converter. Once a key is identified as needed, it’s registered as a future and materialized via a thread pool. Conversion operations run only once their dependencies are ready. For example, MergeModulelist waits until all experts for a layer are loaded.
This avoids repeated scans and reduces memory peaks.
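The scheduling idea can be sketched with a thread pool and futures: every needed key becomes a future, and a conversion blocks only until its own inputs have materialized. The names here (`load_tensor`, the merge step) are illustrative stand-ins, not the actual transformers internals:

```python
from concurrent.futures import ThreadPoolExecutor

def load_tensor(key):
    # Stand-in for reading one tensor from a checkpoint shard.
    return f"tensor({key})"

expert_keys = [f"experts.{i}.gate_proj.weight" for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Register each needed key as a future; I/O overlaps across threads.
    futures = {key: pool.submit(load_tensor, key) for key in expert_keys}

    # The merge step (cf. MergeModulelist) waits only for *its* dependencies,
    # so unrelated tensors never have to sit in memory at the same time.
    merged = [futures[key].result() for key in expert_keys]

print(merged)
```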
Benchmark: Weight-Loading Pipeline Improvements
To evaluate the improvements introduced by the new weight-loading pipeline, we benchmarked v4 vs v5 of transformers. The focus is on loading speed for large MoE models, which is often a bottleneck in training and inference.
We benchmarked v4 vs v5 using:
Example:
from transformers import AutoModelForCausalLM
model_id = "Qwen/Qwen1.5-110B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_id)
Two relevant environment variables:
Results
Model: Qwen/Qwen1.5-110B-Chat
GPU: 1× A100 (80GB)
| Version | Strategy | Loading Mode | Time |
|---|---|---|---|
| v4.57.6 | device_map="auto" | Threadpool | 66.24s |
| v4.57.6 | device_map="auto" | Sequential | 67.29s |
| v4.57.6 | TP | — | OOM |
| v5 | device_map="auto" | Async (default) | 20.71s |
| v5 | device_map="auto" | Sync | 45.3s |
| v5 | TP | Async | 10.1s |
| v5 | TP | Sync | 19.28s |
The speedup is not just “more threads.”
It’s the combination of single-pass routing, async materialization, and conversion-aware scheduling, which together avoid unnecessary materialization and memory peaks while enabling expert packing and projection fusion at load time.
Where Quantization Fits In
With this refactor we can now create the runtime module structure first and then convert the weights into that structure. We can optionally attach quantization within the conversion pipeline, making quantization part of the weight loading pipeline itself. This is crucial because quantizing “per expert” only makes sense once experts exist in a predictable packed layout.
This end-to-end pipeline was impossible before, and it now comes to users as an exposed API.
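To make the ordering concrete, attaching quantization amounts to appending one more operation after packing. A toy sketch with simple symmetric int8 per-expert quantization; the function names are hypothetical and stand in for conversion-pipeline operations, not the transformers API:

```python
import numpy as np

def pack_experts(tensors):
    """Stack per-expert weights into one [num_experts, out, in] tensor."""
    return np.stack(tensors, axis=0)

def quantize_per_expert(packed):
    """Symmetric int8 with one scale per expert, computed over its slice."""
    scales = np.abs(packed).max(axis=(1, 2), keepdims=True) / 127.0
    q = np.round(packed / scales).astype(np.int8)
    return q, scales

experts = [np.random.randn(16, 8).astype(np.float32) for _ in range(4)]

# Pack first, then quantize: per-expert scales only make sense once the
# packed [num_experts, ...] layout exists.
packed = pack_experts(experts)
q, scales = quantize_per_expert(packed)

assert q.shape == (4, 16, 8) and scales.shape == (4, 1, 1)
```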
Expert Backend
Once experts are packed into a single runtime tensor, another question arises:
How do you actually route through them efficiently?
In a Mixture of Experts model, each token is routed to different experts. This means the runtime must dispatch tokens to their chosen expert weights, execute the projections efficiently, apply the routing weights, and then collect and reorder the results.
This is what the Experts Backend system (introduced in PR #42697) addresses. The Experts Backend introduces a pluggable execution architecture that decouples expert computation from the model implementation. Instead of hardcoding one dispatch strategy inside each MoE model, the system allows expert layers to dynamically select a backend at runtime.
This is implemented via a decorator pattern:
@use_experts_implementation
The decorator wraps expert classes and automatically dispatches computation to the chosen backend.
Three backends are currently provided:
- eager loops over the selected experts and applies projections per expert. It serves as a correctness reference and for debugging.
- batched_mm uses the torch.bmm API. It duplicates the selected expert weights per token and performs a single batched GEMM. This backend is well suited to small-batch, GPU-heavy workloads where memory is available.
- grouped_mm uses the torch._grouped_mm API. Here we sort tokens by expert ID, group them, and then perform a single grouped GEMM. This backend shines with large batches or memory-constrained setups.
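The difference between backends is easiest to see in a sketch. Below, an eager-style per-token loop and a grouped-style dispatch (sort tokens by expert, one matmul per group, scatter results back) compute the same output; this uses numpy with top-1 routing purely for illustration:

```python
import numpy as np

num_tokens, hidden, num_experts = 6, 4, 3
x = np.random.randn(num_tokens, hidden)
experts = np.random.randn(num_experts, hidden, hidden)  # packed expert weights
top1 = np.array([2, 0, 1, 0, 2, 1])                     # router choice per token

# eager-style: loop token by token -- the correctness reference.
out_eager = np.stack([x[t] @ experts[top1[t]] for t in range(num_tokens)])

# grouped-style: sort tokens by expert id, one matmul per contiguous group.
order = np.argsort(top1, kind="stable")
out_grouped = np.empty_like(out_eager)
for e in range(num_experts):
    idx = order[top1[order] == e]           # tokens assigned to expert e
    out_grouped[idx] = x[idx] @ experts[e]  # single GEMM for the whole group

assert np.allclose(out_eager, out_grouped)  # same result, fewer, larger matmuls
```

A real grouped kernel fuses the per-expert matmuls into one launch; the sorting and scatter-back bookkeeping is the part this sketch makes visible.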
Expert Parallelism
Mixture of Experts (MoE) models can have hundreds of billions of parameters (far more than what fits on a single GPU). Expert parallelism (EP) addresses this by distributing experts across multiple devices. Each device loads only its assigned subset of experts, computes for those experts, and then participates in result aggregation. This approach scales models to far larger parameter counts without increasing computation cost, because each token activates only a few experts.
Expert parallelism is enabled via enable_expert_parallel:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig
distributed_config = DistributedConfig(enable_expert_parallel=True)
model = AutoModelForCausalLM.from_pretrained(
"openai/gpt-oss-120b",
dtype="auto",
distributed_config=distributed_config,
)
Launch with:
torchrun --nproc-per-node N script.py
Where N evenly divides the total number of experts, and ideally matches the number of GPUs in your node.
When enable_expert_parallel=True, the model switches from the standard tensor-parallel (TP) plan to an expert-parallel (EP) plan with specialized sharding strategies.
The core components of EP are:
- GroupedGemmParallel: splits the expert weights along the expert dimension (dim=0), so each device loads only num_experts / num_devices experts.
- RouterParallel: remaps global expert indices to local indices, masks out experts not assigned to the current rank, ensures each device computes only with its local experts, and uses an all-reduce to combine partial outputs across devices.
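The index bookkeeping can be sketched in a few lines: with E experts over D devices, rank r owns the contiguous slice [r·E/D, (r+1)·E/D); global router indices are remapped into that local range and everything else is masked out (the all-reduce of partial outputs is omitted). This is an illustrative sketch of the idea, not the RouterParallel implementation:

```python
import numpy as np

num_experts, num_devices, rank = 8, 2, 1
local = num_experts // num_devices           # 4 experts per device
lo, hi = rank * local, (rank + 1) * local    # rank 1 owns experts [4, 8)

global_ids = np.array([0, 5, 3, 7, 4, 1])    # router output: global expert ids

mask = (global_ids >= lo) & (global_ids < hi)    # tokens this rank processes
local_ids = np.where(mask, global_ids - lo, -1)  # -1 marks "not my expert"

print(local_ids)  # [-1  1 -1  3  0 -1]
```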
Training MoEs with Transformers
MoEs are excellent for scaling inference, but training them is significantly more complex.
MoEs have a massive parameter count, distributed expert communication is complicated, and there are routing instabilities that must be handled. To address this, we collaborated with Unsloth to enable significantly faster Mixture-of-Experts training:
- ~12× faster MoE training
- >35% VRAM reduction
- ~6× longer context
- 12–30× overall speedup compared to v4
We leverage the Expert Backend abstraction, standardize around PyTorch’s torch._grouped_mm API and use custom Triton grouped-GEMM + LoRA kernels. Unsloth builds on top of the Transformers (and TRL) optimizations to push performance further.
For full details, we recommend reading: Unsloth’s official guide
Conclusion
As sparse architectures continue to evolve, we want the transformers library to evolve with them. If you’re building with MoEs or experimenting with new sparse ideas, we’d love to hear from you. Let us know what abstractions, kernels, or workflows you’d like to see next in transformers.


