Mixture of Experts (MoEs) in Transformers

Over the past few years, scaling dense language models has driven most progress in LLMs. From early models like the original ULMFiT (~30M parameters) or GPT-2 (1.5B parameters, which at the time was considered “too dangerous to release” 🧌), and eventually to today’s hundred-billion–parameter systems, the recipe was simple:

More data + more parameters yields better performance.

Scaling laws reinforced this trend, but dense scaling has practical limits:

  • Training becomes increasingly expensive.
  • Inference latency grows.
  • Deployment requires significant memory and hardware.

That is where Mixture of Experts (MoEs) enter the picture.

If you’re already familiar with MoEs and want to jump straight into the engineering work done in transformers, you can head directly to Transformers and MoEs.



From Dense to Sparse: What Are MoEs?

A Mixture of Experts model keeps the Transformer backbone, but replaces certain dense feed-forward layers with a set of experts. An “expert” is not a topic-specialized module (e.g., “math expert”, “code expert”). It is simply a learnable sub-network. For each token, a router selects a small subset of experts to process it.

Different tokens activate different experts, based on their hidden representations.

Model capacity depends on total parameters, but inference speed depends on active parameters.

That is the key idea.

For example, take gpt-oss-20b. It has 21B total parameters, but uses 4 active experts per token, out of a total of 32 experts. Counting the shared components plus the active experts, this model uses ~3.6B active parameters per token. Running this model on an M3 Ultra Mac, which has a memory bandwidth of about 800 GB/s, we can estimate generation speed as ~800 / (3.6 * 2) in bfloat16, where each parameter takes 2 bytes. This yields about 111 tokens per second. The measured number is ~115 tok/s, which is very close to the back-of-the-envelope calculation.

This speed confirms that the model runs roughly like a 3.6B-parameter model, while retaining the capacity (or quality) of a 21B-parameter model.

(Note: speed would be even faster if we used kernels for the native mxfp4 quantization the model uses.)
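The estimate above can be reproduced in a few lines. This is a back-of-the-envelope sketch, assuming a purely bandwidth-bound decode where every active parameter is read once per generated token:

```python
# Back-of-the-envelope decode speed for a bandwidth-bound MoE,
# using the gpt-oss-20b numbers from the text.
active_params = 3.6e9    # active parameters per token
bytes_per_param = 2      # bfloat16
bandwidth = 800e9        # M3 Ultra memory bandwidth, in bytes/s

tokens_per_second = bandwidth / (active_params * bytes_per_param)
print(round(tokens_per_second))  # ~111 tok/s
```

Attention-cache reads and kernel overheads are ignored here, which is why the real number differs slightly.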

MoEs are attractive for these reasons:

  1. Higher Compute Efficiency

    Given a fixed training FLOP budget, MoEs often outperform dense counterparts.

    This means faster iteration and better scaling efficiency.

  2. A Natural Parallelization Axis

    Experts provide a structural boundary in the computation graph. Since different tokens engage different experts, we can parallelize across experts (we discuss this later in Expert Parallelism).

  3. Industry Adoption

    Major open MoE releases in the past few weeks include Qwen 3.5, MiniMax M2, GLM-5, and Kimi K2.5.

    The trend accelerated after the success of DeepSeek R1 in January 2025, building on earlier systems like DeepSeek V2. Another early MoE was Mixtral-8x7B, released in December 2023.

    Figure 3: Two-year timeline of MoE model additions to the transformers library. DeepSeek R1 marks a clear inflection point.

    Closed labs use MoEs too. ChatGPT has long been rumored to use a sparse architecture, and the open gpt-oss models actually do.

If you want to learn more about MoEs in general, we strongly recommend reading this blog and watching our recent YouTube video on routing.



Transformers and MoEs

Most tooling in the ecosystem (model loading, device placement, quantization, and backend execution) was originally designed for dense models. MoEs challenge these assumptions.

Making MoEs first-class citizens in transformers means redesigning parts of the loading pipeline, execution model, and distributed abstractions, not just adding new model classes. We’ll focus on how the transformers library has evolved to support sparse architectures across:



Weight Loading Refactor

AutoModelForCausalLM.from_pretrained("model_id") downloads and loads model weights into a PyTorch model. For dense models, loading is relatively straightforward: each tensor in the checkpoint maps one-to-one to a parameter in the runtime module.

For MoEs, it’s more complicated. In most MoE checkpoints, each expert is serialized independently. If you peek inside the DeepSeek-V3 checkpoint index, you’ll see keys like:

model.layers.3.mlp.experts.0.gate_proj.weight
...
model.layers.3.mlp.experts.255.gate_proj.weight

Each expert has its own set of weight matrices, essentially 256 small feed-forward networks (indices 0 to 255, taking DeepSeek-V3 as an example) saved side by side. At runtime, however, GPUs execute optimized kernels. Modern MoE kernels such as grouped GEMMs and fused MoE implementations are designed to process all experts in a single operation, not by looping over them one by one.

To do this efficiently, they require expert weights to be packed into a single contiguous tensor.

So we have a mismatch:

  • Checkpoint: 256 separate tensors
  • Runtime: 1 packed tensor

Bridging this gap systematically is what the weight loading refactor enables.

With the introduction of a generic WeightConverter, the mental model shifted from:

A checkpoint already matches my runtime layout; loading is simply a key-by-key copy.

to:

A checkpoint is just a serialized source of tensors. Loading is a conversion pipeline that transforms them into the runtime layout we want.



Dynamic Weight Loading with WeightConverter

The central abstraction introduced by this refactor is dynamic weight loading via a WeightConverter.

WeightConverter lets us define:

source key patterns → target key(s) + operations

Primitive operations (chunk, concatenate, etc.) are composable. Two that are particularly useful for MoEs:

  • MergeModulelist merges a list of tensors into a single tensor. For example, you can compose MergeModulelist with Concatenate to stack the experts in a MoE and pack them into one tensor.

    WeightConverter(
        ["block_sparse_moe.experts.*.w1.weight", "block_sparse_moe.experts.*.w3.weight",],
        "mlp.experts.gate_up_proj",
        operations=[
            MergeModulelist(dim=0),
            Concatenate(dim=1),
        ],
    )
    
  • SplitModulelist splits a tensor back into a list of tensors. For example, you can split a stack of experts back into individual experts.

    WeightConverter(
        "mlp.experts.down_proj",
        "block_sparse_moe.experts.*.w2.weight",
        operations=[SplitModulelist(dim=0)],
    )
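
To make the tensor shapes concrete, here is a shape-level sketch of what the MergeModulelist + Concatenate converter above produces. Plain Python tuples stand in for tensor shapes, and the expert count and dimensions are illustrative, not the real model's:

```python
# Checkpoint side: per-expert w1 (gate) and w3 (up) weights,
# each of shape (intermediate, hidden).
num_experts, hidden, intermediate = 4, 8, 16

# MergeModulelist(dim=0): stack the per-expert tensors along a new
# leading expert dimension -> (num_experts, intermediate, hidden).
merged_w1 = (num_experts, intermediate, hidden)
merged_w3 = (num_experts, intermediate, hidden)

# Concatenate(dim=1): fuse gate and up projections into one packed
# gate_up_proj tensor -> (num_experts, 2 * intermediate, hidden).
gate_up_proj = (num_experts, merged_w1[1] + merged_w3[1], hidden)
print(gate_up_proj)  # (4, 32, 8)
```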
    



Lazy Materialization of Tensors

The refactor improves not only what conversions exist, but how they’re scheduled.

The loader scans checkpoint keys once, matches them against converter patterns, and groups tensors per converter. Once a key is identified as needed, it’s registered as a future and materialized via a thread pool. Conversion operations run only once their dependencies are ready. For example, MergeModulelist waits until all experts for a layer are loaded.

This avoids repeated scans and reduces memory peaks.
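
The scheduling idea can be sketched with standard-library futures. This is an illustrative toy, not the actual transformers loader; load_tensor is a hypothetical stand-in for reading one tensor from a checkpoint shard:

```python
from concurrent.futures import ThreadPoolExecutor

def load_tensor(key):
    # Stand-in for reading one tensor from a checkpoint shard.
    return [0.0] * 8

expert_keys = [f"experts.{i}.gate_proj.weight" for i in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each needed key is registered as a future up front...
    futures = {k: pool.submit(load_tensor, k) for k in expert_keys}
    # ...and the merge runs only once all of its dependencies are
    # ready, packing the per-expert tensors into one buffer
    # (a MergeModulelist-style operation).
    packed = [futures[k].result() for k in expert_keys]

print(len(packed))  # 4 experts packed together
```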



Benchmark: Weight-Loading Pipeline Improvements

To evaluate the improvements introduced by the new weight-loading pipeline, we benchmarked v4 vs v5 of transformers. The focus is on loading speed for large MoE models, which is often a bottleneck in training and inference.

We benchmarked v4 vs v5 using the following example:

from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen1.5-110B-Chat"
model = AutoModelForCausalLM.from_pretrained(model_id)

Two relevant environment variables:



Results

Model: Qwen/Qwen1.5-110B-Chat
GPU: 1× A100 (80GB)

Version   Strategy             Loading Mode       Time
v4.57.6   device_map="auto"    Threadpool         66.24s
v4.57.6   device_map="auto"    Sequential         67.29s
v4.57.6   TP                   -                  OOM
v5        device_map="auto"    Async (default)    20.71s
v5        device_map="auto"    Sync               45.3s
v5        TP                   Async              10.1s
v5        TP                   Sync               19.28s
Figure 4: Loading benchmarks (v4 vs v5)

The speedup is not just “more threads.”

It’s the combination of single-pass routing, async materialization, and conversion-aware scheduling, which together avoid unnecessary materialization and memory peaks while enabling expert packing and projection fusion at load time.



Where Quantization Suits In

With this refactor we can now create the runtime module structure first and then convert the weights into that structure. We can optionally attach quantization within the conversion pipeline, making quantization part of the weight-loading pipeline itself. This is crucial because quantizing “per expert” only makes sense once experts exist in a predictable packed layout.

This end-to-end pipeline was not possible before; it is now exposed to users as an API.



Expert Backend

Once experts are packed into a single runtime tensor, another question arises:

How do you actually route through them efficiently?

In a Mixture of Experts model, each token is routed to different experts. This means the runtime must dispatch tokens to their chosen expert weights, execute the projections efficiently, apply the routing weights, and then collect and reorder the results.

This is what the Experts Backend system (introduced in PR #42697) addresses. The Experts Backend introduces a pluggable execution architecture that decouples expert computation from the model implementation. Instead of hardcoding one dispatch strategy inside each MoE model, the system allows expert layers to dynamically select a backend at runtime.

That is implemented via a decorator pattern:

@use_experts_implementation

The decorator wraps expert classes and dispatches computation to the chosen backend automatically.
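
The dispatch idea can be illustrated with a toy registry. This is a hypothetical sketch of the pattern, not the real @use_experts_implementation internals:

```python
# Minimal backend registry: named implementations are registered
# once, and the expert layer dispatches to whichever is selected.
BACKENDS = {}

def register_backend(name):
    def deco(fn):
        BACKENDS[name] = fn
        return fn
    return deco

@register_backend("eager")
def eager_forward(tokens):
    # Stand-in for the per-expert loop of the eager backend.
    return [t * 2 for t in tokens]

def run_experts(tokens, backend="eager"):
    # The wrapped class would pick the backend at runtime like this.
    return BACKENDS[backend](tokens)

print(run_experts([1, 2, 3]))  # [2, 4, 6]
```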

Three backends are currently provided:

  1. eager, which loops over the selected experts and applies the projections per expert. This is used as a correctness reference and for debugging.

  2. batched_mm, which uses the torch.bmm API. This duplicates the selected expert weights per token and performs a single batched GEMM. This backend is well suited to small-batch, GPU-heavy workloads where memory is available.

  3. grouped_mm, which uses the torch._grouped_mm API. Here we sort tokens by expert ID, group them, and then perform a single grouped GEMM. This backend shines with large batches or memory-constrained setups.
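
As a minimal sketch of what the eager path computes, here is the routing-and-combine loop with plain Python stand-ins for the real per-expert GEMMs (expert_fn and the routing values are hypothetical):

```python
def expert_fn(expert_id, token):
    # Stand-in for one expert's feed-forward projection.
    return [x * (expert_id + 1) for x in token]

def eager_moe(tokens, topk_ids, topk_weights):
    outputs = []
    for token, ids, weights in zip(tokens, topk_ids, topk_weights):
        acc = [0.0] * len(token)
        # Eager path: loop over the selected experts for this token,
        # scale each expert's output by its routing weight, and sum.
        for eid, w in zip(ids, weights):
            for j, v in enumerate(expert_fn(eid, token)):
                acc[j] += w * v
        outputs.append(acc)
    return outputs

tokens = [[1.0, 2.0], [3.0, 4.0]]
topk_ids = [[0, 2], [1, 3]]            # 2 active experts per token
topk_weights = [[0.6, 0.4], [0.5, 0.5]]
print(eager_moe(tokens, topk_ids, topk_weights))
```

batched_mm and grouped_mm compute the same result; they differ only in how the per-expert work is laid out for the GPU.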

Figure: Expert backend illustration



Expert Parallelism

Mixture of Experts (MoE) models can have hundreds of billions of parameters (far more than fits on a single GPU). Expert parallelism (EP) addresses this by distributing experts across multiple devices. Each device loads only its assigned subset of experts, computes with those experts, and then participates in result aggregation. This approach scales models to far larger parameter counts without increasing computation cost, because each token activates only a few experts.

Expert parallelism is enabled via enable_expert_parallel:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed.configuration_utils import DistributedConfig

distributed_config = DistributedConfig(enable_expert_parallel=True)

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-120b",
    dtype="auto",
    distributed_config=distributed_config,
)

Launch with:

torchrun --nproc-per-node N script.py

where N evenly divides the total number of experts and typically matches the number of GPUs in your node.

When enable_expert_parallel=True, the model switches from the standard tensor-parallel (TP) plan to an expert-parallel (EP) plan with specialized sharding strategies.

Core components of EP lie in:

  1. GroupedGemmParallel: This splits the expert weights along the expert dimension (dim=0). Each device loads only num_experts / num_devices experts.

  2. RouterParallel: This remaps global expert indices to local indices, masks out experts not assigned to the current rank, ensures each device computes only with its local experts, and uses an all-reduce to combine partial outputs across devices.
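
The index remapping can be sketched as follows. This is a toy illustration of the idea, not the actual RouterParallel code; -1 marks tokens whose chosen expert lives on another rank:

```python
def local_expert_mask(global_ids, rank, experts_per_rank):
    # Each rank owns a contiguous slice of the expert table:
    # global ids [lo, hi) map to local ids [0, experts_per_rank).
    lo = rank * experts_per_rank
    hi = lo + experts_per_rank
    return [g - lo if lo <= g < hi else -1 for g in global_ids]

# 8 experts sharded over 4 ranks -> 2 experts per rank.
routed = [0, 3, 5, 2, 7]  # global expert id chosen per token
print(local_expert_mask(routed, rank=1, experts_per_rank=2))
# [-1, 1, -1, 0, -1]: rank 1 owns global experts 2 and 3
```

Tokens masked with -1 contribute zeros locally; the all-reduce then sums each rank's partial outputs into the full result.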



Training MoEs with Transformers

MoEs are excellent for scaling inference, but training them is significantly more complex.

MoEs have a massive parameter count, distributed expert communication is complicated, and there are routing instabilities that must be handled. To address this, we collaborated with Unsloth to enable significantly faster Mixture-of-Experts training:

  • ~12× faster MoE training
  • >35% VRAM reduction
  • ~6× longer context
  • 12–30× overall speedup compared to v4

We leverage the Expert Backend abstraction, standardize around PyTorch’s torch._grouped_mm API, and use custom Triton grouped-GEMM + LoRA kernels. Unsloth builds on top of the Transformers (and TRL) optimizations to push performance further.

For full details, we recommend reading: Unsloth’s official guide



Conclusion

As sparse architectures continue to evolve, we want the transformers library to evolve with them. If you’re building with MoEs or experimenting with new sparse ideas, we’d love to hear from you. Let us know what abstractions, kernels, or workflows you’d like to see next in transformers.


