The Surprising Key to Distilling Efficient Reasoning Models




We converted our 15B reasoning model to a Mamba hybrid achieving 2.1x throughput with minimal quality loss. The key? A non-obvious insight about what data to distill on, and why intuition fails here.

When MiniMax published their M2 post-mortem in October explaining why they abandoned efficient attention at 230B scale, the narrative briefly became “efficient attention is dead.” Within days, Kimi Linear proved otherwise. The actual lesson: it depends on your constraints.

Our constraint was simple: we had a strong 15B reasoning model and needed to make it efficient without starting over. No infinite compute for 20T-token pretraining. No luxury of architectural co-design from day one. Just a practical question: can you retrofit efficiency into an existing model through distillation?

Spoiler: yes, but only if you ignore your intuition about what data to use.



What We Built

The Apriel-H1 family: seven checkpoints spanning 25-40 Mamba layers (out of 50 total), showing the full efficiency-quality frontier. Our flagship Apriel-H1-15b-Thinker-SFT achieves 2.1x throughput with minimal quality loss: MATH500 and MTBench improve by a few points (0.90 → 0.92 and 8.30 → 8.58, respectively), while GSM8k (0.97 → 0.95), GPQA (0.59 → 0.55), and AIME24 (0.70 → 0.65) regress slightly. Total training: 76.8B tokens.

Apriel-H1 Evaluation Results

Apriel-H1-15b-Thinker-SFT (green) vs full-attention teacher (blue). Reasoning quality stays nearly flat across benchmarks while throughput increases 1.89-2.09x depending on context length.

The full details are in our Apriel-H1 paper. Here, we focus on the key insight that made it work.



The Non-Obvious Insight

Here’s what we initially thought would work: just distill on pretraining data and round it out with some SFT.

The reasoning seemed solid. We’re inserting completely new Mamba layers that have never seen data. These linear SSMs must learn general-purpose token mixing from scratch. How can they become effective mixers unless they get exposure to the same broad distribution the original attention layers saw?

So we tried it. Then we tried mixing pretraining and SFT data. Neither worked: the distilled hybrids lost reasoning quality, sometimes dramatically.

What actually worked: high-quality reasoning traces from the teacher’s SFT dataset.

Distilling a reasoning model is not about transferring general next-token prediction. The base model already has that, and we started from a strong 15B foundation. What we’re preserving is specific and fragile: the teacher’s multi-step reasoning patterns.

Those patterns emerge from intricate attention mechanisms. Retrieval heads pulling context from thousands of tokens back. Induction heads recognizing and continuing logical chains. Long-range dependencies connecting premises to conclusions many steps later. When you replace attention wholesale with Mamba’s linear recurrence, these computational mechanisms are disrupted. The hybrid must discover new paths to the same reasoning outcomes.

That discovery requires explicit examples where the reasoning structure is visible and correct:

  • Multi-step math proofs where each thought follows from the previous
  • Coding tasks with clear logical dependencies
  • Scientific analysis with detailed explanatory chains

Pretraining data, on the other hand, is too noisy and too diffuse. The reasoning signal gets lost. You need concentrated examples of the exact capability you are trying to preserve.

Once we understood the data selection, our distillation method became clear too. We used reverse KL divergence (temperature 1) rather than forward KL. Reverse won consistently. Why? We’re training on problems where the teacher has high confidence and clear structure. Reverse KL’s mode-seeking behavior encourages the student to commit to those high-confidence predictions. When your teacher is confident and correct, you want your student to be confident too.
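For concreteness, here is a minimal sketch of a token-level reverse KL distillation loss in PyTorch. It assumes teacher and student logits are already computed; the function name and shapes are illustrative, not Fast-LLM’s internal implementation.

import torch.nn.functional as F

def reverse_kl_loss(student_logits, teacher_logits):
    # KL(student || teacher) at temperature 1, averaged over all token positions.
    # Mode-seeking: the student concentrates mass where the teacher is confident,
    # rather than covering the teacher's full distribution (forward KL).
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    return kl.mean()

# Usage: logits are [batch, seq_len, vocab]; detach the teacher so no gradients flow to it.
# loss = reverse_kl_loss(student_out.logits, teacher_out.logits.detach())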

This insight is the key to the whole approach: match your distillation data to the capability you are preserving, not the capability you are building.



How to Apply It: Staged Distillation

You can’t just swap 40 attention layers for Mamba and hope. We learned this the hard way, and eventually developed a staged distillation procedure to get there reliably.

Stage 1: Identify the least-important layers. We used a Leave-One-Out (LOO) analysis on MMLU: remove each layer, replace it with an identity mapping, and measure the accuracy drop. Sort by importance, then replace the bottom 25 with Mamba-in-Llama (MIL) initialized mixers. Distill end-to-end. This worked for our H-25 checkpoint.
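A sketch of that LOO ranking loop, assuming a hypothetical evaluate_mmlu helper and a model.decoder.layers access path; neither is Fast-LLM’s actual API.

import torch.nn as nn

class IdentityMixer(nn.Module):
    # Passes hidden states through unchanged, disabling this layer's token mixing.
    def forward(self, hidden_states, **kwargs):
        return hidden_states

def rank_layers_by_importance(model, evaluate_mmlu):
    # Leave-one-out: swap one mixer for identity at a time, measure the MMLU drop.
    baseline = evaluate_mmlu(model)
    importance = {}
    for idx, layer in enumerate(model.decoder.layers):
        original_mixer = layer.mixer
        layer.mixer = IdentityMixer()                      # identity stand-in
        importance[idx] = baseline - evaluate_mmlu(model)  # bigger drop = more important
        layer.mixer = original_mixer                       # restore before the next layer
    # Smallest drop first: these are the cheapest layers to convert to Mamba.
    return sorted(importance, key=importance.get)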

Stage 2: Progressive conversion beyond 25 layers. LOO broke down past 25 layers because layers that were unimportant in isolation became critical together. To address this, we developed a dynamic heuristic we call MIL-Mamba-Replace (MMR). For each remaining attention layer, we initialize a Mamba mixer with MIL, run 100 training steps, and record the distillation loss. Layers converging to lower loss are “easier” to replace. This captures training dynamics rather than static importance.
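A sketch of that scoring loop, assuming hypothetical mil_init_mamba_from_attention and distill_steps helpers (the real runs use Fast-LLM’s distillation setup shown later):

import copy

def mmr_scores(hybrid, teacher, remaining_attention_layers, data_loader, steps=100):
    # Trial-convert each remaining attention layer, distill briefly, record the loss.
    scores = {}
    for idx in remaining_attention_layers:
        candidate = copy.deepcopy(hybrid)
        layer = candidate.decoder.layers[idx]
        # MIL init: seed the Mamba mixer from the attention projections.
        layer.mixer = mil_init_mamba_from_attention(layer.mixer)
        # Short distillation run; the final loss measures how quickly this layer adapts.
        scores[idx] = distill_steps(candidate, teacher, data_loader, steps=steps)
    # Lower loss = easier to replace; convert those layers in the next increment.
    return sorted(scores, key=scores.get)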

We progressed incrementally: 25 → 27 → 30 → 34 → 37 → 40 Mamba layers, grouping replacements by MMR scores. Each checkpoint distills from the previous.

Stage 3: End-to-end training on SFT data. After reaching the target Mamba layer count, we did a final SFT pass until reasoning performance stabilized. After 55.9B distillation tokens and 20.9B SFT tokens, this produced our final Apriel-H1-15b-Thinker-SFT model.

Apriel-H1 Family Performance

The complete efficiency frontier. Each checkpoint shows cumulative training tokens. Our flagship H-30-SFT (released as Apriel-H1-15b-Thinker-SFT) used 76.8B tokens total for 2.1x throughput at a 0.76 average score. The aggressively converted H-40 variant used 136.5B tokens for 3.4x throughput. For reference: NVIDIA’s Nemotron-Nano-9B-v2 achieves 4.6x at a 0.77 score but required training from scratch with orders of magnitude more compute.



Making It Reproducible: Fast-LLM

We built all this on Fast-LLM, our open-source training framework. The core architectural principle: large language model transformers should be modular. Attention and Mamba are different implementations of the same “mixing” interface and can be swapped freely.

Here’s a hybrid architecture in Fast-LLM’s config format:

decoder:
  type: "pattern"
  blocks:
    attention_block:
      mixer:
        type: "attention"
        heads: 32
        head_groups: 8
        head_size: 128
      mlp:
        type: "gated"
        activation: "silu"
    mamba_block:
      mixer:
        type: "mamba"
        d_inner: 4096
        state_size: 16
        dt_rank: 16
      mlp:
        type: "gated"
        activation: "silu"
  num_blocks: 50
  pattern: ["attention_block", "attention_block", "mamba_block", ...]

The pattern field specifies layer order. For Apriel-H1-15b-Thinker-SFT: 30 mamba_block, 20 attention_block, placed by importance. That’s it.
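If you rank layers as in Stage 1, generating that pattern field is a one-liner. A purely illustrative sketch (build_pattern is not a Fast-LLM utility):

def build_pattern(num_blocks, ranked_layers, num_mamba):
    # ranked_layers: layer indices ordered least- to most-important (LOO/MMR output).
    mamba_positions = set(ranked_layers[:num_mamba])
    return ["mamba_block" if i in mamba_positions else "attention_block"
            for i in range(num_blocks)]

# build_pattern(50, ranked_layers, num_mamba=30) yields 30 mamba_block and
# 20 attention_block entries in layer order for the config above.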

Distillation is configuration too:

model:
  base_model:
    head:
      distillation_model: teacher
      distillation_loss_implementation: reverse_kl
reference_models:
  teacher:
    pretrained:
      format: mistral
      path: path/to/Apriel-Nemotron-15b-Thinker

Fast-LLM handles gradient accumulation, distributed training, tensor parallelism, checkpointing, everything you need for large-scale experimentation. It’s open source and licensed under Apache 2.0. You can reproduce this work because we designed the infrastructure to make it reproducible.



FAQs

Why release all checkpoints? Because the optimal choice depends on your constraints. H-30 offers the best balance. H-40 maximizes throughput for latency-critical workloads. The intermediate checkpoints let you pick your exact trade-off.

Why do you get different speedups at different context lengths? Mamba’s linear complexity advantage grows with sequence length, while attention’s cost grows quadratically.
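A rough back-of-envelope illustrates the trend. The sketch below counts only per-layer prefill cost with illustrative dimensions, ignoring the MLP, the remaining attention layers, and memory effects, so the printed ratios show the scaling behavior, not observed end-to-end throughput:

def attention_prefill_cost(seq_len, d_model):
    return seq_len ** 2 * d_model          # pairwise token interactions: quadratic in length

def mamba_prefill_cost(seq_len, d_inner, state_size):
    return seq_len * d_inner * state_size  # linear scan over the sequence

for n in (4_096, 32_768):
    ratio = attention_prefill_cost(n, 5_120) / mamba_prefill_cost(n, 4_096, 16)
    print(f"context {n}: attention/Mamba cost ratio ≈ {ratio:.0f}x")  # grows linearly with n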

Why did you only try Mamba? We used Mamba-1 for three reasons: it has a proven distillation track record, has shown strong empirical performance, and was easy to implement in our framework. It let us focus on the data question first.

What were the Mamba hyperparameters? State size 16, DT rank 16, inner dimension 4096. For our GQA setup in Apriel, we expanded B (input projection) and x (state) to match the total number of attention heads, following M1.

Why didn’t you try more advanced conversion methods? We used Mamba-in-Llama initialization and knowledge distillation rather than MOHAWK’s multi-stage procedure because the latter didn’t show significant benefits in preliminary experiments.

Why did you only SFT the H-30 model? We applied SFT only to H-30 to validate that distilled hybrids can be improved through standard post-training. The other checkpoints are pure distillation but can be fine-tuned similarly.

Why didn’t you explore RL? This was a scoping decision to isolate the distillation question: can you transfer reasoning via knowledge distillation alone? Answer: yes. But RL should close the remaining quality gaps further. We’re exploring RL for future iterations.

Did you actually show that Apriel-H1 matches full-attention reasoning at similar compute budgets? We didn’t do an apples-to-apples comparison between full-attention Apriel and a hybrid trained identically from pretraining forward. That would require repeating all mid-training and post-training of the teacher with the Apriel-H1 architecture, which was beyond our compute budget. What we can claim, though, is that retrofitting efficiency via distillation is practical and effective, and that the resulting hybrids can be fine-tuned to match or exceed the teacher’s reasoning quality.



The Production Reality

We have implemented Apriel-H1 in Hugging Face Transformers and vLLM. Transformers integration is straightforward: we ship a new model class with interchangeable attention and Mamba layers. vLLM integration uses their recent Mamba cache operations for continuous batching, prefix caching, and chunked prefill. The vLLM plugin is ready; we’re currently waiting for final legal approval to open-source it.

Honest assessment: Deploying hybrids today means rough edges. The tooling is maturing fast but isn’t turnkey. You’ll write custom code, validate numerical behavior rigorously, and work around framework limitations. For teams that can absorb that cost, the throughput gains are worth it. For those that cannot, waiting may be the right call.



Takeaway

Most teams don’t have infinite compute for 20T-token pretraining. If you’ve invested in a strong base model and need efficiency gains, this work shows a practical path: distill into hybrids using high-quality, task-specific data that matches the capability you are preserving.

The surprising finding, use reasoning data to distill reasoning, seems obvious in hindsight but contradicts initial intuition. We validated it, explained why it works, and built the infrastructure to make it reproducible.



Try It

Models: Apriel-H1 Collection on HuggingFace
Training framework: Fast-LLM on GitHub
Teacher model: Apriel-Nemotron-15B-Thinker
Paper: Apriel-H1: Towards Efficient Enterprise Reasoning Models

Found something broken? File an issue. Discovered a better layer placement heuristic? Let us know. Built something interesting on Apriel-H1? We want to see it.


Citation:

@article{apriel-h1-2025,
  title={Apriel-H1: Towards Efficient Enterprise Reasoning Models},
  author={SLAM Lab, ServiceNow},
  journal={arXiv preprint arXiv:2511.02651},
  year={2025}
}

Core contributors: Oleksiy Ostapenko, Luke Kumar, Raymond Li, Denis Kocetkov, Joel Lamy-Poirier, Torsten Scholak
Contributors: Shruthan Radhakrishna, Soham Parikh, Shambhavi Mishra
Technical co-leads: Torsten Scholak, Sathwik Tejaswi Madhusudhan


