SmolLM3: smol, multilingual, long-context reasoner

Small language models have become increasingly important as users seek capable models that can be deployed efficiently. The community has produced an interesting range of capable small models, each pushing the boundaries of what's possible at this scale. With SmolLM3, we're excited to contribute a new, competitive, fully open 3B model:

SmolLM3 sits in the efficiency sweet spot. Our 3B model outperforms Llama-3.2-3B and Qwen2.5-3B while staying competitive with larger 4B alternatives (Qwen3 & Gemma3). Beyond the performance numbers, we're sharing exactly how we built it using public datasets and training frameworks.

Model summary:

  • 3B model trained on 11T tokens, state of the art at the 3B scale and competitive with 4B models
  • Instruct model with dual mode reasoning, supporting think/no_think modes
  • Multilingual support for six languages: English, French, Spanish, German, Italian, and Portuguese
  • Long context up to 128k with NoPE and YaRN

The full recipe: We're releasing SmolLM3 with our engineering blueprint. It includes architecture details, exact data mixtures showing how we progressively boost performance across domains in a three-stage pretraining approach, and the methodology for building a hybrid reasoning model. Usually, achieving these results would require months of reverse engineering. Instead, we're providing the complete methodology.

Whether you're building your own models or want to understand what drives performance at this scale, this blueprint shows the engineering story behind competitive 3B performance.

Let's take a look at the pretraining stage.

SmolLM3 modified both the architecture and the data mixture relative to its predecessors. Let's look at the architecture and training configurations first!



Architecture and training details

SmolLM3 follows a transformer decoder architecture with tied embeddings, like SmolLM2, building on the Llama architecture with some key modifications optimized for efficiency and long context performance.

Grouped Query Attention (GQA): We replaced multi-head attention with grouped-query attention using 4 groups. Our ablations on a 3B model trained with 100B tokens from FineWeb-Edu showed that GQA matches the performance of multi-head attention while significantly reducing the KV cache size during inference.
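
To make the KV cache saving concrete, here is a minimal PyTorch sketch of grouped-query attention; the shapes are illustrative rather than SmolLM3's exact dimensions:

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # (batch, num_kv_heads, seq, head_dim) -> (batch, num_kv_heads * n_rep, seq, head_dim)
    b, h_kv, s, d = x.shape
    return x[:, :, None].expand(b, h_kv, n_rep, s, d).reshape(b, h_kv * n_rep, s, d)

# Illustrative: 16 query heads share 4 KV groups, so the KV cache stores 4 heads, not 16.
num_q_heads, num_kv_heads, head_dim, seq = 16, 4, 64, 10
q = torch.randn(1, num_q_heads, seq, head_dim)
k = torch.randn(1, num_kv_heads, seq, head_dim)
v = torch.randn(1, num_kv_heads, seq, head_dim)

k = repeat_kv(k, num_q_heads // num_kv_heads)
v = repeat_kv(v, num_q_heads // num_kv_heads)
scores = (q @ k.transpose(-2, -1)) / head_dim**0.5
out = torch.softmax(scores, dim=-1) @ v  # same output shape as multi-head attention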

NoPE: We implemented NoPE from “RoPE to NoPE and Back Again: A New Hybrid Attention Strategy” (Yang et al., 2025), selectively removing rotary position embeddings from every 4th layer. This approach improves long context performance without affecting short context capabilities, as confirmed by our ablations.
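
A minimal sketch of the layer schedule, with illustrative names and depth rather than SmolLM3's exact configuration:

# Hypothetical per-layer schedule: keep RoPE everywhere except every 4th layer.
num_layers = 36  # illustrative depth
use_rope = [(layer_idx + 1) % 4 != 0 for layer_idx in range(num_layers)]

def apply_positions(q, k, layer_idx, rope_fn):
    if use_rope[layer_idx]:
        return rope_fn(q), rope_fn(k)  # RoPE layers rotate queries and keys
    return q, k                        # NoPE layers use no positional encoding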

Intra-Document Masking: Following “Analysing The Impact of Sequence Composition on Language Model Pre-Training”, during training we use attention masking to ensure that tokens from different documents in the same training sequence don't attend to each other. As in Llama 3, this helps with faster and more stable long context training while maintaining short context performance.
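
A minimal sketch of such a mask, assuming each token in the packed sequence is tagged with the id of its source document:

import torch

def intra_document_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
    # doc_ids: (seq_len,) document id per token in the packed sequence
    seq_len = doc_ids.shape[0]
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids[:, None] == doc_ids[None, :]  # blocks cross-document attention
    return causal & same_doc

# Three documents packed into one 8-token training sequence
mask = intra_document_causal_mask(torch.tensor([0, 0, 0, 1, 1, 2, 2, 2]))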

Training Stability: Following OLMo 2, we remove weight decay from embedding layers to enhance training stability. This modification contributed to more stable training dynamics, with embedding norms naturally stabilizing at healthier values during training without impacting overall performance in our ablations.

All these changes were validated through ablations using the same 3B architecture trained on 100B tokens from FineWeb-Edu, ensuring each modification either improved performance or maintained it while offering other advantages.

Training Configuration: We use a global batch size of 2.36M tokens with 4096 sequence length, a learning rate of 2e-4, and the AdamW optimizer (beta1: 0.9, beta2: 0.95) with weight decay of 0.1 and gradient clipping of 1. We use the WSD (Warmup-Stable-Decay) scheduler, with 2000 warmup steps and a linear decay to 0 over the final 10% of training steps. We use the nanotron framework for training, datatrove for data processing, and lighteval for evaluation. The model was trained on 384 H100 GPUs for 24 days. You can see the distributed training setup in the figure below.
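
For illustration, here is a minimal sketch of the WSD schedule with the hyperparameters above (function and variable names are ours, not nanotron's API):

def wsd_learning_rate(step: int, total_steps: int, peak_lr: float = 2e-4,
                      warmup_steps: int = 2000, decay_fraction: float = 0.10) -> float:
    # Warmup-Stable-Decay: linear warmup, constant plateau, then linear decay
    # to 0 over the final 10% of training steps.
    decay_start = int(total_steps * (1.0 - decay_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # warmup
    if step < decay_start:
        return peak_lr                        # stable plateau
    return peak_lr * (total_steps - step) / (total_steps - decay_start)  # decay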

Besides the architecture changes, we also ablated and improved the training recipe. Let's take a closer look.



Data mixture and training stages

Following SmolLM2's multi-stage approach, we train SmolLM3 on 11.2T tokens using a three-stage training strategy that mixes web, math, and code data in evolving proportions. We conducted extensive ablations on 3B models trained on 50B to 100B tokens to determine the data mixture and ratios.

Pretraining consists of these stages, also shown in the figure above and summarized as a config sketch after the list:

  • Stage 1: Stable phase (0T → 8T tokens) This foundation stage establishes strong general capabilities with our core dataset mixture:
    • Web: 85% (12% multilingual) – FineWeb-Edu, DCLM, FineWeb2 and FineWeb2-HQ
    • Code: 12% – The Stack v2 (16 programming languages), StarCoder2 pull requests, Jupyter and Kaggle notebooks, GitHub issues, and StackExchange.
    • Math: 3% – FineMath3+ and InfiWebMath3+
  • Stage 2: Stable phase (8T → 10T tokens) We introduce higher quality math and code datasets while maintaining good web coverage:
    • Web: 75% (12% Multilingual)
    • Code: 15% – Adding Stack-Edu
    • Math: 10% – Introducing FineMath4+, InfiWebMath4+, and MegaMath (including Qwen Q&A, Pro synthetic rewrites, and text-code interleaved blocks)
  • Stage 3: Decay Phase (10T → 11.1T tokens) The final stage further upsamples math and code data
    • Web: 63% (12% Multilingual)
    • Code: 24% – upsampling of high-quality code data
    • Math: 13% – upsampling math data and introducing instruction and reasoning datasets such as OpenMathReasoning
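
The same staged mixture written out as a plain config, with the percentages copied from the list above:

# Domain weights in percent; the web share includes 12% multilingual data.
PRETRAINING_STAGES = {
    "stage1_stable": {"tokens": "0T-8T",     "web": 85, "code": 12, "math": 3},
    "stage2_stable": {"tokens": "8T-10T",    "web": 75, "code": 15, "math": 10},
    "stage3_decay":  {"tokens": "10T-11.1T", "web": 63, "code": 24, "math": 13},
}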

With these stages and mixtures, we achieved very competitive performance for the base model. More on that in the evaluation section. The nanotron training configs with exact data weights can be found here. We will also share our training logs along with intermediate checkpoints.

After the main pretraining, we improved the model in a mid-training stage for long context and reasoning.

We call the long context adaptation and reasoning adaptation “mid-training”. They are much shorter than the main pretraining but still fairly general, aimed at improving the model in these two domains. Let's first take a look at long context training.



Long Context extension

After the main pretraining, we trained SmolLM3 on an additional 100B tokens to extend its context length. We sequentially extended the context window in two stages of 50B tokens each: first transitioning from 4k to 32k context with RoPE theta increased to 1.5M, then from 32k to 64k context with RoPE theta increased to 5M. Both stages upsampled math, code, and reasoning data. During ablations, we found that upsampling specific long context data such as code repositories, books, and long web pages (beyond the naturally long samples in our mixture) didn't further boost performance on the RULER and HELMET benchmarks. Using NoPE and training on the decay mixture with longer sequences and increased RoPE theta values was sufficient to achieve competitive long context performance up to 64k.

Following Qwen2.5, we use YaRN to extrapolate beyond the training context length. During inference, the model can handle up to 128k context (a 2x extension beyond the 64k training length).
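
As a sketch, YaRN extrapolation can be enabled at load time by overriding the RoPE scaling configuration; the exact rope_scaling schema depends on your transformers version, so treat the keys below as illustrative:

from transformers import AutoModelForCausalLM

# factor=2.0 reflects the 2x extension from the 64k training length to 128k
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",
    rope_scaling={
        "rope_type": "yarn",
        "factor": 2.0,
        "original_max_position_embeddings": 65536,
    },
)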



Reasoning Mid-training

After extending the context length of the model, we trained it in a mid-training stage to incorporate reasoning capabilities. The main difference between the mid-training stage and the pre- and post-training stages is that we targeted a general capability without yet specializing in a particular domain. In our case, we wanted to train the model to reason without targeting a specific domain, such as mathematics or code.

Our mid-training dataset contained 35B tokens sourced from Open Thoughts' OpenThoughts3-1.2M and a subset of NVIDIA's Llama-Nemotron-Post-Training-Dataset-v1.1 with reasoning traces from R1. We used the ChatML chat template and wrapped packing to avoid providing too much structure to the model. We trained the model for 4 epochs (~140B tokens) and used the checkpoint for the subsequent SFT stages.

The release of reasoning models like DeepSeek R1 and Qwen3 has demonstrated the powerful capabilities that emerge when models can engage in explicit reasoning. However, the community still lacks fully open recipes with public datasets for building dual instruction models that support both reasoning and non-reasoning modes. Most existing approaches involve complex reinforcement learning processes and proprietary datasets, making it difficult for researchers to reproduce and build upon these results.

In this section, we explain how we tackled these challenges and share our complete recipe for building a dual instruction model. We detail how we balance performance between reasoning and non-reasoning modes through a carefully designed training pipeline that includes mid-training for general reasoning capabilities, supervised fine-tuning with synthetic data generation, and alignment using Anchored Preference Optimization (APO), a recent variant of DPO.



Building the Chat Template

Before diving into the training methodology, it's important to establish how users interact with our dual-mode model. The chat template serves as the interface that enables seamless switching between reasoning and non-reasoning modes, and its design directly impacts both our training data format and model behavior. SmolLM3's chat template allows users to control the reasoning mode during a conversation. Users can activate reasoning or non-reasoning modes by including the /think and /no_think flags, respectively, in the system prompt. In non-reasoning mode, we pre-fill the model's response with empty think blocks, similar to Qwen3, to ensure direct answers without explicit reasoning.

SmolLM3 supports tool calling, and its chat template includes two distinct sections for tool descriptions: XML Tools and Python Tools. This explicit separation proved helpful in our experiments for the model's accurate interpretation of tool definitions in either format.

The chat template provides a default system message for both reasoning modes, along with a metadata section that includes the date, knowledge cut-off date, and current reasoning mode. Users can replace the default system message by providing one with the system role. The metadata section can be excluded by using the /system_override flag in the system prompt, offering flexibility for specific use cases.



Supervised Finetuning

Following the reasoning mid-training stage, where we trained the model on 140B tokens of general reasoning data, we proceed with Supervised Finetuning (SFT) to incorporate capabilities across both reasoning and non-reasoning modes for math, code, general reasoning, instruction following, multilinguality, and tool calling. Training a dual-mode model requires carefully balancing the data mixture to maintain strong performance in both modes across all target domains. To evaluate SmolLM3's performance throughout training, we tracked the following domains: math, code, general reasoning, instruction following, and multilinguality.

The main challenge we encountered when building the reasoning mode dataset was the scarcity of datasets containing reasoning traces for certain domains. To address this gap, we generated synthetic data by prompting Qwen3-32B in reasoning mode with prompts from existing non-reasoning datasets. This allowed us to improve performance in domains where the model initially struggled in reasoning mode, such as multi-turn conversations, multilinguality, and everyday conversations.

Our final data mixture was the result of extensive ablations examining the optimal ratio of reasoning to non-reasoning tokens and the composition within each mode. The resulting SFT dataset comprises 1.8B tokens: 1B in non-reasoning mode and 0.8B in reasoning mode, spanning 12 non-reasoning datasets and 10 datasets with reasoning traces. We trained for 4 epochs (~8B tokens) using BFD (best-fit decreasing) packing, with the loss masked on user turns and the results of tool calls.
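
A minimal sketch of the loss masking described above, assuming the data pipeline provides a per-token boolean mask of assistant turns (the helper name is ours):

import torch

IGNORE_INDEX = -100  # labels with this value are skipped by the cross-entropy loss

def build_sft_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    # Learn only on assistant tokens; user turns and tool-call results get no loss.
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels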

We will release this data mixture along with our full training scripts to enable the community to reproduce and build upon our work.



Off-policy model alignment with Anchored Preference Optimization (APO)

After the SFT step, we performed a round of model alignment using a combination of the Tulu3 preference dataset for non-reasoning mode and new synthetic preference pairs for reasoning mode, which we generated from Qwen3-32B and Qwen3-0.6B. To ensure full coverage of all domains in the non-thinking dataset, we generated complementary thinking mode preference pairs. We selected generations from Qwen3-32B as “chosen” and responses from Qwen3-0.6B as “rejected” for alignment with Anchored Preference Optimization.

Anchored Preference Optimization (APO) is a variant of Direct Preference Optimization (DPO) that provides a more stable optimization objective. In DPO, the reward function r_θ(x, y) measures the log-ratio of a sequence's probability under the model being trained compared to the reference model, i.e. the model at the start of training:

r_θ(x, y) = β · log( π_θ(y|x) / π_ref(y|x) )

Here β controls how much the model being optimized can change relative to the reference model. The DPO loss optimizes triplets of a prompt x, a chosen response y_w, and a rejected response y_l:

L_DPO = −E[ log σ( r_θ(x, y_w) − r_θ(x, y_l) ) ]

The APO objective has been shown to be more stable, and we also observed better downstream performance in our internal ablations.
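
As an illustration of what this alignment step can look like in practice, TRL's DPOTrainer exposes APO variants through its loss_type option (for example "apo_zero"); the dataset contents and hyperparameters below are placeholders, not our exact setup:

from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM3-3B"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder preference pairs: "chosen" from the strong model, "rejected" from the weak one
train_dataset = Dataset.from_dict({
    "prompt":   ["Explain gravity in simple terms."],
    "chosen":   ["<response generated by Qwen3-32B>"],
    "rejected": ["<response generated by Qwen3-0.6B>"],
})

args = DPOConfig(output_dir="smollm3-apo", loss_type="apo_zero", beta=0.1)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=train_dataset, processing_class=tokenizer)
trainer.train()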

While downstream evaluations showed improvements across mathematics, science, instruction following, coding, chat, and multilingual tasks, we observed performance degradation on long context benchmarks like RULER. We traced this degradation back to the reasoning mid-training stage, where the focus on reasoning capabilities impacted long context performance. Moreover, the APO training data was limited to 24k tokens, since the vast majority of our reasoning dataset fell below this length.

To mitigate this performance drop, we explored model merging as a solution.



Model Merging

Model merging is a popular and powerful technique for combining the strengths of different models without the computational overhead of ensembling or the need for additional training. We used the MergeKit library to perform the merge, as it implements several merging methods, including linear and non-linear merging.

Our merging recipe consists of two steps:

  1. Take each APO checkpoint and create a model “soup”.
  2. Combine the model soup with a mid-training checkpoint that has strong long-context performance. A linear merge with weights of 0.9 and 0.1 for the APO model soup and the mid-training checkpoint, respectively, achieved the best performance (see the sketch after this list). We were able to recover the base model's RULER score on contexts up to 128k tokens.
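
MergeKit drives this from a declarative config; the underlying arithmetic of the linear merge is just a weighted average of parameters, sketched below with our own (hypothetical) variable names:

import torch

def linear_merge(state_dicts: list[dict], weights: list[float]) -> dict:
    # Weighted average of parameter tensors, key by key.
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# 0.9 * APO model soup + 0.1 * long-context mid-training checkpoint
# merged_sd = linear_merge([apo_soup_sd, midtrain_sd], weights=[0.9, 0.1])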

The resulting model is the checkpoint we're releasing today. It maintains performance across a wide range of tasks. So let's turn to the evaluation results, both for this model and for the base model.

We evaluate the base model as well as the instruct model in both reasoning and non-reasoning modes. Let's first cover the base model's performance!



Base model

The plot below shows SmolLM3’s win rate across 12 popular benchmarks evaluating knowledge, reasoning, math, and coding capabilities. SmolLM3 consistently outperforms other 3B models and achieves competitive performance with larger 4B models including Qwen3-4B and Gemma3-4B.

Evaluation benchmarks used for the win rate: HellaSwag, ARC, Winogrande, CommonsenseQA, MMLU-CF, MMLU Pro CF, PIQA, OpenBookQA, GSM8K, MATH, HumanEval+, MBPP+

SmolLM3 achieves first or second place on knowledge and reasoning benchmarks (HellaSwag, ARC, BoolQ), demonstrating strong performance in these core capabilities. Math and coding performance is competitive within the 3B class. Long-context performance on RULER 64k shows the model can handle extended sequences effectively.

The model demonstrates strong multilingual performance across five major European languages when evaluated on multilingual benchmarks including Global MMLU, MLMM HellaSwag, Flores-200, and Belebele, which test knowledge, commonsense reasoning, text understanding, and translation. This shows SmolLM3 maintains consistent performance beyond English.

In summary, the base model shows very strong performance across many domains. Let's see how this translates to the instruct model's performance.



Dual Instruct / Reasoning model

Since SmolLM3 has both an instruct and a reasoning mode, we need to evaluate the model in both modes and compare it to models with the same capabilities.



No extended thinking evaluation

We evaluate SmolLM3 against other 3B non-reasoning models and compare it to Qwen3 reasoning models in no thinking mode across multiple benchmarks. As shown in the performance chart, SmolLM3 outperforms other 3B non-reasoning models, including Llama3.2 3B Instruct and Qwen2.5 3B Instruct, and sits at an efficiency sweet spot among reasoning models, significantly outperforming Qwen3 1.7B while approaching the performance of the 4B model at a lower computational cost.

So the instruct model sits right on the Pareto front of performance and cost. Let's see how the reasoning model does!



Extended thinking evaluation

When evaluating SmolLM3's reasoning capabilities with extended thinking enabled, the model shows substantial improvements across most benchmarks compared to its non-reasoning counterpart. We observe notable gains on difficult tasks like AIME 2025 (36.7% vs 9.3%), competitive programming on LiveCodeBench (30.0% vs 15.2%), and graduate-level reasoning on GPQA Diamond (41.7% vs 35.7%).

While Qwen3 4B generally achieves the best scores across both thinking and non-thinking modes, SmolLM3 demonstrates competitive performance within the 3B parameter class, particularly excelling at mathematical reasoning and complex problem-solving tasks. The model's dual-mode capability allows users to choose between faster inference without reasoning and more thorough analysis with extended thinking.

So the last question is: how can you use the model?

The modeling code for SmolLM3 is available in transformers v4.53.0, so make sure to upgrade your transformers version. You can also load the model with the latest vLLM, which uses transformers as a backend.

pip install -U transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"
device = "cuda" 


tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
).to(device)


prompt = "Give me a temporary explanation of gravity in easy terms."
messages_think = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages_think,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)


generated_ids = model.generate(**model_inputs, max_new_tokens=32768)


output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))

We recommend setting temperature=0.6 and top_p=0.95 in the sampling parameters.
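
Applied to the generate call from the snippet above, that looks like:

generated_ids = model.generate(
    **model_inputs,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    max_new_tokens=32768,
)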



Enabling and Disabling Extended Thinking Mode

We enable extended thinking by default, so the example above generates output with a reasoning trace. To choose between the two modes, you can provide the /think and /no_think flags in the system prompt, as shown in the snippet below for disabling extended thinking. The code for generating a response with extended thinking would be the same, except that the system prompt should contain /think instead of /no_think.

prompt = "Give me a temporary explanation of gravity in easy terms."
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)



Agentic Usage

SmolLM3 supports tool calling! Just pass your list of tools under the argument xml_tools (for standard tool calling) or python_tools (for calling tools like Python functions in a code snippet).

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

tools = [
    {
        "name": "get_weather",
        "description": "Get the weather in a city",
        "parameters": {"type": "object", "properties": {"city": {"type": "string", "description": "The city to get the weather for"}}}}
]

messages = [
    {
        "role": "user",
        "content": "Hello! How is the weather today in Copenhagen?"
    }
]

inputs = tokenizer.apply_chat_template(
    messages,
    enable_thinking=False, 
    xml_tools=tools,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt"
)

outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))

We release SmolLM3, a small, long-context, multilingual reasoner with up to 128k context. Along with the model checkpoint, we release the full training recipe, including pretraining, mid-training, post-training, and synthetic data generation, as well as the datasets (coming soon). We hope this model proves useful to the community and that the recipe will allow other groups to improve upon it.

@misc{bakouch2025smollm3,
  title={{SmolLM3: smol, multilingual, long-context reasoner}},
  author={Bakouch, Elie and Ben Allal, Loubna and Lozhkov, Anton and Tazi, Nouamane and Tunstall, Lewis and Patiño, Carlos Miguel and Beeching, Edward and Roucher, Aymeric and Reedi, Aksel Joonas and Gallouédec, Quentin and Rasul, Kashif and Habib, Nathan and Fourrier, Clémentine and Kydlicek, Hynek and Penedo, Guilherme and Larcher, Hugo and Morlon, Mathieu and Srivastav, Vaibhav and Lochner, Joshua and Nguyen, Xuan-Son and Raffel, Colin and von Werra, Leandro and Wolf, Thomas},
  year={2025},
  howpublished={\url{https://huggingface.co/blog/smollm3}}
}


