Introduction
Also check out our official blog post.
Today, we are proud to introduce the Falcon-H1 series, a family of six open-source models ranging from 0.5B to 34B parameters, each available in base and instruction-tuned variants. At the core of these models lies a hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency. This architectural innovation is further enhanced by fundamental advancements in training dynamics and data utilization, enabling Falcon-H1 models to deliver uncompromised performance that rivals the strongest Transformer-based models across all covered size tiers.
In this release, we feature six open-weight models: 0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B, together with their instruct versions. All our open-source models are released under a permissive license based on Apache 2.0.
Key Features of Falcon-H1
- Hybrid Architecture (Attention + SSM): We mix attention and Mamba-2 heads in parallel inside our hybrid mixer block. Importantly, the number of attention and Mamba heads can be adjusted independently, allowing for an optimal attention/SSM ratio. This hybrid design enables faster inference, lower memory usage, and strong generalization across tasks.
- Wide Range of Model Sizes: Available in six scales (0.5B, 1.5B, 1.5B-Deep, 3B, 7B, and 34B), each with base and instruction-tuned variants, suitable for everything from edge devices to large-scale deployments.
- Multilingual by Design: Supports 18 languages natively, including Arabic (ar), Czech (cs), German (de), English (en), Spanish (es), French (fr), Hindi (hi), Italian (it), Japanese (ja), Korean (ko), Dutch (nl), Polish (pl), Portuguese (pt), Romanian (ro), Russian (ru), Swedish (sv), Urdu (ur), and Chinese (zh), with scalability to 100+ languages thanks to our multilingual tokenizer trained on diverse language datasets.
- Compact Models, Big Performance: Falcon-H1-0.5B delivers performance on par with typical 7B models from 2024, while Falcon-H1-1.5B-Deep rivals many of the current leading 7B-10B models. Each Falcon-H1 model is designed to match or exceed the performance of models at least twice its size, making them ideal for low-resource and edge deployments without compromising on capability.
- 256K Context Support: Falcon-H1 models support up to 256K context length, enabling applications in long-document processing, multi-turn dialogue, and long-range reasoning.
- Exceptional STEM Capabilities: Falcon-H1 models deliver strong performance in math and science domains, thanks to a strong focus on high-quality STEM data during training.
- Robust Training Strategy: Uses a high-efficiency data strategy and a customized Maximal Update Parametrization (μP) to ensure smooth and scalable training across model sizes.
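All Falcon-H1 checkpoints are intended to work with the standard Hugging Face transformers API. Below is a minimal sketch of loading an instruct variant and generating a reply; the repository id is our assumption here, so please check the Falcon-H1 collection on the Hub for the exact names.

```python
# Minimal sketch (assumed repo id; requires transformers, torch, and accelerate).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/Falcon-H1-1.5B-Deep-Instruct"  # assumption: check the Hub for exact names

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Explain state space models in two sentences."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```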
Main Principles Behind Building Falcon-H1
When embarking on the development of the Falcon-H1 series, we chose to fundamentally rethink the training approach. While the field of LLM development has converged on many established practices that reliably produce strong models, these conventions were primarily validated on classical Transformer architectures. The shift from pure attention mechanisms to a hybrid attention-SSM design represents a significant architectural change, making it uncertain whether these standard practices would remain optimal.
Given this uncertainty, we conducted an extensive experimentation phase, systematically revisiting nearly every aspect of model design and training methodology before launching our final training runs. While we will provide comprehensive details in our upcoming technical report, we would like to share here the key insights that shaped the Falcon-H1 models.
Architecture
Hybrid attention-SSM models have a larger configuration space of architecture-defining parameters. Our goal was to probe each of these configuration parameters to establish its impact on model performance and efficiency. As a result, we identified regions of the configuration space with increased performance at only a mild efficiency cost. We can roughly divide the hybrid model configuration space into the following four blocks:
- SSM-specific parameters. Our SSM layer is based on the Mamba-2 architecture, which is organized into groups of heads, similar to attention in modern Transformer models. We found that deviating from the number of groups or heads typically used in the literature does not improve performance but can degrade efficiency. In contrast, using a larger memory size, an SSM-specific variable with no attention analog, gives a boost in performance at only a mild efficiency cost.
- Attention-specific parameters. We employ a standard full attention layer. However, we found that using an unusually large value of the rotary positional embedding (RoPE) base frequency significantly improves model performance. Our hypothesis is that, compared to pure Transformers, such large values become possible in hybrid models because part of the positional information is natively processed by the SSM side of the model.
- Combining Mamba and attention. There are many ways to combine attention and SSM in a single model, with a sequential or parallel arrangement being the main design choice. We converged on the parallel approach shown in the diagram above. The key feature of our parallel hybrid design is the ability to adjust the ratio of attention to SSM heads, and we found that a relatively small fraction of attention heads is sufficient for good performance. A simplified, schematic sketch of such a parallel mixer block follows this list.
- General parameters. In our experiments, increasing model depth had the largest impact on performance, though at an efficiency cost. This makes the choice of model depth a delicate tradeoff that depends on the specific use case. Falcon-H1-1.5B-Deep is motivated by this tradeoff and targets usage scenarios requiring maximal performance at a small parameter count.
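To make the parallel design concrete, here is a highly simplified, hypothetical PyTorch sketch of a parallel hybrid mixer block. This is not the Falcon-H1 implementation: the Mamba-2 branch is replaced by a causal depthwise convolution as a stand-in, and the point is only that the attention and SSM branches read the same normalized input, can be sized independently, and feed the same residual stream.

```python
import torch
import torch.nn as nn

class ParallelHybridMixerSketch(nn.Module):
    """Illustrative parallel hybrid block (NOT the Falcon-H1 code).

    Attention and an SSM-like branch process the same normalized input in
    parallel, and their outputs are added to the residual stream. The real
    model uses Mamba-2 heads; a causal depthwise convolution stands in here.
    """

    def __init__(self, d_model: int, n_attn_heads: int, d_ssm: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Attention branch: the number of heads is tunable independently of the SSM branch.
        self.attn = nn.MultiheadAttention(d_model, n_attn_heads, batch_first=True)
        # Placeholder "SSM" branch: project up, mix causally along time, project back.
        self.ssm_in = nn.Linear(d_model, d_ssm)
        self.ssm_mix = nn.Conv1d(d_ssm, d_ssm, kernel_size=4, padding=3, groups=d_ssm)
        self.ssm_out = nn.Linear(d_ssm, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        t = h.size(1)
        # Boolean mask: True positions are blocked, giving causal attention.
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=h.device), diagonal=1)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        s = self.ssm_in(h).transpose(1, 2)      # (batch, d_ssm, time)
        s = self.ssm_mix(s)[..., :t]            # trim the right overhang to stay causal
        ssm_out = self.ssm_out(s.transpose(1, 2))
        return x + attn_out + ssm_out           # both branches share one residual connection

block = ParallelHybridMixerSketch(d_model=256, n_attn_heads=4, d_ssm=512)
print(block(torch.randn(2, 16, 256)).shape)     # torch.Size([2, 16, 256])
```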
Data strategy
The capabilities of language models are known to come mainly from the training data, and that remains true for the Falcon-H1 series. Beyond the raw data prepared for the model, how and when this data is shown during training is crucial. One well-known data strategy is curriculum learning, where simpler data is shown at the beginning of training while samples requiring more advanced reasoning are left for the end. Surprisingly, the completely opposite strategy worked best for us. Showing even the most complicated data, such as an advanced math problem or a long-context sample, from the very start of training seems to give the model more time to learn the features needed to handle the corresponding complex tasks.
Another key aspect is the scarcity of high-quality data. A common concern when training large models is brute-force memorization of the data rather than genuine understanding of it. To minimize the risk of such memorization, a typical practice is not to reuse data samples during training, or to do so at most a few times for the highest-quality samples. A by-product of this strategy is a data mixture dominated by web samples, whose volume is disproportionately large compared to high-quality sources. We found that the memorization effect can be somewhat overestimated, and that carefully estimating the model's memorization window allows high-quality samples to be reused more often without any harm to the model's generalization ability. A toy illustration of such a mixture is sketched below.
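As a purely hypothetical illustration (the source names, weights, and epoch counts below are made up and are not the Falcon-H1 mixture), the idea can be expressed as a mixture specification in which hard and long-context sources are present from step zero and scarce high-quality sources are allowed several passes:

```python
# Hypothetical mixture sketch: all names and numbers are illustrative only.
mixture = {
    "web":           {"weight": 0.55, "epochs": 1},  # abundant, seen roughly once
    "code":          {"weight": 0.15, "epochs": 2},
    "advanced_math": {"weight": 0.15, "epochs": 4},  # scarce, reused within the memorization window
    "long_context":  {"weight": 0.15, "epochs": 3},  # present from the very first training step
}
assert abs(sum(source["weight"] for source in mixture.values()) - 1.0) < 1e-9
```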
Customized maximal update parametrization (μP)
Classical μP is a technique heavily rooted in the theory of neural networks, but with a clear practical application: if one finds optimal training hyperparameters at a single base model size, they can be effortlessly transferred to other, typically larger, model sizes using μP scaling rules. We employed μP hyperparameter transfer for the whole Falcon-H1 series, greatly reducing experimentation time and making it possible to train six models in parallel.
On top of that, we took a further step into the inner workings of μP to boost model performance. In a nutshell, each component of the model “wants” to train at its own intensity, and that intensity depends on the size of the component. μP scaling rules account for this dependence through so-called “μP multipliers” to enable optimal hyperparameter transfer. However, classical μP uses trivial multipliers of 1 at the base model size, which corresponds to the assumption that the intensities of all components are already optimal at the base size. We discard this assumption and tune the multipliers at the base model size. Specifically, we divided the model parameters into 35 fine-grained groups and performed a joint optimization of the corresponding 35 multipliers.
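As a rough, hypothetical sketch of what this means in practice (group names, widths, and values below are made up and are not the Falcon-H1 settings): classical μP transfers hyperparameters tuned on a small proxy model to a larger width while keeping all forward multipliers at 1, whereas here the per-group multipliers are themselves tuned at the base size and then combined with the usual μP scaling rules.

```python
# Toy muP-style transfer sketch; all names and numbers are hypothetical.
base_width, target_width = 1024, 4096
width_ratio = target_width / base_width

# Under muP with Adam, learning rates of hidden (matrix-like) weights scale as 1/width,
# so an LR tuned on the narrow proxy transfers to the wide model by division.
base_hidden_lr = 3e-3
target_hidden_lr = base_hidden_lr / width_ratio

# Classical muP keeps every forward multiplier at 1.0 at the base size;
# here they are instead tuned per parameter group on the base model
# (Falcon-H1 uses 35 such groups; three made-up ones are shown).
tuned_multipliers = {"attention_out": 0.9, "ssm_out": 1.2, "mlp_out": 1.05}

print(f"hidden-weight LR at width {target_width}: {target_hidden_lr:.2e}")
print("tuned per-group multipliers:", tuned_multipliers)
```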
Training dynamics
One of our first steps in working on the Falcon-H1 series was diagnosing and removing loss spikes, which are known to be a major issue for SSM-based models. The solution that worked best for us is placing dampening μP multipliers at a specific location inside the SSM block. Besides making the final model training smooth, removing the spikes was crucial for getting clean signals in the subsequent experiments.
We have observed that many aspects of the training dynamics are linked together under the common theme of noise interpretation and control. This includes learning rate and batch size schedules, the scaling of the learning rate with batch size, and the behavior of parameter norms. In particular, we found parameter norms to be determined mostly by the training hyperparameters rather than by the model fitting the data. To take this into account, we included weight decay, a hyperparameter that primarily controls parameter norms, into both the training schedule and the μP multipliers; a toy illustration is sketched below.
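For intuition only, the sketch below shows one simple way a weight-decay value could follow the same schedule shape as the learning rate; the actual Falcon-H1 schedules and values are not reproduced here.

```python
import math

# Toy schedule sketch: weight decay follows the same cosine shape as the learning
# rate so that parameter norms stay under explicit control (values are illustrative).
def cosine_schedule(step: int, total: int, start: float, end: float) -> float:
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * step / total))

total_steps = 100_000
for step in (0, 50_000, 100_000):
    lr = cosine_schedule(step, total_steps, start=3e-3, end=3e-4)
    wd = cosine_schedule(step, total_steps, start=0.10, end=0.01)
    print(f"step {step:>7}: lr={lr:.2e}  weight_decay={wd:.3f}")
```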
Performance
Instruct Models
The current Falcon-H1 models were trained without reasoning-specific fine-tuning, yet they already exhibit strong general instruction-following capabilities. To highlight their performance, we present a detailed comparison of Falcon-H1-34B-Instruct against other top-performing Transformer models of comparable or larger scale, including Qwen3-32B (non-thinking mode), Qwen2.5-72B, Qwen2.5-32B, Gemma3-27B, Llama-4-Scout-17B-16E (109B), and Llama-3.3-70B. For full evaluation settings and methodology, please refer to the Falcon-H1 GitHub page.
One of the standout features of the Falcon-H1 series is the strong performance of its compact models. Below, we compare 1.5B-scale instruct models. Falcon-H1-1.5B-Deep-Instruct clearly outperforms leading models in its class, such as Qwen3-1.7B-Instruct. Even more notably, it performs on par with, or better than, many 7B models, including Falcon3-7B-Instruct and Qwen2.5-7B-Instruct.
🔎 Note: Falcon-H1-1.5B-Deep and Falcon-H1-1.5B were trained with identical settings; the only difference lies in their architectural depth and width.
Multilingual capabilities
To give a picture of Falcon-H1's performance across languages, we report the average of the Hellaswag and MMLU scores for 30B-scale models on a set of selected languages, including Arabic, German, Spanish, French, Hindi, Italian, Dutch, Portuguese, Romanian, Russian, and Swedish. Falcon-H1 also demonstrates on-par performance in the other supported languages.
Long Context Benchmarks
One of the standout features of Falcon-H1 is its ability to handle long-context inputs, an area where State Space Models (SSMs) offer significant advantages in terms of memory efficiency and computational cost.
To demonstrate these capabilities, we evaluate Falcon-H1-34B-Instruct against Qwen2.5-72B-Instruct across a set of long-context benchmarks. We focus on three core task categories drawn from the HELMET benchmark suite:
- Retrieval-Augmented Generation (RAG): Natural Questions, TriviaQA, PopQA, HotpotQA
- Recall tasks: JSON KV, RULER MK Needle, RULER MK UUID, RULER MV
- Long-document QA tasks: ∞BENCH QA, ∞BENCH MC
These evaluations highlight Falcon-H1's strength in scaling to longer sequences while maintaining high performance and efficiency.
In addition, we conducted a comprehensive evaluation of the Falcon-H1 series alongside leading Transformer-based models across 23 benchmarks covering multiple domains and model scales. You can explore the interactive results in the official blog post: simply select the benchmarks most relevant to your use case to view the corresponding aggregated performance scores (below is a screenshot of the interactive plot from the official blog post).
Base Models
We provide a detailed comparison of Falcon-H1-34B-Base with other leading base models at the same or larger scale, including Qwen2.5-72B, Qwen2.5-32B, Llama-4-Scout-17B-16E (109B), and Gemma3-27B.
🔎 Note: Qwen3-32B does not currently offer a base model checkpoint.
Below, we compare 1.5B-scale base models. Falcon-H1-1.5B-Deep-Base clearly outperforms leading models in its class, such as Qwen3-1.7B-Base. Notably, it performs on par with Falcon3-7B, and even exceeds it on math and reasoning tasks, making it an excellent foundation for building small-scale reasoning-focused models.
For the base models, we also provide an interactive plot in our official blog post showcasing their performance across 14 benchmarks spanning multiple domains and model scales (below is a screenshot of the interactive plot from the official blog post).
Model Efficiency
We compare input (prefill) and output (generation) throughput between Falcon-H1 and Qwen2.5-32B in the plots below. While Transformers are slightly faster at shorter context lengths, our hybrid model becomes significantly more efficient as the context grows, achieving up to a 4× speedup in input throughput and 8× in output throughput at longer sequence lengths. Benchmarks were run using our Falcon-H1 vLLM implementation and the official vLLM implementation of Qwen2.5-32B.
This performance gain highlights the scalability of the Falcon-H1 architecture. We attribute the throughput gap at small context lengths to the more mature attention-mechanism optimizations in current inference pipelines, compared to present-day State Space Model (SSM) implementations.
⚙️ We invite the community to contribute to further optimizing SSM implementations, a promising direction for advancing the next generation of efficient LLMs.
🔎 Note: Input throughput measures how fast models process tokens when reading/encoding text. Output throughput measures how fast they generate new tokens. The ratio line shows the Falcon-H1/Qwen2.5 performance comparison.
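To try the model in an inference engine with SSM support, a minimal vLLM usage sketch might look like the following; the repository id, context length, and parallelism settings are assumptions on our part, not the exact benchmark configuration.

```python
# Minimal vLLM sketch (assumed repo id and settings; not the benchmark harness).
from vllm import LLM, SamplingParams

llm = LLM(
    model="tiiuae/Falcon-H1-34B-Instruct",  # assumption: check the Hub for exact names
    tensor_parallel_size=4,                 # a 34B model typically needs several GPUs
    max_model_len=262144,                   # up to the 256K context window
)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["Summarize the advantages of hybrid attention-SSM models."], params)
print(outputs[0].outputs[0].text)
```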
Open Source Commitment
In line with our mission to foster AI accessibility and collaboration, Falcon-H1 is released under the Falcon LLM license. We hope the AI community finds these models valuable for research, application development, and further experimentation. Falcon-H1 is a continuation of our efforts to create more capable and efficient foundation models. We welcome feedback and collaboration from the community as we continue to refine and advance the capabilities of these models.
Useful Links
Citation
@misc{tiifalconh1,
    title = {Falcon-H1: A Family of Hybrid-Head Language Models Redefining Efficiency and Performance},
    url = {https://falcon-lm.github.io/blog/falcon-h1},
    author = {Falcon-LLM Team},
    month = {May},
    year = {2025}
}













