Falcon Mamba is a new model by the Technology Innovation Institute (TII) in Abu Dhabi, released under the TII Falcon Mamba 7B License 1.0. The model is open access and available within the Hugging Face ecosystem here for anyone to use for their research or application purposes.
In this blog, we will go through the design decisions behind the model, how the model is competitive with respect to other existing SoTA models, and how to use it within the Hugging Face ecosystem.
First general purpose large-scale pure Mamba model
Transformers, based on the attention mechanism, are the dominant architecture used in all the strongest large language models today. Yet, the attention mechanism is fundamentally limited in processing large sequences due to the increase in compute and memory costs with sequence length. Various alternative architectures, in particular State Space Language Models (SSLMs), tried to address the sequence scaling limitation but fell behind SoTA transformers in performance.
With Falcon Mamba, we demonstrate that the sequence scaling limitation can indeed be overcome without loss in performance. Falcon Mamba is based on the original Mamba architecture, proposed in Mamba: Linear-Time Sequence Modeling with Selective State Spaces, with the addition of extra RMS normalization layers to ensure stable training at scale. This choice of architecture ensures that Falcon Mamba:
- can process sequences of arbitrary length without any increase in memory storage, in particular, fitting on a single A10 24GB GPU.
- takes a constant amount of time to generate a new token, regardless of the size of the context (see this section).
Model training
Falcon Mamba was trained with ~5500GT of data, mainly composed of RefinedWeb data with the addition of high-quality technical data and code data from public sources. We used a constant learning rate for most of the training, followed by a relatively short learning rate decay stage. In this last stage, we also added a small portion of high-quality curated data to further enhance model performance.
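As an illustration, a constant-then-decay schedule of this kind can be sketched with PyTorch's LambdaLR. The step counts and learning rate below are placeholders, not the actual Falcon Mamba hyperparameters:

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Illustrative placeholders -- not the actual Falcon Mamba training setup.
total_steps = 10_000
decay_start = 9_000  # constant LR for most of training, then a short decay stage

def lr_lambda(step: int) -> float:
    if step < decay_start:
        return 1.0  # constant stage
    # linear decay to zero over the final stage
    return max(0.0, (total_steps - step) / (total_steps - decay_start))

params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=3e-4)
scheduler = LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    optimizer.step()
    scheduler.step()
```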
Evaluations
We evaluate our model on all benchmarks of the new leaderboard version using the lm-evaluation-harness package, and then normalize the evaluation results with Hugging Face score normalization.
| Model name | IFEval | BBH | MATH Lvl 5 | GPQA | MUSR | MMLU-PRO | Average |
|---|---|---|---|---|---|---|---|
| **Pure SSM models** | | | | | | | |
| Falcon Mamba-7B | 33.36 | 19.88 | 3.63 | 8.05 | 10.86 | 14.47 | 15.04 |
| TRI-ML/mamba-7b-rw* | 22.46 | 6.71 | 0.45 | 1.12 | 5.51 | 1.69 | 6.25 |
| **Hybrid SSM-attention models** | | | | | | | |
| recurrentgemma-9b | 30.76 | 14.80 | 4.83 | 4.70 | 6.60 | 17.88 | 13.20 |
| Zyphra/Zamba-7B-v1* | 24.06 | 21.12 | 3.32 | 3.03 | 7.74 | 16.02 | 12.55 |
| **Transformer models** | | | | | | | |
| Falcon2-11B | 32.61 | 21.94 | 2.34 | 2.80 | 7.53 | 15.44 | 13.78 |
| Meta-Llama-3-8B | 14.55 | 24.50 | 3.25 | 7.38 | 6.24 | 24.55 | 13.41 |
| Meta-Llama-3.1-8B | 12.70 | 25.29 | 4.61 | 6.15 | 8.98 | 24.95 | 13.78 |
| Mistral-7B-v0.1 | 23.86 | 22.02 | 2.49 | 5.59 | 10.68 | 22.36 | 14.50 |
| Mistral-Nemo-Base-2407 (12B) | 16.83 | 29.37 | 4.98 | 5.82 | 6.52 | 27.46 | 15.08 |
| gemma-7B | 26.59 | 21.12 | 6.42 | 4.92 | 10.98 | 21.64 | 15.28 |
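For reference, here is a minimal sketch of how such an evaluation can be launched with the lm-evaluation-harness Python API. The task names assume the leaderboard task group shipped in recent lm-eval releases; check your installed version for the exact identifiers:

```python
import lm_eval

# Hedged sketch: task names assume the "leaderboard" task group available
# in recent lm-evaluation-harness releases; verify against your version.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/falcon-mamba-7b,dtype=bfloat16",
    tasks=["leaderboard_ifeval", "leaderboard_bbh", "leaderboard_mmlu_pro"],
    batch_size=8,
)
print(results["results"])
```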
Also, we evaluate our model on the benchmarks of the first version of the LLM Leaderboard using lighteval.
| Model name | ARC | HellaSwag | MMLU | Winogrande | TruthfulQA | GSM8K | Average |
|---|---|---|---|---|---|---|---|
| **Pure SSM models** | | | | | | | |
| Falcon Mamba-7B* | 62.03 | 80.82 | 62.11 | 73.64 | 53.42 | 52.54 | 64.09 |
| TRI-ML/mamba-7b-rw* | 51.25 | 80.85 | 33.41 | 71.11 | 32.08 | 4.70 | 45.52 |
| **Hybrid SSM-attention models** | | | | | | | |
| recurrentgemma-9b** | 52.00 | 80.40 | 60.50 | 73.60 | 38.60 | 42.60 | 57.95 |
| Zyphra/Zamba-7B-v1* | 56.14 | 82.23 | 58.11 | 79.87 | 52.88 | 30.78 | 60.00 |
| **Transformer models** | | | | | | | |
| Falcon2-11B | 59.73 | 82.91 | 58.37 | 78.30 | 52.56 | 53.83 | 64.28 |
| Meta-Llama-3-8B | 60.24 | 82.23 | 66.70 | 78.45 | 42.93 | 45.19 | 62.62 |
| Meta-Llama-3.1-8B | 58.53 | 82.13 | 66.43 | 74.35 | 44.29 | 47.92 | 62.28 |
| Mistral-7B-v0.1 | 59.98 | 83.31 | 64.16 | 78.37 | 42.15 | 37.83 | 60.97 |
| gemma-7B | 61.09 | 82.20 | 64.56 | 79.01 | 44.79 | 50.87 | 63.75 |
For the models marked with one star, we evaluated the tasks internally, while for the models marked with two stars, the results were taken from the paper or the model card.
Processing large sequences
Following the theoretical efficiency of SSM models in processing large sequences, we compare the memory usage and generation throughput of Falcon Mamba and popular transformer models using the optimum-benchmark library. For a fair comparison, we rescaled the vocabulary size of all transformer models to match Falcon Mamba, since it has a large effect on the memory requirements of the model.
Before going to the results, let's first discuss the difference between the prompt (prefill) and generated (decode) parts of the sequence. As we will see, the details of prefill are more important for state space models than for transformer models. When a transformer generates the next token, it needs to attend to the keys and values of all previous tokens in the context. This implies linear scaling of both memory requirements and generation time with context length. A state space model attends to and stores only its recurrent state and, therefore, doesn't need additional memory or time to generate long sequences. While this explains the claimed advantage of SSMs over transformers in the decode stage, the prefill stage requires additional effort to fully utilize the SSM architecture.
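To make the scaling difference concrete, here is a toy sketch contrasting the two memory patterns during decoding. This is a plain linear recurrence for illustration only, not Mamba's actual selective-scan computation:

```python
import torch

d_model, d_state = 16, 32

# Transformer-style decoding: the KV cache grows by one entry per token,
# so memory scales linearly with the context length.
kv_cache = []
for t in range(1000):
    k, v = torch.randn(d_model), torch.randn(d_model)
    kv_cache.append((k, v))  # O(t) memory after t tokens

# SSM-style decoding: only a fixed-size recurrent state is carried,
# so memory stays constant regardless of the context length.
A, B = torch.randn(d_state), torch.randn(d_state)
state = torch.zeros(d_model, d_state)
for t in range(1000):
    x = torch.randn(d_model, 1)
    state = state * A + x @ B.unsqueeze(0)  # O(1) memory at every step
```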
A standard approach to prefill is to process the whole prompt in parallel to fully utilize the GPU. This approach is used in the optimum-benchmark library and we will refer to it as parallel prefill. Parallel prefill needs to store in memory the hidden states of each token in the prompt. For transformers, this extra memory is dominated by the memory of the stored KV cache. For SSM models, no caching is required, and the memory for storing hidden states becomes the only component proportional to the prompt length. As a result, the memory requirement will scale with prompt length, and SSM models will lose the ability to process arbitrarily long sequences, similar to transformers.
An alternative to parallel prefill is to process the prompt token by token, which we will refer to as sequential prefill. Similar to sequence parallelism, it can also be done on larger chunks of the prompt instead of individual tokens for better GPU usage. While sequential prefill makes little sense for transformers, it brings back the possibility of processing arbitrarily long prompts with SSM models.
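A hedged sketch of sequential (chunked) prefill using the same toy recurrence as above: the prompt is consumed in chunks, and only the fixed-size state survives between chunks, so peak memory is bounded by the chunk size rather than the prompt length:

```python
import torch

d_model, d_state, chunk_size = 16, 32, 256
prompt = torch.randn(100_000, d_model)  # an arbitrarily long prompt
A, B = torch.randn(d_state), torch.randn(d_state)

state = torch.zeros(d_model, d_state)
for start in range(0, prompt.shape[0], chunk_size):
    chunk = prompt[start:start + chunk_size]  # (chunk_size, d_model)
    # Within a chunk, tokens could be scanned in parallel on the GPU;
    # here we loop for clarity. Only `state` is kept across chunks.
    for x in chunk:
        state = state * A + x.unsqueeze(1) @ B.unsqueeze(0)
```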
With these remarks in mind, we first test the largest sequence length that can fit on a single 24 GB A10 GPU, with the results shown in the figure below. The batch size is fixed at 1, and we use float32 precision. Even with parallel prefill, Falcon Mamba can fit larger sequences than a transformer, while with sequential prefill it unlocks its full potential and can process arbitrarily long prompts.
Next, we measure the generation throughput in a setting with a prompt of length 1 and up to 130k generated tokens, using batch size 1 and an H100 GPU. The results are reported in the figure below. We observe that Falcon Mamba generates all tokens at constant throughput without any increase in CUDA peak memory. For the transformer model, the peak memory grows and the generation speed slows down as the number of generated tokens grows.
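These measurements were produced with optimum-benchmark; for illustration, a bare-bones version of such a measurement can be sketched with plain transformers and torch (a simplification, not the benchmark harness itself):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Minimal prompt, long generation, as in the setting described above
inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
torch.cuda.reset_peak_memory_stats()
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
elapsed = time.perf_counter() - start

n_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"throughput: {n_tokens / elapsed:.1f} tokens/s")
print(f"peak CUDA memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
```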
How to use it within Hugging Face transformers?
The Falcon Mamba architecture will be available in the next release of the Hugging Face transformers library (>4.45.0). To use the model, make sure to install the latest version of Hugging Face transformers or install the library from source.
Falcon Mamba is compatible with most of the APIs Hugging Face offers that you are familiar with, such as AutoModelForCausalLM or pipeline:
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
As the model is large, it also supports features such as bitsandbytes quantization to run the model under smaller GPU memory constraints, e.g.:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
model_id = "tiiuae/falcon-mamba-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)
inputs = tokenizer("Hello world, today", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
We are also pleased to introduce the instruction-tuned version of Falcon Mamba, which has been fine-tuned with an additional 5 billion tokens of supervised fine-tuning (SFT) data. This extended training enhances the model's ability to perform instruction-following tasks with better precision and effectiveness. You can experience the capabilities of the instruct model through our demo, available here. For the chat template, we use the following format:
<|im_start|>user
prompt<|im_end|>
<|im_start|>assistant
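For instance, here is a hedged sketch of chatting with the instruct model through tokenizer.apply_chat_template, assuming the instruct checkpoint is published as tiiuae/falcon-mamba-7b-instruct (verify the exact id on the Hub):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed instruct checkpoint id; verify on the Hub.
model_id = "tiiuae/falcon-mamba-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# The chat template above is applied automatically from the tokenizer config
messages = [{"role": "user", "content": "Explain state space models in one sentence."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=100, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```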
You can also directly use the 4-bit converted version of both the base model and the instruct model. Make sure you have access to a GPU that is compatible with the bitsandbytes library to run the quantized model.
You can also benefit from faster inference using torch.compile; simply call model = torch.compile(model) once you have loaded the model.
Acknowledgments
The authors of this blog post would like to thank the Hugging Face team for their smooth support and integration within their ecosystem.
The authors would also like to thank Tri Dao and Albert Gu for implementing and open-sourcing the Mamba architecture to the community.