Bamba: Inference-Efficient Hybrid Mamba2 Model


We introduce Bamba-9B, an inference-efficient Hybrid Mamba2 model trained by IBM, Princeton, CMU, and UIUC on completely open data. At inference time, the model demonstrates a 2.5x throughput improvement and 2x latency speedup compared to standard transformers in vLLM. To foster community experimentation, the model is immediately available to use in transformers, vLLM, TRL, and llama.cpp. We also release tuning, training, and extended pretraining recipes with a stateful data loader, and invite the community to further improve this model. Let’s overcome the KV-cache bottleneck together!



Artifacts 📦

  1. Hugging Face Bamba collection
  2. GitHub repo with inference, training, and tuning scripts
  3. Data loader
  4. Quantization
  5. Auto-pilot for cluster monitoring



Motivation 🌟

Transformer models are increasingly used in real-world applications, but they face memory-bandwidth bottlenecks during inference, particularly during per-token decoding in longer context-length models. Techniques like lower precision, layer pruning, and compression can alleviate the issue, but they do not address the root cause, which is the increasing amount of memory required by the KV-cache as the context length grows. Emerging architectures such as Mamba, Griffin, and DeltaNet eliminate this bottleneck by making the KV-cache size constant. The Mamba architecture has gained significant traction in the community in the recent past. For example, Jamba and Samba interleave Mamba layers with transformer layers and explore the resulting hybrid Mamba models. Codestral Mamba, a pure Mamba2 model, demonstrates state-of-the-art (SOTA) results on coding tasks, while NVIDIA’s hybrid Mamba2 model achieves competitive performance across long-context and traditional LLM benchmarks. Recent releases, like Falcon Mamba and Falcon 3 Mamba, achieved SOTA rankings on Hugging Face leaderboards at the time of their releases.

We introduce Bamba-9B, a hybrid Mamba2 model trained on 2.2T tokens, further validating these emerging architectures. This collaboration between IBM, Princeton, CMU, and UIUC provides the full training lineage, model checkpoints, and pretraining code to support reproducibility and experimentation. The training dataset of the released checkpoints does not contain any benchmark-aligned instruction data (except FLAN) to preserve extended pretraining and fine-tuning flexibility. Our aim is to showcase the hybrid Mamba2 architecture’s potential by demonstrating strong performance at the lower-mid model size scale (7B-10B) and to provide the community with checkpoints that are fully reproducible and trained on open datasets.

To foster community experimentation, we are also releasing a distributed stateless shuffle data loader and enabling the hybrid Mamba2 architecture in open-source libraries like transformers, TRL, vLLM, and llama.cpp. We hope these efforts advance the adoption of Mamba architectures, alleviate KV-cache bottlenecks, and close the gap with SOTA open-source models.



Use in transformers 🤗

To use Bamba with transformers, you can use the familiar AutoModel classes and the generate API. For more details, please follow the instructions outlined in the Bamba GitHub repo.

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")

message = ["Mamba is a snake with following properties  "]
inputs = tokenizer(message, return_tensors='pt', return_token_type_ids=False)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
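
The snippet above runs on CPU in full precision. For GPU inference, a common variant (a sketch, assuming a CUDA-capable GPU with enough memory for the 9B checkpoint and the accelerate package installed) loads the weights in bfloat16 and places them automatically:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the checkpoint in bfloat16 and let accelerate place it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    "ibm-fms/Bamba-9B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ibm-fms/Bamba-9B")

inputs = tokenizer(
    "Mamba is a snake with following properties ",
    return_tensors="pt",
    return_token_type_ids=False,
).to(model.device)
response = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(response[0], skip_special_tokens=True))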



Evaluations 📊

We divide our evaluations into three parts:

  1. Comparison with SoTA transformer models
  2. Comparison with transformer models with similar token budget
  3. Comparison with other Mamba variants.

Evaluation setup ⚙️ 🖥️:
We rerun all the benchmarks following the setup and scripts here for all models except the NVIDIA Mamba2 Hybrid model. We could not run benchmarking for the NVIDIA Mamba2 Hybrid model because its weights are not available in a Hugging Face transformers compatible format; therefore, we report the numbers from the original paper. For the v2 leaderboard results, we perform normalization and report the normalized results. In all the evaluations, higher is better except where indicated.
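
For reference, the v2 normalization rescales each raw score between the task’s random-guess baseline and the maximum score. The sketch below illustrates the idea; the exact per-task baselines are assumptions to be checked against the leaderboard documentation, so treat this as an approximation of the scheme rather than the leaderboard’s code:

def normalize_v2(raw_score: float, random_baseline: float) -> float:
    """Rescale a raw accuracy so random guessing maps to 0 and a perfect score to 100."""
    return max(0.0, (raw_score - random_baseline) / (1.0 - random_baseline)) * 100.0

# Example: a 4-way multiple-choice task (random baseline 0.25) with 60% raw accuracy.
print(normalize_v2(0.60, 0.25))  # ~46.7 after normalization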



TL;DR Evals

Bamba-9B demonstrates the competitive performance of hybrid Mamba models compared to transformer models. While it has gaps in math benchmarks and MMLU scores (MMLU, GSM8K, MMLU-PRO, MATH Lvl 5), excluding these benchmarks places its average performance nearly on par with Meta Llama 3.1 8B (44.68 for Llama and 45.53 for Bamba), a model trained on 7x more data. These gaps can be addressed by (a) extending pretraining with more tokens (MMLU scores steadily improved during training), and (b) incorporating high-quality math data in the pretraining/annealing phases. Future plans include using updated datasets like the Olmo2 mix and annealing with benchmark-aligned mixes such as the Dolmino mix.

Bamba-9B’s results also alleviate concerns about the relatively low leaderboard scores of NVIDIA’s hybrid Mamba2 model. The goal of NVIDIA’s study was to compare architectures under identical conditions. Consistent with their findings, Bamba-9B reaffirms that the hybrid Mamba2 architecture offers performance competitive with transformer models while providing up to 5x inference efficiency.



Comparison with SoTA transformer models

We compare Bamba-9B with SoTA transformer models of comparable size (Meta Llama 3.1 8B, IBM Granite v3 8B, Olmo2 7B, and Gemma 2 9B). We observe that while there are obvious benchmark gaps, it is not clear that these gaps point to deficiencies in the Mamba/Mamba2-based models. In fact, a careful analysis shows that the gaps are largely due to the amount of data used for training and the inclusion of benchmark-aligned instruction datasets during the annealing phase. For example, one small-scale run that added metamath improved our GSM8K score from 36.77 to 60.0. We will publish a detailed analysis and our findings in an upcoming paper.

HF OpenLLM v1 leaderboard

HF LLM-V1 + OpenbookQA and PIQA:

| Model | Average | MMLU | ARC-C | GSM8K | Hellaswag | OpenbookQA | Piqa | TruthfulQA | Winogrande |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bamba 9B | 62.31 | 60.77 | 63.23 | 36.77 | 81.8 | 47.6 | 82.26 | 49.21 | 76.87 |
| Meta Llama 3.1 8B | 63.51 | 66.26 | 57.85 | 49.96 | 81.98 | 46.8 | 82.54 | 45.16 | 77.51 |
| Olmo2 7B | 66.17 | 63.96 | 64.51 | 68.01 | 81.93 | 49.2 | 81.39 | 43.32 | 77.03 |
| IBM Granite v3 8B | 67.47 | 65.45 | 63.74 | 62.55 | 83.29 | 47.6 | 83.41 | 52.89 | 80.82 |
| Gemma2 9B | 68.38 | 72.29 | 68.26 | 67.4 | 82.56 | 47.8 | 83.24 | 45.39 | 80.11 |
| Qwen2.5 7B | 70.58 | 75.41 | 63.82 | 83.24 | 80.23 | 48.40 | 81.28 | 56.34 | 75.93 |

HF LLM-V2** :

| Model | Average | MMLU-PRO | BBH | GPQA | IFEval | MATH Lvl 5 | MuSR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bamba 9B | 10.91 | 17.53 | 17.4 | 4.14 | 15.16 | 1.66 | 9.59 |
| Meta Llama 3.1 8B | 14.27 | 25.46 | 25.16 | 8.61 | 12.55 | 5.14 | 8.72 |
| Olmo2 7B | 13.36 | 22.79 | 21.69 | 4.92 | 16.35 | 4.38 | 10.02 |
| IBM Granite v3 8B | 21.14 | 25.83 | 28.02 | 9.06 | 44.79 | 9.82 | 9.32 |
| Gemma2 9B | 21.79 | 34.84 | 34.81 | 11.07 | 21.28 | 13.44 | 15.3 |
| Qwen2.5 7B | 25.13 | 37.62 | 35.62 | 9.96 | 34.77 | 18.35 | 14.6 |

Safety evals

Safety benchmarks are crucial for ensuring AI models generate content that is ethical, inclusive, and non-harmful. We evaluate our model on well-known safety benchmarks such as Toxigen (5-shot, logits; focused on detecting toxic language), BBQ (5-shot, generation), PopQA (5-shot, generation), and CrowS-Pairs (5-shot, logits; measures bias and fairness). We intend to address these gaps in safety through comprehensive SFT and DPO approaches.

*Lower is better



Comparison with transformer models with similar token budget

We pick a few prominent models: Olmo 7B trained on identical data (2024), Meta Llama 2 7B (2023), and IBM Granite 7B (2023), which have been trained to ~2T tokens. Among these transformer models, Olmo 7B has the highest average score across 8 key benchmarks. Bamba-9B outperforms Olmo 7B trained on an identical number of tokens and datasets. Since the Bamba-9B model has 9B parameters, a direct comparison is again difficult, but the main takeaway is that hybrid Mamba2 models are competitive with transformer models trained on a similar token budget.

| Model | Average | MMLU | ARC-C | GSM8K | Hellaswag | OpenbookQA | Piqa | TruthfulQA | Winogrande |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bamba 9B (2.2T) | 62.31 | 60.77 | 63.23 | 36.77 | 81.8 | 47.6 | 82.26 | 49.21 | 76.87 |
| Olmo1.5 7B (2T) | 55.8 | 53.38 | 50.51 | 27.67 | 79.13 | 45.2 | 81.56 | 35.92 | 73.09 |
| Bamba 9B (2T) | 59.11 | 59.05 | 57.25 | 24.03 | 83.66 | 47.6 | 83.62 | 38.26 | 79.4 |
| Meta Llama2 7B (2T) | 53.78 | 46.64 | 52.65 | 13.57 | 78.95 | 45.2 | 80.03 | 38.96 | 74.27 |
| IBM Granite 7B (2T) | 52.07 | 49.02 | 49.91 | 10.84 | 77.0 | 40.8 | 80.14 | 38.7 | 70.17 |



Comparison with Mamba/Mamba2 architecture based language models

Multiple Mamba/Mamba2 architecture based models have begun emerging in the last 6 months (e.g., NVIDIA hybrid Mamba2, Codestral Mamba, Falcon Mamba, and Zamba 7B v1), furthering the performance of these architectures, demonstrating their superior inference performance, and closing the gaps in benchmark results with transformer models. We compare 8 key benchmarks across Bamba-9B, NVIDIA hybrid Mamba2, Zamba, and Falcon Mamba.

Falcon Mamba is a pure Mamba model, Zamba has a shared attention layer for every 6 Mamba layers, and Bamba-9B and NVIDIA hybrid Mamba2 are both hybrid models with full attention layers interspersed with Mamba2 layers. Falcon Mamba was trained to 5.5T tokens and performs the best overall, but there are open questions about how well it will perform on long-context tasks, where Mamba-based architectures truly shine in their inference performance. Zamba was trained on fewer tokens (1T), but with a different hybrid architecture and using benchmark-aligned instruction datasets, including those generated by more powerful language models. Bamba-9B and NVIDIA hybrid Mamba2 are quite similar to each other (differences are summarized in the model architecture section), but Bamba-9B is trained to 2.2T tokens while NVIDIA hybrid Mamba2 is trained to 3.5T tokens.

Note: As of writing this blog, Falcon3 Mamba 7B has landed with even better results than Falcon Mamba. We plan to leverage any learnings from Falcon3 Mamba to improve our next Bamba release.

| Model | Average | MMLU | ARC-C | GSM8K | Hellaswag | OpenbookQA | Piqa | TruthfulQA | Winogrande |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bamba 9B | 62.31 | 60.77 | 63.23 | 36.77 | 81.8 | 47.6 | 82.26 | 49.21 | 76.87 |
| NVIDIA Mamba2 Hybrid 8B* | 58.78 | 53.6 | 47.7 | – | 77.69 | 42.8 | 79.65 | 38.72 | 71.27 |
| Zamba 7B | 64.36 | 57.85 | 55.38 | 61.33 | 82.27 | 46.8 | 82.21 | 49.69 | 79.32 |
| Falcon Mamba 7B | 65.31 | 63.19 | 63.4 | 52.08 | 80.82 | 47.8 | 83.62 | 53.46 | 78.14 |

* Results are taken from the NVIDIA paper.

💡 Note: The differences in training datasets and the number of tokens seen during training make a direct comparison of these models difficult. The key takeaway from this table is that hybrid Mamba2 architectures can deliver competitive results while being nearly as efficient to train as transformer models. Moreover, they can deliver significant improvements (theoretically up to 5x) in inference efficiency despite having full attention layers interspersed with Mamba2 layers. We are continuing to pretrain the Bamba-9B model with the latest datasets and plan to release future checkpoints as the model improves.



Inference efficiency ⚡🏎️

The KV-cache bottleneck is a major challenge for large language models, prompting solutions like quantization, pruning, and novel architectures such as Mamba2, Linear Transformers, and RetNets. Realizing inference efficiencies at scale, even with standard transformers, often requires custom kernels. Bamba-9B builds on the community momentum of kernel availability, with further improvements made through integration with the vLLM model-serving framework.

As part of our ongoing vLLM integration (progress tracked via this PR), we benchmark Bamba-9B against Meta Llama 3.1 8B on an NVIDIA H100 80GB GPU. Using an input size of 1K tokens and output sizes ranging from 2K to 64K across various batch sizes, we measured throughput (tokens/second) and latency. Results show that as batch size and sequence length increase, Bamba-9B achieves up to 2-2.5x better throughput and latency compared to transformer models. These gains enhance real-time applications and GPU utilization, with higher throughput ratios (>1) and lower latency ratios (<1) being beneficial.

Figure 1: Throughput improvements of Bamba
Figure 2: Latency improvements of Bamba
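
For illustration, here is a minimal sketch of this kind of throughput measurement using vLLM's offline API (it assumes a vLLM build that already includes the Bamba support from the PR above; the batch size and token counts are illustrative):

import time

from vllm import LLM, SamplingParams

# Assumes a vLLM build with Bamba support (see the PR referenced above).
llm = LLM(model="ibm-fms/Bamba-9B")
params = SamplingParams(temperature=0.0, max_tokens=2048, ignore_eos=True)

batch_size = 32
prompts = ["Mamba is a snake with following properties "] * batch_size

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} generated tokens/s at batch size {batch_size}")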

Our analysis indicates that on an NVIDIA H100 GPU, we expect a 5x speedup when inference shifts to a memory bottleneck (which typically happens in production settings); see the appendix on Arithmetic Intensity. However, we have not yet realized this speedup in vLLM for three primary reasons:

  1. Chunked pre-fill is not supported for Bamba or other Mamba2-based architectures
  2. Memory allocation assumes a standard transformer KV-cache
  3. Mamba2 kernels are not optimized for H100 GPUs

These issues are being tracked here.



Model architecture

We base our model architecture on the NVIDIA hybrid Mamba2 model with the following changes.

| Parameter | Bamba 9B | NVIDIA hybrid Mamba2 8B |
| --- | --- | --- |
| Num layers | 32 | 29 |
| Num attention layers | 3 | 4 |
| Num Mamba2 layers | 29 | 25 |
| MLP expansion factor | 3.5 | 4 |
| Vocab size | 128k | 256k |
| Non-embedding parameters | 8.8B | 8.6B |
| RoPE | yes | no |
| Gated linear units | yes | no |

We have a total of 8B parameters in the Mamba2 layers, 800M in the full attention layers, and 1B in the embeddings. The hidden state size is 4K, full attention uses GQA with 8 KV heads and 32 heads, the Mamba2 layer head dimension is 64, and the convolution filter size is 4. The most significant changes between the two models are reducing the full attention layers from 4 in the NVIDIA hybrid Mamba2 model to 3 in Bamba-9B and the introduction of RoPE embeddings.
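
These counts can be sanity-checked directly from the released checkpoint; here is a quick sketch (assuming the standard transformers accessors work for the Bamba classes):

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("ibm-fms/Bamba-9B")

total = sum(p.numel() for p in model.parameters())
embedding = model.get_input_embeddings().weight.numel()
# If the LM head is untied from the input embeddings, count it with the embeddings.
head = model.get_output_embeddings()
if head is not None and head.weight is not model.get_input_embeddings().weight:
    embedding += head.weight.numel()

print(f"total: {total / 1e9:.2f}B, embeddings: {embedding / 1e9:.2f}B, "
      f"non-embedding: {(total - embedding) / 1e9:.2f}B")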



Data

Open-source data has come a long way since the inception of The Pile dataset. When we started training this model, the best open-source data was Dolma v1.7, which was shown to be quite performant through the Olmo models and ablations by the Hugging Face data team. Since then, several other higher-quality open-source datasets have been released, such as DCLM, FineWeb-2, and the Olmo2 mix.

We use Dolma v1.7 for the first phase of training, and the chosen data mixes are illustrated below. For the second phase of training, we use Fineweb-edu and Cosmopedia. These datasets are downloaded in their raw form, and we tokenize them using the Ray framework running on an internal large-scale Red Hat OpenShift cluster. We plan to release the tokenized and formatted parquet data soon for reproducibility.

Data mix for pretraining phase one
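
For a flavor of what Ray-based tokenization at scale looks like, here is a simplified sketch; it is not our production pipeline, and the paths and column names are placeholders:

import ray
from transformers import AutoTokenizer

ray.init()

def tokenize_batch(batch, tokenizer_name="ibm-fms/Bamba-9B"):
    # Each Ray worker loads its own tokenizer and tokenizes a batch of documents.
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    batch["input_ids"] = [tokenizer(text)["input_ids"] for text in batch["text"]]
    return batch

# Placeholder paths: raw documents are read, tokenized in parallel, and written back as parquet.
docs = ray.data.read_parquet("s3://my-bucket/dolma-v1_7/")
tokenized = docs.map_batches(tokenize_batch, batch_format="pandas")
tokenized.write_parquet("s3://my-bucket/dolma-v1_7-tokenized/")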



Pre-Training

Pre-training Bamba was done in a phased manner: we performed several ablation experiments at the 1.8B model size and 100B tokens to determine the right learning rates. Based on the promising results from this study, we trained a larger-scale model, a 3B model trained to 2T tokens using the Dolma mix. We also trained a 3B transformer model following the Meta Llama architecture with the same data mix and observed similar or better performance for the Bamba model, which is in line with the conclusion reached by the concurrent NVIDIA study. Finally, we designed a 9B model architecture and trained it on the same mix. PyTorch FSDP was used to train all our models.

Training details:
We used a cosine learning rate schedule, with a peak learning rate of 3e−4, a quadratic warmup over 2000 steps, a decay factor of 0.033, and an ending learning rate of 1e−5 over 2T tokens. We used the AdamW optimizer with β1 of 0.9 and β2 of 0.95. We used a weight decay of 0.1, a sequence length of 4096, and a global batch size of 1.5M tokens/batch. We used 192 A100 GPUs from the IBM Cloud Vela production cluster to train this model over a period of two months. This cluster is managed by Red Hat OpenShift. We experienced 3 job interruptions, which were attributed to an incorrect deployment of jobs and hardware failures. The hardware-related job failures were detected automatically using autopilot.
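
For concreteness, the schedule described above can be sketched as follows (the total step count is approximate, derived from 2T tokens at 1.5M tokens/batch; note that 1e−5 / 3e−4 is the stated decay factor of 0.033):

import math

PEAK_LR = 3e-4
END_LR = 1e-5            # PEAK_LR * 0.033, i.e. the stated decay factor
WARMUP_STEPS = 2000
TOTAL_STEPS = 1_333_000  # roughly 2T tokens / 1.5M tokens per batch

def lr_at(step: int) -> float:
    if step < WARMUP_STEPS:
        # Quadratic warmup from 0 to the peak learning rate.
        return PEAK_LR * (step / WARMUP_STEPS) ** 2
    # Cosine decay from the peak to the ending learning rate.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return END_LR + 0.5 * (PEAK_LR - END_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))  # 0.0, 3e-4, 1e-5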

We also performed a second phase of training with high-quality data from Hugging Face FineWeb-edu and Cosmopedia for an additional 200B tokens. We used a learning rate of 2e−5 and a cosine schedule to anneal the model, which helped improve our scores. We are currently experimenting with additional high-quality data and will release any future checkpoints as part of our commitment to open source.



Data loader

There are several aspects to training a high-quality language model; the data loader is an important one. Over the past 18 months, we have been working on a data loader that satisfies the demands of large-scale distributed training. We open source this data loader to enable others to use it in conjunction with their framework of choice. We have used it in the Bamba model training and integrated it with Torch Titan. To date, we believe this is the only open source data loader that provides such a rich set of features.

The data loader provides the following key features:

  1. Stateful and checkpointable to make sure seamless resumption mid-epoch
  2. Auto-rescales to changing workload and GPU allocations
  3. Data streaming with zero-overhead for shuffling
  4. Asynchronous distributed operation with no peer-to-peer communication
  5. Allows for dynamic data mixing and on-the-fly tokenization
  6. PyTorch native, modular, and extensible

We have battle-tested this data loader over hundreds of training jobs and optimized it over months of continuous operation. The primary code base is located in our repo here, and we have also worked with the Torch Titan team to make it available here. We are working with the Meta PyTorch team to contribute this data loader to core PyTorch.
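
To illustrate the stateful-resumption idea in isolation, here is a toy sketch in plain PyTorch. It is not the data loader's actual API (see the linked repos for real usage), only the concept of checkpointing and restoring a loader's position mid-epoch:

import torch
from torch.utils.data import IterableDataset

class StatefulShardReader(IterableDataset):
    """Toy iterable dataset that can checkpoint and resume its position mid-epoch."""

    def __init__(self, documents):
        self.documents = documents
        self.position = 0  # index of the next document to yield

    def __iter__(self):
        while self.position < len(self.documents):
            doc = self.documents[self.position]
            self.position += 1
            yield doc

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

reader = StatefulShardReader(["doc-a", "doc-b", "doc-c", "doc-d"])
it = iter(reader)
next(it), next(it)                                   # consume two documents
torch.save(reader.state_dict(), "loader_state.pt")   # checkpoint mid-epoch

resumed = StatefulShardReader(["doc-a", "doc-b", "doc-c", "doc-d"])
resumed.load_state_dict(torch.load("loader_state.pt"))
print(list(iter(resumed)))                           # ['doc-c', 'doc-d'] -- resumes where it left off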



Quantization

We recently open sourced a framework for quantization of models. Using this framework, we leverage llm-compressor to quantize the Bamba checkpoints to fp8. We observed minimal loss in accuracy across all the benchmarks of the OpenLLM leaderboards: for Bamba 9B, a negligible drop in the V1 average score (from 62.31 to 61.5) and a drop of 0.9 in the V2 average (from 10.91 to 10.04). These quantized checkpoints are released alongside the bf16 counterparts. This also validates that Bamba models are amenable to quantization, much like SoTA transformer models.
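
Below is a sketch of what the fp8 flow looks like with llm-compressor's standard dynamic FP8 recipe. Module paths and arguments may differ across llm-compressor versions, so treat this as an approximation rather than our exact script:

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers to fp8 with dynamic per-token activation scales,
# leaving the LM head in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(
    model="ibm-fms/Bamba-9B",
    recipe=recipe,
    output_dir="Bamba-9B-FP8",
)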

We are in the process of enabling fp8 inference for this model in vLLM, which will require updating the kernels. The linear layers and full attention will be easy to tackle, but the Mamba2 layers will require updates to the Triton/CUDA kernels to handle fp8.



Context length extension

We are currently exploring various approaches to long context length extension, starting with applying LongRoPE to the full attention layers. Our preliminary findings using PhoneBook retrieval as the task indicate that LongRoPE can be applied to this model. We extend Bamba-9B’s context length by 4x and 8x and compare the context-extended Bamba-9B against three variants of Meta Llama (Llama2, Llama3, and Llama3.1) with training context lengths of 4K, 8K, and 128K, respectively. The results are plotted below.

Figure: PhoneBook retrieval performance of context-extended Bamba-9B compared to the Meta Llama variants

We observe that the context-extended Bamba-9B model performs exceptionally well up to a 16K context length without any tuning, outperforming the original Bamba-9B model, Llama2-7B, and Llama3-8B by a significant margin and obtaining performance comparable to Llama3.1-8B. At a 32K sequence length, Llama3.1 achieves the best result. We plan to release the long-context-extended models when ready.
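
As a simplified illustration of rotary-embedding-based context extension, the sketch below shows single-factor position interpolation applied to the attention layers' RoPE frequencies. LongRoPE itself searches per-dimension rescale factors rather than using one scalar, so this is a conceptual stand-in, not our implementation:

import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for a given head dimension.
    return 1.0 / base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim)

def scaled_frequencies(head_dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    # Position-interpolation-style extension: stretching positions by `scale` (e.g. 4 or 8)
    # is equivalent to dividing every inverse frequency by that factor. LongRoPE generalizes
    # this with searched per-dimension factors instead of one scalar.
    return rope_frequencies(head_dim, base) / scale

inv_freq = scaled_frequencies(head_dim=128, scale=4.0)   # 4x context extension
positions = torch.arange(16384)
angles = torch.outer(positions.float(), inv_freq)        # used to build the cos/sin rotation tables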



Summary 🎯

Bamba-9B, developed collaboratively by IBM, Princeton, CMU, and UIUC, is a strong-performing hybrid Mamba2 model. The model is trained entirely on open datasets, and we are releasing intermediate and final checkpoints. To foster community experimentation, the model is immediately available to use in transformers, vLLM, TRL, and llama.cpp. We also release tuning, training, and extended pretraining recipes with a stateful data loader, and invite the community to further improve this model.

Key Takeaways:

  • Inference Efficiency: Bamba-9B delivers substantial improvements in throughput and latency, enhancing real-time application performance. Benchmarking against Llama 3.1 8B in vLLM demonstrates 2.5x throughput and 2x latency improvements, with more coming soon!

  • Competitive Benchmarks: Bamba-9B performs competitively against state-of-the-art (SoTA) transformer models like Meta Llama 3.1 8B. It matches their average benchmark performance when excluding math and MMLU tasks, with opportunities to close these gaps through extended training and math-focused datasets.

  • Open Collaboration: The model’s development used open data, promoting transparency and reproducibility within the AI community.

For more details and access to the model and associated resources, visit the Bamba GitHub repository.



Future work

There are several directions that we intend to explore to further inference-efficient Mamba2 hybrid architectures:

  1. Continue to improve the models through continued pretraining on additional data; we welcome any feedback from the community so we can collectively create a kick-ass Mamba2 hybrid model.
  2. Perform SFT of the base models using SFT datasets such as Tuluv3, agent instruct, and Anteater, and compare the resulting model to other state-of-the-art instruction-tuned models.
  3. vLLM enablement of the model, working with the community. The issues around chunked prefill and managing memory allocation for this architecture will be key.
  4. Enabling fp8 kernels to make inference even faster.
  5. Training time improvements, applying torch.compile as well as fp8 training, both of which our team has demonstrated on transformer architectures working with Meta.
  6. Long context length extensions up to 1M+



Contributors

  • Data collection and curation: We acknowledge and thank the AllenAI team for creating the high-quality open source dataset Dolma, as well as the Hugging Face data team for making FineWeb-edu and Cosmopedia available. These are tremendous contributions that enabled us to create this model.
  • Data preprocessing: We thank IBM’s internal data preprocessing team, specifically Tuan Hoang Trong, Syed Zawad, Jay Gala, and Ryan Gordon, for helping tokenize the data at scale. The code for tokenization is available here.
  • Model architecture: The model architecture design was jointly done by Princeton, CMU, IBM, and UIUC and involved the following folks: Tri Dao (Princeton), Albert Gu (CMU), Linsong Chu (IBM), Davis Wertheimer (IBM), Minjia Zhang (UIUC), Mudhakar Srivatsa (IBM), and Raghu Ganti (IBM).
  • Model training: Model training was performed primarily by the IBM team using the Mamba2 kernels and layer implementation from Tri Dao and Albert Gu. The following folks from IBM were primarily involved: Linsong Chu, Divya Kumari, Davis Wertheimer, Raghu Ganti, and Dakshi Agrawal.
  • Model tuning: Tuning of the model was enabled and verified in TRL by the IBM team, involving Sukriti Sharma and Anh Uong.
  • Model inference: Model inference in transformers, vLLM, and llama.cpp builds on the kernels written by Princeton and CMU. The IBM team is working with the community to enable it in various ecosystems. The team includes Fabian Lim, Antoni viros i Martin, Adnan Hoque, Jamie Yang, Nelson Nimura Gonzalez, Joshua Rosenkranz, Nick Hill, and Gabe Goodhart.
  • Quantization: Quantization is led by the IBM team – Naigang Wang and Charlie Liu.
  • Evaluations: Evaluations are led by a team at IBM, with long context evaluations performed by UIUC, involving the following folks: Yotam Perlitz, Ofir Arviv, Michal Shmueli-Scheuer (IBM), Haoechen Shen, and Minjia Zhang (UIUC).

Finally, we would like to thank our leadership for their support of this effort – Priya Nagpurkar, David Cox, Sriram Raghavan, Aya Soffer, Ruchir Puri, and Mukesh Khare.

We would also like to thank the community, specifically Pablo Montalvo-Leroux, Aritra Roy Gosthipaty, and Vaibhav Srivastav from Hugging Face and Stas Bekman from Contextual AI, who provided valuable feedback on this blog and the PRs into transformers. Further, we would like to thank Tyler Michael Smith from Neural Magic, who is shepherding the integration with vLLM.

A big shoutout to the Meta PyTorch, AllenAI, and Hugging Face teams for their contributions to open initiatives; PyTorch FSDP allowed us to smoothly train this model, and the data from Dolma and FineWeb/Cosmopedia made this model what it is today!



Appendix: Arithmetic Intensity

Using the following notation:
$b$: batch size
$s$: sequence length
$h$: hidden state size (4096)
$d$: head dimension (128)
$l$: total number of layers (32)
$l_{attn}$: number of attention layers (3)
$l_{ssd}$: number of SSD layers (29)

Both the attention and Bamba models are configured with 4:1 GQA (in the attention layers), an MLP expansion ratio of 3.5, and GLU in the MLP block. The SSD layer in Bamba is configured with a state dimension of $d$, a head dimension of $d/2$, and $4h/d$ heads. The model size excluding the embedding layer is:

| Model Type | Model Size |
| --- | --- |
| attention | $13h^2l$ |
| Bamba | $15.5h^2l$ |

At the prefill stage, the compute and memory (read + write) requirements imposed by the model are:

| Model Type | Compute Prefill | Memory Prefill |
| --- | --- | --- |
| attention | $26bsh^2l + 4bs^2hl$ | $13h^2l + 0.5bshl$ |
| Bamba | $31bsh^2l + 4bs^2hl_{attn} + 4bsdhl_{ssd}$ | $15.5h^2l + 0.5bshl_{attn} + 4bdhl_{ssd}$ |

At the decode stage, the compute and memory (read + write) requirements imposed by the model are:

| Model Type | Compute Decode | Memory Decode |
| --- | --- | --- |
| attention | $26bh^2l + 4bshl$ | $13h^2l + 0.5bshl$ |
| Bamba | $31bh^2l + 4bshl_{attn} + 4bdhl_{ssd}$ | $15.5h^2l + 0.5bshl_{attn} + 4bdhl_{ssd}$ |

A comparison of compute flops during the prefill stage and memory (read + write) sizes during the decode stage between the Bamba and LLaMa models is shown below. Note that ratios less than 1 are beneficial. With inference throughput primarily bottlenecked by the decode stage, the potential speedup for Bamba (over LLaMa) is 5x for large sequence lengths (> 16K). Current measurements (on vLLM) hover at 2.5x, which we expect to improve in the near future.

Figure: Arithmetic intensity comparison between Bamba and LLaMa
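
The decode-stage ratio shown in the figure can be reproduced directly from the memory expressions above; here is a small sketch (the batch size and sequence lengths are illustrative):

h, d, l, l_attn, l_ssd = 4096, 128, 32, 3, 29

def attn_decode_memory(b, s):
    # 13*h^2*l (weight reads) + 0.5*b*s*h*l (KV-cache reads/writes)
    return 13 * h**2 * l + 0.5 * b * s * h * l

def bamba_decode_memory(b, s):
    # 15.5*h^2*l (weight reads) + KV-cache for the 3 attention layers + constant SSD state
    return 15.5 * h**2 * l + 0.5 * b * s * h * l_attn + 4 * b * d * h * l_ssd

b = 32
for s in (4_096, 16_384, 65_536):
    ratio = attn_decode_memory(b, s) / bamba_decode_memory(b, s)
    print(f"seq len {s:>6}: attention/Bamba decode-memory ratio ~ {ratio:.1f}x")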


