Falcon is a new family of state-of-the-art language models created by the Technology Innovation Institute in Abu Dhabi, and released under the Apache 2.0 license. Notably, Falcon-40B is the first "truly open" model with capabilities rivaling many current closed-source models. This is fantastic news for practitioners, enthusiasts, and industry, as it opens the door to many exciting use cases.
Note: a few months after this release, the Falcon team released a larger model of 180 billion parameters.
In this blog, we will be taking a deep dive into the Falcon models: first discussing what makes them unique and then showcasing how easy it is to build on top of them (inference, quantization, finetuning, and more) with tools from the Hugging Face ecosystem.
The Falcon models
The Falcon family is composed of two base models: Falcon-40B and its little brother, Falcon-7B. The 40B parameter model was at the top of the Open LLM Leaderboard at the time of its release, while the 7B model was the best in its weight class.
Note: the performance scores shown in the table below have been updated to account for the new methodology introduced in November 2023, which added new benchmarks. More details in this post.
Falcon-40B requires ~90GB of GPU memory — that's a lot, but still less than LLaMA-65B, which Falcon outperforms. On the other hand, Falcon-7B only needs ~15GB, making inference and finetuning accessible even on consumer hardware. (Later in this blog, we will discuss how we can leverage quantization to make Falcon-40B accessible even on cheaper GPUs!)
TII has also made available instruct versions of the models, Falcon-7B-Instruct and Falcon-40B-Instruct. These experimental variants have been finetuned on instructions and conversational data; they thus lend better to popular assistant-style tasks. If you are just looking to quickly play with the models, they are your best shot. You can also build your own custom instruct version, based on the plethora of datasets built by the community — keep reading for a step-by-step tutorial!
Falcon-7B and Falcon-40B have been trained on 1.5 trillion and 1 trillion tokens respectively, in line with modern models optimising for inference. The key ingredient for the high quality of the Falcon models is their training data, predominantly based (>80%) on RefinedWeb — a novel massive web dataset based on CommonCrawl. Instead of gathering scattered curated sources, TII has focused on scaling and improving the quality of web data, leveraging large-scale deduplication and strict filtering to match the quality of other corpora. The Falcon models still include some curated sources in their training (such as conversational data from Reddit), but significantly less so than has been common for state-of-the-art LLMs like GPT-3 or PaLM. The best part? TII has publicly released a 600 billion token extract of RefinedWeb for the community to use in their own LLMs!
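If you'd like to peek at the data yourself, you can stream the extract with the `datasets` library. This is a small sketch under the assumption that the extract is published on the Hub as `tiiuae/falcon-refinedweb` with a `content` text column:

```python
from datasets import load_dataset

# Stream the RefinedWeb extract so we don't download ~600B tokens up front.
refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

for sample in refinedweb.take(3):
    print(sample["content"][:200])  # "content" is assumed to be the text column
```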
Another interesting feature of the Falcon models is their use of multiquery attention. The vanilla multihead attention scheme has one query, key, and value per head; multiquery instead shares one key and value across all heads.
*Multi-Query Attention shares keys and value embeddings across attention heads. Courtesy Harm de Vries.*
This trick doesn't significantly influence pretraining, but it greatly improves the scalability of inference: indeed, the K,V-cache kept during autoregressive decoding is now significantly smaller (10-100 times, depending on the specifics of the architecture), reducing memory costs and enabling novel optimizations such as statefulness.
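To get an intuition for the cache sizes in the table below, here is a back-of-the-envelope calculation. It is only a sketch; the per-model hyper-parameters are approximate values taken from the public model configs:

```python
def kv_cache_mb(layers, kv_heads, head_dim, seq_len=2048, bytes_per_value=2):
    # Factor 2 accounts for storing both keys and values; bytes_per_value=2 assumes (b)float16.
    return 2 * layers * seq_len * kv_heads * head_dim * bytes_per_value / 1e6

# LLaMA-7B: 32 layers, 32 K,V heads of dimension 128 (vanilla multihead attention)
print(kv_cache_mb(layers=32, kv_heads=32, head_dim=128))  # ~1,100 MB

# Falcon-7B: 32 layers, a single shared K,V head of dimension 64 (multiquery attention)
print(kv_cache_mb(layers=32, kv_heads=1, head_dim=64))    # ~17 MB
```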
| Model | License | Commercial use? | Pretraining length [tokens] | Pretraining compute [PF-days] | Leaderboard score | K,V-cache size for a 2,048 context |
|---|---|---|---|---|---|---|
| StableLM-Alpha-7B | CC-BY-SA-4.0 | ✅ | 1,500B | 700 | 34.37 | 800MB |
| LLaMA-7B | LLaMA license | ❌ | 1,000B | 500 | 45.65 | 1,100MB |
| MPT-7B | Apache 2.0 | ✅ | 1,000B | 500 | 44.28 | 1,100MB |
| Falcon-7B | Apache 2.0 | ✅ | 1,500B | 700 | 44.17 | 20MB |
| LLaMA-33B | LLaMA license | ❌ | 1,500B | 3200 | – | 3,300MB |
| LLaMA-65B | LLaMA license | ❌ | 1,500B | 6300 | 61.19 | 5,400MB |
| Falcon-40B | Apache 2.0 | ✅ | 1,000B | 2800 | 58.07 | 240MB |
Demo
You can easily try the Big Falcon Model (40 billion parameters!) in this Space or in the playground embedded below:
Under the hood, this playground uses Hugging Face's Text Generation Inference, a scalable Rust, Python, and gRPC server for fast & efficient text generation. It's the same technology that powers HuggingChat.
We have also built a Core ML version of the 7B instruct model, and this is how it runs on an M1 MacBook Pro:
The video shows a lightweight app that leverages a Swift library for the heavy lifting: model loading, tokenization, input preparation, generation, and decoding. We are busy building this library to empower developers to integrate powerful LLMs in all types of applications without having to reinvent the wheel. It's still a bit rough, but we can't wait to share it with you. Meanwhile, you can download the Core ML weights from the repo and explore them yourself!
Inference
You can use the familiar transformers APIs to run the models on your own hardware, but you need to pay attention to a couple of details:
- The models were trained using the `bfloat16` datatype, so we recommend you use the same. This requires a recent version of CUDA and works best on modern cards. You may also try to run inference using `float16`, but keep in mind that the models were evaluated using `bfloat16` (a small dtype check is sketched right after this list).
- You need to allow remote code execution. This is because the models use a new architecture that is not part of `transformers` yet – instead, the necessary code is provided by the model authors in the repo. Specifically, these are the files whose code will be used if you allow remote execution (using `falcon-7b-instruct` as an example): configuration_RW.py, modelling_RW.py.
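As a small sketch (not from the original model cards), you can pick the dtype based on whether your GPU natively supports `bfloat16`:

```python
import torch

# Prefer bfloat16 (the dtype the models were trained and evaluated with),
# falling back to float16 on older cards without native bf16 support.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
```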
With these considerations, you can use the transformers pipeline API to load the 7B instruct model like this:
```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
```
And then, you'd run text generation using code like the following:
```python
sequences = pipeline(
    "Write a poem about Valencia.",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
And you may get something like the following:
```
Valencia, city of the sun
The city that glitters like a star
A city of a thousand colours
Where the night is illuminated by stars
Valencia, the city of my heart
Where the past is kept in a golden chest
```
Inference of Falcon 40B
Running the 40B model is challenging because of its size: it doesn't fit in a single A100 with 80 GB of RAM. Loading in 8-bit mode, it is possible to run in about 45 GB of RAM, which fits in an A6000 (48 GB) but not in the 40 GB version of the A100. This is how you'd do it:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    load_in_8bit=True,
    device_map="auto",
)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
```
Note, however, that mixed 8-bit inference will use `torch.float16` instead of `torch.bfloat16`, so make sure you test the results thoroughly.
If you have multiple cards and accelerate installed, you can take advantage of `device_map="auto"` to automatically distribute the model layers across various cards. It can even offload some layers to the CPU if necessary, but this will impact inference speed.
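For example, here is a hedged sketch of how you could cap per-device memory with `max_memory` so that accelerate spreads the layers over two GPUs and offloads the remainder to CPU RAM (the budgets below are placeholders, adjust them to your hardware):

```python
from transformers import AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    # Hypothetical budgets: anything that doesn't fit on the GPUs is offloaded to CPU RAM.
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "60GiB"},
)
```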
There's also the possibility to use 4-bit loading with the latest versions of bitsandbytes, transformers and accelerate. In this case, the 40B model takes ~27 GB of RAM to run. Unfortunately, this is slightly more than the memory available in cards such as a 3090 or 4090, but it's enough to run on 30 or 40 GB cards.
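Here is a sketch of what 4-bit loading could look like, assuming recent versions of bitsandbytes, transformers and accelerate:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "tiiuae/falcon-40b-instruct"

# Quantize the weights to 4-bit while computing in bfloat16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    trust_remote_code=True,
    device_map="auto",
)
```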
Text Generation Inference
Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models.
Its main features are:
- Continuous batching
- Token streaming using Server-Sent Events (SSE)
- Tensor Parallelism for faster inference on multiple GPUs
- Optimized transformers code using custom CUDA kernels
- Production ready logging, monitoring and tracing with Prometheus and Open Telemetry
Since v0.8.2, Text Generation Inference supports Falcon 7B and 40B models natively without relying on the Transformers "trust remote code" feature, allowing for airtight deployments and security audits. In addition, the Falcon implementation includes custom CUDA kernels to significantly decrease end-to-end latency.
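As a hedged example (not part of the original post), once a TGI container is serving a Falcon model you could query it with the `text-generation` Python client; the URL and generation parameters below are placeholders:

```python
from text_generation import Client

# Assumes a Text Generation Inference server is already running locally on port 8080.
client = Client("http://127.0.0.1:8080")

response = client.generate(
    "Write a poem about Valencia.",
    max_new_tokens=200,
    do_sample=True,
    top_k=10,
)
print(response.generated_text)
```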
*Inference Endpoints now support Text Generation Inference. Deploy the Falcon 40B Instruct model easily on 1x A100 with Int-8 quantization.*
Text Generation Inference is now integrated within Hugging Face's Inference Endpoints. To deploy a Falcon model, go to the model page and click on the Deploy -> Inference Endpoints widget.
For 7B models, we advise you to select "GPU [medium] – 1x Nvidia A10G".
For 40B models, you will have to deploy on “GPU [xlarge] – 1x Nvidia A100” and activate quantization:
Advanced configuration -> Serving Container -> Int-8 Quantization. Note: You might need to request a quota upgrade via email to api-enterprise@huggingface.co
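Once the endpoint is up, you can query it over HTTP like any other Inference Endpoint. A minimal sketch, with the endpoint URL and token being placeholders you copy from the Endpoints UI:

```python
import requests

ENDPOINT_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # placeholder
HF_TOKEN = "hf_..."  # your Hugging Face access token

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {HF_TOKEN}", "Content-Type": "application/json"},
    json={"inputs": "Write a poem about Valencia.", "parameters": {"max_new_tokens": 200}},
)
print(response.json())
```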
Evaluation
So how good are the Falcon models? An in-depth evaluation from the Falcon authors will be released soon, so in the meantime we ran both the base and instruct models through our open LLM benchmark. This benchmark measures both the reasoning capabilities of LLMs and their ability to provide truthful answers across the following domains:
- AI2 Reasoning Challenge (ARC): Grade-school multiple choice science questions.
- HellaSwag: Commonsense reasoning about everyday events.
- MMLU: Multiple-choice questions in 57 subjects (professional & academic).
- TruthfulQA: Tests the model’s ability to separate fact from an adversarially-selected set of incorrect statements.
The results show that the 40B base and instruct models are very strong, and currently rank 1st and 2nd on the LLM leaderboard 🏆!
As noted by Thomas Wolf, one surprising insight here is that the 40B models were pretrained on around half the compute needed for LLaMA-65B (2800 vs. 6300 petaflop-days), which suggests we haven't quite hit the limits of what is "optimal" for LLM pretraining.
For the 7B models, we see that the base model is better than llama-7b and edges out MosaicML's mpt-7b to become the current best pretrained LLM at this scale. A shortlist of popular models from the leaderboard is reproduced above for comparison:
Although the open LLM leaderboard doesn’t measure chat capabilities (where human evaluation is the gold standard), these preliminary results for the Falcon models are very encouraging!
Let's now take a look at how you can fine-tune your very own Falcon models – perhaps one of yours will end up on top of the leaderboard 🤗.
Fine-tuning with PEFT
Training 10B+ sized models can be technically and computationally challenging. In this section we look at the tools available in the Hugging Face ecosystem to efficiently train extremely large models on simple hardware, and show how to fine-tune Falcon-7B on a single NVIDIA T4 (16GB – Google Colab).
Let's see how we can train Falcon on the Guanaco dataset, a high-quality subset of the Open Assistant dataset consisting of around 10,000 dialogues. With the PEFT library we can use the recent QLoRA approach to fine-tune adapters that are placed on top of the frozen 4-bit model. You can learn more about the integration of 4-bit quantized models in this blog post.
Because only a tiny fraction of the model is trainable when using Low Rank Adapters (LoRA), both the number of learned parameters and the size of the trained artifact are dramatically reduced. As shown in the screenshot below, the saved model is only 65MB for the 7B parameter model (15GB in float16).
*The final repository has only 65MB of weights – compared to the original model, which has approximately 15GB in half precision.*
More specifically, after selecting the target modules to adapt (in practice the query / key layers of the attention module), small trainable linear layers are attached close to these modules, as illustrated below. The hidden states produced by the adapters are then added to the original states to get the final hidden state.
*The output activations of the original (frozen) pretrained weights (left) are augmented by a low rank adapter comprised of weight matrices A and B (right).*
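To make the picture concrete, here is a schematic of what a low rank adapter adds to a frozen linear layer. This is only an illustration of the idea, not the PEFT implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base layer W plus a trainable low-rank update (B @ A), scaled by alpha / r."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x):
        # The adapter output is added to the frozen layer's output.
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling
```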
Once trained, there is no need to save the entire model as the base model was kept frozen. In addition, it is possible to keep the model in any arbitrary dtype (int8, fp4, fp16, etc.) as long as the output hidden states from these modules are cast to the same dtype as the ones from the adapters – this is the case for bitsandbytes modules (`Linear8bitLt` and `Linear4bit`) that return hidden states with the same dtype as the original unquantized module.
We fine-tuned the two variants of the Falcon models (7B and 40B) on the Guanaco dataset. We fine-tuned the 7B model on a single NVIDIA T4 16GB, and the 40B model on a single NVIDIA A100 80GB. We used 4-bit quantized base models and the QLoRA method, as well as the recent SFTTrainer from the TRL library.
The full script to reproduce our experiments using PEFT is available here, but only a few lines of code are required to quickly run the SFTTrainer (without PEFT, for simplicity):
```python
from datasets import load_dataset
from trl import SFTTrainer
from transformers import AutoTokenizer, AutoModelForCausalLM

dataset = load_dataset("imdb", split="train")

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```
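For the full QLoRA setup, the sketch below adds 4-bit loading and a LoRA configuration on top of the same trainer. The dataset id, target module name and hyper-parameters are our assumptions for illustration, not the exact values used in the experiments:

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTTrainer
import torch

# Guanaco subset of Open Assistant (assumed dataset id).
dataset = load_dataset("timdettmers/openassistant-guanaco", split="train")

model_id = "tiiuae/falcon-7b"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    trust_remote_code=True,
    device_map="auto",
)

# Attach small trainable adapters to Falcon's fused attention projection (assumed module name).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=512,
)
trainer.train()
```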
Check out the original qlora repository for additional details about evaluating the trained models.
Fine-tuning Resources
Conclusion
Falcon is an exciting new large language model which can be used for commercial applications. In this blog post we showed its capabilities, how to run it in your own environment, and how easy it is to fine-tune on custom data within the Hugging Face ecosystem. We are excited to see what the community will build with it!





