Introduction to the Open LLM Falcon-40B: Performance, Training Data, and Architecture

  • Performance on OpenLLM
  • Falcon RefinedWeb
  • Pre-Training of Falcon-40B and Falcon-7B
  • Instruct versions of Falcon-40B/7B
  • How to Use Falcon-7B on Your GPU with QLoRa
  • Conclusion


Start using Falcon-7B, Falcon-40B, and their instruct versions

The head of a falcon.
Photo by Brandon on Unsplash

The Falcon models have drawn a lot of attention since their release in May 2023.

They’re causal large language models (LLMs), or so-called “decoder-only” models, much like GPT.

Definition: Causal Language Model

Causal language modeling involves predicting the token that follows a sequence of tokens. During training, the model’s attention is directed solely toward the left context. The right context is masked. These models are often trained on billions of words.
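
To make the masking concrete, here is a minimal sketch in plain Python. It only builds the visibility pattern (which positions may attend to which), not a real attention layer. The function name and the example sentence are made up for illustration.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix of booleans.

    Entry [i][j] is True when position i may attend to position j,
    i.e. when j is at or to the left of i. Everything to the right is masked.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "falcon", "flies", "fast"]
mask = causal_mask(len(tokens))

for i, row in enumerate(mask):
    # Keep only the tokens that position i is allowed to see
    visible = [tok for tok, allowed in zip(tokens, row) if allowed]
    print(f"{tokens[i]!r} sees {visible}")
```

Each position sees itself and everything to its left, which is why the model can be trained to predict the next token at every position of a sequence in a single pass.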

The Falcon models have been completely free, even for commercial use (Apache 2.0 license), since May 31st. They are developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi.

According to the first results, Falcon-40B, the largest of the Falcon models, outperforms all the other causal LLMs, including LLaMa-65B and MPT-7B.

In this blog post, I introduce in detail Falcon-40B, Falcon-7B, and their instruct versions. We’ll see how they perform compared with other models, how they were trained, and how to run Falcon-7B on your own GPU with QLoRa.

The instruct version of Falcon-40B is ranked first on the OpenLLM leaderboard. The standard version is ranked second.

The OpenLLM leaderboard evaluates the performance of LLMs on 4 tasks:

  • AI2 Reasoning Challenge (25-shot): Questions of grade-school science.
  • HellaSwag (10-shot): A commonsense inference benchmark.
  • MMLU (5-shot): 57 tasks in various domains such as maths, computer science, and law.
  • TruthfulQA (0-shot): A benchmark that evaluates how truthful the model is when answering questions.
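
The “k-shot” labels indicate how many solved examples are placed in the prompt before the question being evaluated. The sketch below illustrates the idea only. The “Question:/Answer:” template, the function name, and the sample questions are assumptions made for illustration, not the format used by the leaderboard.

```python
def build_few_shot_prompt(examples, question, k):
    """Build a k-shot prompt: k solved examples, then the unanswered question."""
    shots = examples[:k]
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in shots]
    # The final question is left open for the model to complete
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


examples = [
    ("What gas do plants absorb from the air?", "Carbon dioxide"),
    ("At what temperature does water boil, in Celsius?", "100"),
]

# 2-shot: two solved examples precede the question
print(build_few_shot_prompt(examples, "Which planet is closest to the Sun?", k=2))
print("---")
# 0-shot: the question alone, as for TruthfulQA
print(build_few_shot_prompt(examples, "Which planet is closest to the Sun?", k=0))
```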

Falcon-40B outperforms Meta AI’s LLaMa-65B on all these tasks.

The Falcon models were mainly trained on the Falcon RefinedWeb dataset. It was also created by TII and is distributed under an Apache 2.0 license.

RefinedWeb was extracted from CommonCrawl and has been thoroughly curated. TII claims it’s multimodal-friendly since they preserved links and alt texts of images.

In the dataset card published on the Hugging Face Hub, TII wrote: “ […]”. To me, it’s thus unclear whether the Falcon models were trained on this public version of the dataset, which is only an “extract”, or whether they used a much bigger internal version.

This extract requires 2.8 TB of hard drive space to be unpacked.

Since it is available on the Hugging Face Hub, you just have to run the following lines to start using it:

from datasets import load_dataset
rw = load_dataset("tiiuae/falcon-refinedweb")

Note: You need the “datasets” library. If you don’t have it, you can install it with “pip install datasets”.
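
If you only want to inspect a few records without unpacking the whole extract, the “datasets” library can stream a dataset instead of downloading it first. The following is a sketch under that assumption. The helper names `open_refinedweb_stream` and `peek` are hypothetical, and the record fields are printed rather than assumed.

```python
from itertools import islice


def open_refinedweb_stream():
    """Open RefinedWeb in streaming mode (needs network access)."""
    # Imported here so that peek() below stays usable without the library
    from datasets import load_dataset

    # streaming=True returns an iterable that fetches records on demand
    return load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)


def peek(records, n=3):
    """Return the first n records of any iterable of records."""
    return list(islice(records, n))


# Usage (downloads a small amount of data on first call):
# for record in peek(open_refinedweb_stream(), n=2):
#     print(sorted(record.keys()))
```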

RefinedWeb was combined with curated corpora to train the Falcon models.

This dataset represents 75% of the pre-training data of the Falcon models. It covers only English. To add more languages, they also prepared “RefinedWeb-Europe”, which covers several European languages: German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.

Finally, to cover more genres and domains, they added corpora of books, conversations (e.g., from Reddit), code, technical reports, and scientific papers (e.g., from arXiv). Note: They didn’t disclose the source for “code”. It is also unclear what the licenses of the datasets they compiled are.

In total, that’s 1,500 billion tokens used to pre-train the Falcon models.

For pre-training, they used:

Falcon-40B has the following architecture:

  • Layers: 60
  • Embedding dimensions: 8,192
  • Head dimension: 64
  • Vocabulary size: 65,024
  • Sequence length: 2,048

This is very similar to the architecture of LLaMa, except that the vocabulary is twice as large.
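
As a quick back-of-the-envelope check on what the larger vocabulary costs, the token-embedding matrix has one row per vocabulary entry and one column per embedding dimension. The sketch below uses the Falcon-40B numbers listed above. The LLaMa vocabulary size of 32,000 is an assumption that is not stated in this article, and the function name is made up for illustration.

```python
def embedding_params(vocab_size, d_model):
    """Number of weights in a token-embedding matrix of shape (vocab_size, d_model)."""
    return vocab_size * d_model


FALCON_40B_VOCAB = 65_024
FALCON_40B_D_MODEL = 8_192
LLAMA_VOCAB = 32_000  # assumption: not given in this article

falcon_emb = embedding_params(FALCON_40B_VOCAB, FALCON_40B_D_MODEL)

print(f"Falcon-40B embedding weights: {falcon_emb:,}")  # 532,676,608
print(f"Vocabulary ratio vs. LLaMa: {FALCON_40B_VOCAB / LLAMA_VOCAB:.2f}")  # 2.03
```

Under these assumptions, the embedding matrix accounts for roughly half a billion of the 40 billion parameters, so the doubled vocabulary is a small part of the model size.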

In my opinion, the sequence length is rather short at a time when we see LLMs accepting sequences of more than 10,000 tokens, such as GPT-4 and Claude.

Falcon-7B has a smaller architecture that enables its fine-tuning on consumer hardware. The only differences with the 40B version are that the number of layers and the embedding dimensions are roughly halved:

  • Layers: 32
  • Embedding dimensions: 4,544

Both versions were trained with bfloat16 precision and AdamW. They used AWS SageMaker with 384 A100 40GB GPUs in P4d instances but haven’t yet disclosed how long the training lasted.

The instruct versions of Falcon-40B and 7B perform even better.

Falcon-40B-Instruct was trained on AWS SageMaker, using P4d instances equipped with 64 A100 40GB GPUs. For Falcon-7B-Instruct, they only used 32 A100 GPUs.

They were fine-tuned on 250 million tokens of a mixture of chat/instruct datasets sourced from Baize, GPT4all, GPTeacher, and 13 million tokens from the RefinedWeb corpus.

Baize is a dataset generated by ChatGPT. I would be cautious about using the instruct versions of the Falcon models in commercial applications. As per OpenAI’s terms of use:

“Restrictions. You may not […] (iii) use output from the Services to develop models that compete with OpenAI”

“Services” includes ChatGPT. And Falcon-40B is a model that can “compete” with OpenAI’s GPT models.

In a previous article, I introduced QLoRa to fine-tune LLMs on consumer hardware:

You can follow the same steps for Falcon-7B, but it won’t work on the free instance of Google Colab. The model requires too much CPU RAM.

If you have 32 GB of RAM in your computer, this should work. If you don’t have that much RAM, you will have to opt for cloud computing or Google Colab Pro, for instance.

Once you have an environment that can support Falcon-7B, there are still some minor modifications to make to my QLoRa tutorial.

First, you have to install “einops”:

pip install -q einops

Then, modify the loading of the model as follows:

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", trust_remote_code=True)

In this line, “trust_remote_code=True” is crucial. This is how Hugging Face gets your consent for code shipped with the model to be executed directly on your machine. Here, Falcon runs a configuration script.
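
The loading line above relies on `model_id` and `bnb_config`, which come from the QLoRa tutorial. For readers who don’t have it at hand, here is a sketch of what they could look like. The 4-bit settings below (NF4 with double quantization and bfloat16 compute) are common QLoRa choices and an assumption here, not values confirmed by this article. The names `BNB_4BIT_KWARGS` and `load_falcon_4bit` are hypothetical.

```python
MODEL_ID = "tiiuae/falcon-7b"

# Assumed 4-bit quantization settings (typical for QLoRa, not taken from this article)
BNB_4BIT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}


def load_falcon_4bit(model_id=MODEL_ID):
    """Load the tokenizer and a 4-bit quantized Falcon model (needs a GPU and bitsandbytes)."""
    # Heavy imports are kept inside the function so the settings can be inspected without them
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **BNB_4BIT_KWARGS
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True,  # required: Falcon ships its own code
    )
    return tokenizer, model


# Usage (downloads the model weights):
# tokenizer, model = load_falcon_4bit()
```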

Apart from that, everything else should work just as in my tutorial.

If you don’t want to use QLoRa and have access to a GPU cluster, the standard way of loading and running Falcon-7B/Falcon-40B is as described in the Hugging Face model cards:

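from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)
# Build a text-generation pipeline; trust_remote_code=True lets Falcon's own code run
pipeline = transformers.pipeline(
    "text-generation", model=model, tokenizer=tokenizer,
    torch_dtype=torch.bfloat16, trust_remote_code=True, device_map="auto",
)
# Sampling settings as given in the model card
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200, do_sample=True, top_k=10,
    num_return_sequences=1, eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")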

The Falcon models are pre-trained LLMs. You can use them for any natural language processing task if you have the data to fine-tune them. Note that, even without fine-tuning, the standard (non-instruct) versions already perform very well on many tasks, as shown on the OpenLLM leaderboard for answering questions from various domains and for commonsense inference.

The “instruct” versions of the Falcon models are already fine-tuned. They behave like ChatGPT, i.e., a chatbot with general knowledge.

The Falcon models are also very interesting alternatives to the popular LLaMa model. Falcon-40B is:

  • Smaller: LLaMa has 65 billion parameters while Falcon-40B has only 40 billion parameters, so it requires less memory.
  • Better: On the OpenLLM leaderboard, Falcon-40B is ranked first.
  • Free: Falcon models are distributed under an Apache 2.0 license allowing commercial use, while LLaMa can only be used for research purposes.
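
To put a rough number on the memory difference, the weights alone take about (number of parameters) × (bytes per parameter). The sketch below is a simplified estimate under that assumption. It ignores activations, the optimizer state, and quantization overhead, and the function name is made up for illustration.

```python
def weight_memory_gb(n_params_billion, bytes_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 10^9 bytes)."""
    return n_params_billion * bytes_per_param


for name, size in [("LLaMa-65B", 65), ("Falcon-40B", 40), ("Falcon-7B", 7)]:
    bf16 = weight_memory_gb(size, 2)        # bfloat16: 2 bytes per parameter
    four_bit = weight_memory_gb(size, 0.5)  # 4-bit quantization: half a byte per parameter
    print(f"{name}: ~{bf16:.0f} GB in bfloat16, ~{four_bit:.1f} GB in 4-bit")
```

By this estimate, Falcon-40B needs about 80 GB for its weights in bfloat16 versus about 130 GB for LLaMa-65B.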

If you are interested in getting more details about these models, keep an eye on this blog post. TII will release a scientific paper/technical report describing in more detail what they did. I’ll drop the link here once it’s online.

