Introduction to the Open LLM Falcon-40B: Performance, Training Data, and Architecture

  • Performance on the OpenLLM Leaderboard
  • Falcon RefinedWeb
  • Pre-Training of Falcon-40B and Falcon-7B
  • Instruct Versions of Falcon-40B/7B
  • How to Use Falcon-7B on Your GPU with QLoRa
  • Conclusion


Start using Falcon-7B, Falcon-40B, and their instruct versions

The head of a falcon.
Photo by Brandon on Unsplash

The Falcon models have drawn plenty of attention since they were released in May 2023.

They are causal large language models (LLM), or so-called “decoder-only” models, very similar to GPT.

Definition: Causal Language Model

Causal language modeling involves predicting the token that follows a sequence of tokens. During training, the model’s attention is solely directed toward the left context; the right context is masked. These models are typically trained on billions of words.
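
To make the definition concrete, here is a minimal sketch of next-token prediction with a small causal model (GPT-2 is used here purely as a convenient stand-in for any decoder-only LLM; the prompt is made up):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal ("decoder-only") model works the same way; GPT-2 is just small and convenient
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The falcon is a bird of", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one distribution over the vocabulary per position

# The prediction for the next token depends only on the tokens to its left
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode(next_token_id))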

The Falcon models have been completely free, even for commercial use (Apache 2.0 license), since May 31st. They are developed and trained by the Technology Innovation Institute (TII) of Abu Dhabi.

According to the first results, Falcon-40B, the largest of the Falcon models, outperforms all the other causal LLMs, including LLaMa-65B and MPT-7B.

In this blog post, I introduce in detail Falcon-40B, Falcon-7B, and their instruct versions. We will see how they perform compared with other models, how they were trained, and how you can run Falcon-7B on your own GPU with QLoRa.

The instruct version of Falcon-40B is ranked first on the OpenLLM leaderboard. The standard version is ranked second.

The OpenLLM leaderboard evaluates the performance of LLMs on 4 tasks:

  • AI2 Reasoning Challenge (25-shot): Questions of grade-school science.
  • HellaSwag (10-shot): A commonsense inference benchmark.
  • MMLU (5-shot): 57 tasks in various domains such as maths, computer science, and law.
  • TruthfulQA (0-shot): A benchmark that evaluates how truthful the model is when answering questions.

Falcon-40B outperforms Meta AI’s LLaMa-65B on all these tasks.
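
As a side note, the “n-shot” settings above mean that the evaluation prompt contains n solved examples before the actual question. A minimal sketch of how such a few-shot prompt is assembled (the questions below are invented for illustration, not taken from the actual benchmarks):

# Invented examples just to illustrate the few-shot prompt format
examples = [
    ("What gas do plants absorb from the air?", "Carbon dioxide"),
    ("Which planet is known as the Red Planet?", "Mars"),
]
question = "What force keeps the planets in orbit around the Sun?"

# Concatenate the solved examples, then the unanswered question
prompt = "".join(f"Question: {q}\nAnswer: {a}\n\n" for q, a in examples)
prompt += f"Question: {question}\nAnswer:"
print(prompt)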

The Falcon models were mainly trained on the Falcon RefinedWeb dataset. It was also created by TII and is distributed under an Apache 2.0 license.

RefinedWeb was extracted from CommonCrawl and has been thoroughly curated. TII claims it’s multimodal-friendly since they preserved links and alt texts of images.

In the dataset card published on the Hugging Face Hub, TII wrote: “[…]”. To me, it is thus unclear whether the Falcon models have been trained on this public version of the dataset, which is only an “extract”, or whether they used a bigger internal version.

This extract requires 2.8 TB of hard disk space once unpacked.

Since it is available on the Hugging Face Hub, you only have to run the following lines to start using it:

from datasets import load_dataset

# Download the RefinedWeb extract from the Hugging Face Hub
rw = load_dataset("tiiuae/falcon-refinedweb")

Note: You need the “datasets” library. If you don’t have it, you can install it with “pip install datasets”.
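
Given the size of the extract, you may prefer streaming it rather than downloading everything. A minimal sketch using the streaming mode of the “datasets” library (I assume here that the text column is named “content”; check the dataset card to confirm):

from datasets import load_dataset

# Stream the dataset instead of downloading the full ~2.8 TB extract to disk
rw_stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# Peek at the first three documents; "content" is assumed to be the text column
for i, doc in enumerate(rw_stream):
    print(doc["content"][:200])
    if i == 2:
        break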

RefinedWeb was combined with curated corpora to train the Falcon models.

This dataset represents 75% of the pre-training data of the Falcon models. It covers only English. To add more languages, they also prepared “RefinedWeb-Europe”, which covers several European languages: German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.

Finally, to cover more genres and domains, they added corpora of books, conversations (e.g., from Reddit), code, technical reports, and scientific papers (e.g., from arXiv). Note: They didn’t disclose the source for “code”. It is also unclear what licenses the datasets they compiled are under.

In total, that’s 1,500 billion tokens used to pre-train the Falcon models.


Falcon-40B has the following architecture:

  • Layers: 60
  • Embedding dimensions: 8,192
  • Heads: 64
  • Vocabulary size: 65,024
  • Sequence length: 2,048

This is very similar to the architecture of LLaMa, except that the vocabulary is twice as large.

In my opinion, the sequence length is quite short at a time when we see LLMs accepting sequences of more than 10,000 tokens, such as GPT-4 and Claude.

Falcon-7B has a smaller architecture that makes fine-tuning possible on consumer hardware. The only differences from the 40B version are that the number of layers and the embedding dimensions are roughly halved:

  • Layers: 32
  • Embedding dimensions: 4,544
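
If you want to verify these hyperparameters yourself, you can load only the model configurations from the Hub, without downloading any weights. A minimal sketch (trust_remote_code is needed because Falcon ships its own configuration code):

from transformers import AutoConfig

# Load only the configuration files (no weights) to inspect both architectures
for name in ["tiiuae/falcon-40b", "tiiuae/falcon-7b"]:
    config = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name)
    print(config)  # prints the number of layers, hidden size, heads, vocabulary size, etc.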

Both versions were trained with bfloat16 precision and AdamW. They used AWS SageMaker with 384 A100 40GB GPUs in P4d instances but have not yet disclosed how long the training lasted.

The instruct versions of Falcon-40B and Falcon-7B perform even better.

Falcon-40B-Instruct was trained on AWS SageMaker, using P4d instances equipped with 64 A100 40GB GPUs. For Falcon-7B-Instruct, they only used 32 A100 GPUs.

They were fine-tuned on 250 million tokens of a mixture of chat/instruct datasets sourced from Baize, GPT4all, and GPTeacher, plus 13 million tokens from the RefinedWeb corpus.

Baize is a dataset generated with ChatGPT. I would be cautious about using the instruct versions of the Falcon models in commercial applications. As per OpenAI’s terms of use:

“Restrictions. You may not […] (iii) use output from the Services to develop models that compete with OpenAI”

“Services” includes ChatGPT. And Falcon-40B is a model that may “compete” with OpenAI’s GPT models.

In a previous article, I introduced QLoRa to fine-tune LLMs on consumer hardware.

You can follow the same steps for Falcon-7B, but it won’t work on the free instance of Google Colab: the model requires too much CPU RAM.

If you have 32 GB of RAM on your computer, this should work. If you don’t have that much RAM, you will have to turn to cloud computing or Google Colab Pro, for instance.
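
A rough back-of-the-envelope calculation explains why: the Falcon-7B checkpoint stores about 7 billion parameters in bfloat16 (2 bytes each), so roughly 14 GB have to pass through CPU RAM during loading, on top of what the operating system and Python already use. Under these assumptions, 16 GB is tight and 32 GB is comfortable.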

Once you have an environment that can support Falcon-7B, there are still some minor modifications to make to my QLoRa tutorial.

First, you have to install “einops”:

pip install -q einops

Then, modify the loading of the model as follows:

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto", trust_remote_code=True)

On this line, “trust_remote_code=True” is important. This is how Hugging Face gets your consent for code to be executed directly on your machine by the model. Here, Falcon runs a configuration script.
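
For reference, the line above assumes that model_id and bnb_config were defined earlier, as in my QLoRa tutorial. A minimal sketch of what they could look like with 4-bit NF4 quantization (illustrative values, not necessarily the exact settings from the tutorial):

import torch
from transformers import BitsAndBytesConfig

model_id = "tiiuae/falcon-7b"

# Illustrative 4-bit NF4 quantization config, as commonly used with QLoRa
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)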

Apart from that, everything else should work just like in my tutorial.
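
If you then attach the LoRa adapters as in the tutorial, the main Falcon-specific detail is the name of the attention projection module to target. A minimal sketch with the “peft” library (the hyperparameters are illustrative, and the module name “query_key_value” is an assumption based on the Falcon implementation; double-check it against the checkpoint you load):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# Illustrative LoRa hyperparameters; "query_key_value" is the fused attention projection in Falcon
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()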

If you don’t want to use QLoRa and have access to a GPU cluster, the standard way of loading and running Falcon-7B/Falcon-40B is described in the Hugging Face model cards:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)
# Build a text-generation pipeline; bfloat16 and device_map="auto" spread the model over the available GPUs
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The Falcon models are pre-trained LLMs. You can use them for any natural language processing task if you have the data to fine-tune them. Note that, even without fine-tuning, the standard (non-instruct) versions already perform very well on many tasks, as shown on the OpenLLM leaderboard for answering questions from various domains and for commonsense inference.

The “instruct” versions of the Falcon models are already fine-tuned. They behave like ChatGPT, i.e., a chatbot with general knowledge.

The Falcon models are also very interesting alternatives to the popular LLaMa model. Falcon-40B is:

  • Smaller: LLaMa has 65 billion parameters while Falcon-40B has only 40 billion, so it requires less memory.
  • Better: On the OpenLLM leaderboard, Falcon-40B is ranked first.
  • Free for commercial use: Falcon models are distributed under an Apache 2.0 license allowing commercial use, while LLaMa can only be used for research purposes.

If you are interested in getting more details about these models, keep an eye on this blog post. TII will release a scientific/technical paper describing in more detail what they did. I’ll drop the link here once it’s online.
