The Falcon 2 Models
TII is launching a new generation of models, Falcon 2, focused on providing the open-source community with a series of smaller models with enhanced performance and multi-modal support. Our goal is to enable cheaper inference and encourage the development of more downstream applications with improved usability.
The first generation of Falcon models, featuring Falcon-40B and Falcon-180B, made a significant contribution to the open-source community, promoting the release of advanced LLMs with permissive licenses. More detailed information on the previous generation of Falcon models can be found in the RefinedWeb (Penedo et al., 2023) and The Falcon Series of Open Language Models (Almazrouei et al., 2023) papers, and the Falcon and Falcon-180B blog posts.
The second generation of models is focused on increased usability and integrability, building a multi-modal ecosystem. We start this journey by releasing not only the base 11B LLM, but also an 11B VLM that incorporates image understanding capabilities. The vision-language model, or VLM, allows users to engage in chats about visual content using text.
As with our previous work, the models offer support mainly in English but have good capabilities in ten other languages, including Spanish, French, and German.
Falcon2-11B LLM
Training Data
Falcon2-11B was trained on over 5,000 GT (billion tokens) of RefinedWeb, a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. Training followed a four-stage strategy: the first three stages focused on increasing the context length, from 2048 to 4096 and finally to 8192 tokens, while the last stage aimed to further enhance performance using only high-quality data.
Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (cs, de, es, fr, it, nl, pl, pt, ro, sv), high-quality technical data, code data, and conversational data extracted from public sources.
The training stages were as follows:
| Stage | Context Length | GT |
|---|---|---|
| Stage 1 | 2048 | 4500 |
| Stage 2 | 4096 | 250 |
| Stage 3 | 8192 | 250 |
| Stage 4 | 8192 | 500 |
The data was tokenized with the Falcon2-11B tokenizer, the same tokenizer used for the previous Falcon models.
Model Architecture
The following table summarizes some of the key details of the model architecture (a rough sanity check of these numbers follows the table):
| Design choice | Value |
|---|---|
| Number of Transformer Blocks | 60 |
| Number of Query Heads | 32 |
| Number of Key/Value Heads | 8 |
| Head Dimensions | 128 |
| Parallel Attention | yes |
| MLP Upscale Factor | 4 |
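As a rough sanity check on these figures, the back-of-the-envelope sketch below derives the hidden size and an approximate parameter count from the table; the inferred hidden size, the standard two-layer MLP, and the embedding estimate are our own assumptions for illustration, not specifications from the model card.

```python
# Illustrative back-of-the-envelope check, not an official specification.
num_blocks = 60
num_query_heads = 32
num_kv_heads = 8                # grouped-query attention: 4 query heads share each KV head
head_dim = 128
hidden_size = num_query_heads * head_dim      # 4096 (inferred, not listed in the table)
mlp_hidden = 4 * hidden_size                  # 16384, from the MLP upscale factor

# Approximate per-block parameter counts (ignoring biases and layer norms),
# assuming a standard two-layer MLP rather than a gated one.
qkv_proj = hidden_size * (num_query_heads + 2 * num_kv_heads) * head_dim
out_proj = hidden_size * hidden_size
mlp = 2 * hidden_size * mlp_hidden
per_block = qkv_proj + out_proj + mlp

print(f"~{num_blocks * per_block / 1e9:.1f}B parameters in the transformer blocks")
# Token embeddings and the output head add roughly another 0.5B,
# bringing the total close to the 11B in the model name.
```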
Training Procedure
Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128; 8 × 1 × 128 = 1024 GPUs) combined with ZeRO and Flash-Attention 2.
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Precision | bfloat16 |
| Optimizer | AdamW |
| Max LR | 3.7e-4 |
| Min LR | 1.89e-5 |
| LR schedule | Cosine decay (stage 1) |
| Context length | 8192 (stages 3 and 4) |
| Weight decay | 1e-1 |
| Z-loss | 1e-4 |
| Batch size | Variable |
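To illustrate how the max LR, min LR, and cosine-decay entries above fit together, here is a minimal sketch of such a schedule; the warmup length and step counts are placeholder assumptions, not the actual training configuration.

```python
import math

# LR values taken from the hyperparameter table; everything else is illustrative.
MAX_LR, MIN_LR = 3.7e-4, 1.89e-5

def cosine_lr(step: int, total_steps: int, warmup_steps: int = 1000) -> float:
    """Linear warmup (assumed) followed by cosine decay from MAX_LR to MIN_LR."""
    if step < warmup_steps:
        return MAX_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

# Example: learning rate halfway through a hypothetical 100k-step run.
print(cosine_lr(50_000, 100_000))
```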
Falcon2-11B Evaluation
English performance
Performance on Open LLM Leaderboard tasks:
| Checkpoint | GT | HellaSwag-10 | Winogrande-5 | ArcChallenge-25 | TruthfulQA-0 | MMLU-5 | GSM8K-5 | Average |
|---|---|---|---|---|---|---|---|---|
| Falcon2-11B | 5500 | 82.91 | 78.30 | 59.73 | 52.56 | 58.37 | 53.83 | 64.28 |
| Falcon-40B | 1000 | 85.28 | 81.29 | 61.86 | 41.65 | 56.89 | 21.46 | 58.07 |
| Falcon-7B | 1500 | 78.13 | 72.38 | 47.87 | 34.26 | 27.79 | 4.62 | 44.17 |
| Gemma-7B | 6000 | 82.47 | 78.45 | 61.09 | 44.91 | 66.03 | 52.77 | 64.29 |
| Llama3-8B | 15000 | 82.09 | 77.35 | 59.47 | 43.90 | 66.69 | 44.79 | 62.38 |
| Mistral-7B | N/A | 83.31 | 78.37 | 59.98 | 42.15 | 64.16 | 37.83 | 60.97 |
The Hugging Face Leaderboard team provided an official evaluation of our model on the Open LLM Leaderboard tasks. The model performs better than models such as Llama3-8B (trained on three times more data) and Mistral-7B, and on par with Gemma-7B.
Zero-shot performance:
| Checkpoint | GT | HellaSwag | ArcEasy | Winogrande | ArcChallenge |
|---|---|---|---|---|---|
| Falcon2-11B | 5500 | 82.07 | 77.78 | 78.30 | 50.17 |
| Falcon-40B | 1000 | 82.82 | 81.86 | 76.4 | 54.69 |
| Falcon-7B | 1500 | 76.31 | 74.74 | 67.17 | 43.43 |
The evaluation results show that Falcon2-11B achieves similar performance to Falcon-40B, at a four times smaller model size!
Multilingual capabilities
Using the Multilingual LLM Leaderboard, we compare the Falcon2-11B model to Llama-7B and Bloom-7B. For reference, we also include Falcon-40B (which supports the same languages), Falcon-7B (which supports French), and Mistral-7B.
| Model | Language ID | ArcChallenge-25 | HellaSwag | MMLU-25 | TQA | Average |
|---|---|---|---|---|---|---|
| Falcon2-11B | de | 43.7 | 67.96 | 38.3 | 47.53 | 49.37 |
| | es | 46.2 | 73.63 | 37.9 | 46.43 | 51.06 |
| | fr | 45.8 | 72.41 | 39.53 | 47.30 | 51.27 |
| | it | 45.6 | 70.83 | 38.05 | 47.14 | 50.42 |
| | nl | 41.7 | 69.05 | 38.29 | 48.81 | 49.47 |
| | ro | 42.4 | 66.24 | 38.01 | 45.53 | 48.04 |
| Falcon-40B | de | 45.1 | 68.3 | 36.2 | 39.8 | 47.4 |
| | es | 48.5 | 73.9 | 37.2 | 39.0 | 49.6 |
| | fr | 47.6 | 72.9 | 37.3 | 38.5 | 49.1 |
| | it | 46.3 | 70.2 | 36.4 | 40.7 | 48.4 |
| | nl | 42.9 | 68.4 | 36.5 | 40.9 | 47.1 |
| | ro | 43.2 | 66.0 | 35.7 | 39.8 | 46.2 |
| Falcon-7B | fr | 37.3 | 64.1 | 28.4 | 34.0 | 40.9 |
| Mistral-7B | de | 41.2 | 58.7 | 40.5 | 44.9 | 46.3 |
| | es | 44.2 | 65.3 | 42.4 | 43.1 | 48.7 |
| | fr | 44.9 | 64.4 | 41.9 | 43.0 | 48.6 |
| | it | 43.2 | 60.9 | 39.7 | 43.1 | 46.7 |
| | nl | 40.0 | 57.9 | 41.4 | 43.3 | 45.7 |
| | ro | 40.7 | 53.6 | 39.3 | 43.6 | 44.3 |
| Llama-7B | de | 35.1 | 49.9 | 29.9 | 38.3 | 38.3 |
| | es | 36.8 | 56.4 | 30.3 | 37.0 | 40.1 |
| | fr | 37.3 | 55.7 | 30.5 | 39.9 | 40.9 |
| | it | 35.8 | 52.0 | 29.9 | 39.6 | 39.3 |
| | nl | 33.6 | 48.7 | 29.8 | 40.0 | 38.0 |
| | ro | 32.4 | 44.9 | 29.7 | 37.0 | 36.0 |
| Bloom-7B | de | 26.3 | 32.4 | 28.1 | 43.7 | 32.6 |
| | es | 38.1 | 56.7 | 28.9 | 40.4 | 41.0 |
| | fr | 36.7 | 56.6 | 29.9 | 40.9 | 41.0 |
| | it | 29.0 | 40.8 | 27.6 | 43.7 | 35.3 |
| | nl | 23.1 | 31.7 | 27.5 | 42.7 | 31.3 |
| | ro | 26.9 | 31.8 | 27.4 | 46.1 | 33.1 |
In the spirit of the original Falcon models, Falcon2-11B was trained not only on English data but also on ten other languages. Our multilingual evaluation results show that the model has good capabilities in the six languages (de, es, fr, it, nl, ro) featured on the Multilingual LLM Leaderboard and actually performs better than Falcon-40B and several other multilingual models across all of the cited languages.
We will soon release more extensive evaluation results for multilingual capabilities in the Falcon2-11B model card!
Code generation capabilities
We check the model’s performance on code generation against the BigCode Leaderboard on the HumanEval benchmark for the Python language, obtaining a pass@1 of 29.59%.
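For reference, HumanEval scores are computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021); with k=1 it reduces to the fraction of problems whose sampled completion passes the unit tests. The snippet below is a minimal sketch of that formula with placeholder counts, not our actual evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples per problem, c of which pass the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) pairs for three problems; benchmark pass@1 is the mean.
results = [(1, 1), (1, 0), (1, 1)]
print(sum(pass_at_k(n, c, 1) for n, c in results) / len(results))
```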
Using Falcon2-11B
```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-11B"

# Load the tokenizer and build a text-generation pipeline in bfloat16,
# letting the weights be placed automatically across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
Then, you can run text generation using code like the following:
```python
sequences = pipeline(
    "Can you explain the concept of Quantum Computing?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```
Falcon2-11B VLM
Falcon2-11B VLM is a vision-language model (VLM) built on top of the LLM that additionally handles image inputs and is capable of answering queries about the images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.
To enhance the VLM’s perception of fine-grained details and small objects in images, we employ a dynamic encoding mechanism at high resolution for image inputs, similar to LLaVA-Next.
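To give an intuition for this dynamic encoding, the sketch below shows a simplified LLaVA-Next-style tiling of a high-resolution image into fixed-size crops plus a downscaled global view; the 336-pixel tile size and the grid-selection logic are assumptions for illustration, not the exact Falcon2-11B VLM configuration.

```python
from PIL import Image

TILE = 336  # assumed crop size matching a CLIP ViT-L/14 input resolution

def dynamic_tiles(image: Image.Image, tile: int = TILE):
    """Simplified LLaVA-Next-style tiling: cut the image into fixed-size crops
    plus a downscaled global view, so small objects keep more pixels per patch."""
    cols = max(1, round(image.width / tile))
    rows = max(1, round(image.height / tile))
    resized = image.resize((cols * tile, rows * tile))
    crops = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows)
        for c in range(cols)
    ]
    global_view = image.resize((tile, tile))  # keeps overall scene context
    return [global_view] + crops
```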
Training
Training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is kept frozen and only the multimodal projector is trained on 558K image-caption pairs.
This allows the multimodal projector to learn a mapping from the visual to the text embedding space. During finetuning, both the projector and the LLM weights are trained on a corpus of 1.2M image-text instruction examples from public datasets, which also includes multi-round conversations.
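Below is a minimal sketch of how this two-stage freezing recipe could look with the released checkpoint, assuming the LLaVA-Next-style module names (vision_tower, multi_modal_projector, language_model) exposed by transformers; it is illustrative only, not the actual training code.

```python
import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16
)

# Stage 1 (pretraining): freeze the vision encoder and the LLM,
# train only the multimodal projector on image-caption pairs.
for p in model.vision_tower.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False
for p in model.multi_modal_projector.parameters():
    p.requires_grad = True

# Stage 2 (finetuning): unfreeze the LLM as well; the vision encoder stays frozen.
for p in model.language_model.parameters():
    p.requires_grad = True
```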
Falcon2-11B VLM Evaluation
| Model | MME | GQA | SQA | POPE | VQAv2 | TextVQA | MM-Bench | SEED-IMG | Average |
|---|---|---|---|---|---|---|---|---|---|
| Falcon2-11B VLM | 1589/343 | 64.5 | 74.9 | 88.4 | 82.1 | 66.7 | 72.0 | 72.3 | 74.4 |
| LLaVA-1.6 (Vicuna-7B) | 1519/332 | 64.2 | 70.1 | 86.5 | 81.8 | 64.9 | 67.4 | 70.2 | 72.1 |
| LLaVA-1.6 (Vicuna-13B) | 1575/326 | 65.4 | 73.6 | 86.2 | 82.8 | 67.1 | 70.0 | 71.9 | 73.8 |
| LLaVA-1.6 (Mistral-7B) | 1498/321 | 64.8 | 72.8 | 86.7 | 82.2 | 65.7 | 68.7 | 72.2 | 73.3 |
Using Falcon2-11B-FalconVLM
```python
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm")
model = LlavaNextForConditionalGeneration.from_pretrained("tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16)

# Download an example image and build a prompt containing the image placeholder.
url = "https://merzougabirding.com/wp-content/uploads/2023/09/falcon-size.jpg"
falcon_image = Image.open(requests.get(url, stream=True).raw)
prompt = "User:<image>\nWhat's special about this bird's vision?"

inputs = processor(prompt, images=falcon_image, return_tensors="pt", padding=True).to('cuda:0')

model.to('cuda:0')
output = model.generate(**inputs, max_new_tokens=256)

prompt_length = inputs['input_ids'].shape[1]
generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()

print(generated_captions)
```
License information
The Falcon 2 models are made available under the TII Falcon 2 License, a permissive Apache 2.0-based software license that includes an acceptable use policy promoting the responsible use of AI. This license was crafted in the spirit of TII’s commitment to the open-source community.
