An 11B-parameter pretrained language model and VLM, trained on over 5,000 billion tokens and supporting 11 languages






The Falcon 2 Models

TII is launching a new generation of models, Falcon 2, focused on providing the open-source community with a series of smaller models with enhanced performance and multi-modal support. Our goal is to enable cheaper inference and encourage the development of more downstream applications with improved usability.

The first generation of Falcon models, featuring Falcon-40B and Falcon-180B, made a significant contribution to the open-source community, promoting the release of advanced LLMs with permissive licenses. More detailed information on the previous generation of Falcon models can be found in the RefinedWeb, Penedo et al., 2023 and The Falcon Series of Open Language Models, Almazrouei et al., 2023 papers, and the Falcon and Falcon-180B blog posts.

The second generation of models is focused on increased usability and integrability, building a multi-modal ecosystem. We start this journey by releasing not only the base 11B LLM, but also the 11B VLM model that incorporates image understanding capabilities. The vision-language model, or VLM, will allow users to engage in chats about visual content using text.

As with our previous work, the models offer support mainly in English but have good capabilities in ten other languages, including Spanish, French, and German.



Table of Contents

The Falcon 2 Models
Falcon2-11B LLM
Falcon2-11B Evaluation
Using Falcon2-11B
Falcon2-11B VLM
Falcon2-11B VLM Evaluation
Using Falcon2-11B-FalconVLM
License information



Falcon2-11B LLM



Training Data

Falcon2-11B was trained on over 5,000 GT (billion tokens) of RefinedWeb, a high-quality filtered and deduplicated web dataset, enhanced with curated corpora. It followed a four-stage training strategy. The first three stages were focused on increasing the context length, from 2048 to 4096 and finally to 8192 tokens. The last stage aimed to further enhance performance using only high-quality data.

Overall, the data sources included RefinedWeb-English, RefinedWeb-Europe (cs, de, es, fr, it, nl, pl, pt, ro, sv), high-quality technical data, code data, and conversational data extracted from public sources.

The training stages were as follows:

Stage Context Length GT
Stage 1 2048 4500
Stage 2 4096 250
Stage 3 8192 250
Stage 4 8192 500
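
Summing the per-stage token counts recovers the roughly 5,500 GT seen by the final checkpoint; the quick check below is just arithmetic over the table above:

# Tokens (in billions, GT) consumed per training stage, from the table above.
stage_gt = {"stage 1": 4500, "stage 2": 250, "stage 3": 250, "stage 4": 500}
print(sum(stage_gt.values()))  # 5500 GT in total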

The data was tokenized with the Falcon2-11B tokenizer, the same tokenizer used for the previous Falcon models.
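
As a quick illustration (a minimal sketch, assuming the tiiuae/falcon-11B checkpoint used in the usage section below), the tokenizer can be loaded and inspected with transformers:

from transformers import AutoTokenizer

# Load the Falcon2-11B tokenizer and tokenize a short sentence.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-11B")
ids = tokenizer("Falcon 2 is a family of open models.")["input_ids"]
print(len(ids))               # number of tokens
print(tokenizer.decode(ids))  # round-trip back to text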



Model Architecture

The following table summarizes some of the key details of the model architecture:

Design choice Value
Number of Transformer Blocks 60
Number of Query Heads 32
Number of Key/Value Heads 8
Head Dimensions 128
Parallel Attention yes
MLP Upscale Factor 4
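
These values can be cross-checked against the released configuration (a minimal sketch; the exact attribute names in the config may differ between transformers versions, so printing the full config is the safest approach):

from transformers import AutoConfig

# Load the model configuration from the Hub and print it to inspect the
# architecture hyperparameters summarized in the table above.
config = AutoConfig.from_pretrained("tiiuae/falcon-11B")
print(config)

# Assumed mapping onto the table (field names may vary):
# num_hidden_layers    -> number of transformer blocks (60)
# num_attention_heads  -> number of query heads (32)
# num_kv_heads         -> number of key/value heads (8)
# hidden_size // num_attention_heads -> head dimension (128)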



Training Procedure

Falcon2-11B was trained on 1024 A100 40GB GPUs for the majority of the training, using a 3D parallelism strategy (TP=8, PP=1, DP=128) combined with ZeRO and Flash-Attention 2.
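
As a sanity check, the degrees of tensor, pipeline, and data parallelism multiply out to the GPU count quoted above:

# 3D parallelism layout used for most of the training.
TP, PP, DP = 8, 1, 128  # tensor, pipeline, and data parallel degrees
print(TP * PP * DP)     # 1024 GPUs (A100 40GB)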



Training Hyperparameters

Hyperparameter Value
Precision bfloat16
Optimizer AdamW
Max LR 3.7e-4
Min LR 1.89e-5
LR schedule Cos decay (stage 1)
Context length 8192 (stages 3 and 4)
Weight decay 1e-1
Z-loss 1e-4
Batch size Variable
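
For illustration, here is a minimal sketch of a cosine decay between the maximum and minimum learning rates listed above (warmup and the exact stage boundaries are omitted; this is not the actual training schedule code):

import math

MAX_LR, MIN_LR = 3.7e-4, 1.89e-5

def cosine_lr(step: int, total_steps: int) -> float:
    # Cosine decay from MAX_LR down to MIN_LR over total_steps.
    progress = min(step / total_steps, 1.0)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

for step in (0, 2500, 5000, 7500, 10000):
    print(step, f"{cosine_lr(step, 10000):.2e}")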



Falcon2-11B Evaluation



English performance

Performance on Open LLM Leaderboard tasks:

Checkpoint GT HellaSwag-10 Winogrande-5 ArcChallenge-25 TruthfulQA-0 MMLU-5 GSM8K-5 Average
Falcon2-11B 5500 82.91 78.30 59.73 52.56 58.37 53.83 64.28
Falcon-40B 1000 85.28 81.29 61.86 41.65 56.89 21.46 58.07
Falcon-7B 1500 78.13 72.38 47.87 34.26 27.79 4.62 44.17
Gemma-7B 6000 82.47 78.45 61.09 44.91 66.03 52.77 64.29
Llama3-8B 15000 82.09 77.35 59.47 43.90 66.69 44.79 62.38
Mistral-7B N/A 83.31 78.37 59.98 42.15 64.16 37.83 60.97

The Hugging Face Leaderboard team provided an official evaluation of our model on the Open LLM Leaderboard tasks. The model performs better than models such as Llama3-8B (trained on three times more data) and Mistral-7B, and on par with Gemma-7B.
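
For reference, the Average column is simply the arithmetic mean of the six task scores, e.g. for the Falcon2-11B row:

# Mean of the six Open LLM Leaderboard task scores for Falcon2-11B.
scores = [82.91, 78.30, 59.73, 52.56, 58.37, 53.83]
print(round(sum(scores) / len(scores), 2))  # 64.28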

Zero-shot performance:

Checkpoint GT HellaSwag ArcEasy Winogrande ArcChallenge
Falcon2-11B 5500 82.07 77.78 78.30 50.17
Falcon-40B 1000 82.82 81.86 76.4 54.69
Falcon-7B 1500 76.31 74.74 67.17 43.43

The evaluation results show that Falcon2-11B achieves performance similar to Falcon-40B, at a four times smaller model size!



Multilingual capabilities

Using the Multilingual LLM leaderboard, we compare the Falcon2-11B model to Llama-7B and Bloom-7B. For reference, we also include Falcon-40B (which supports the same languages), Falcon-7B (which supports French), and Mistral-7B.

Model Language ID ArcChallenge-25 HellaSwag MMLU-25 TruthfulQA Average
Falcon2-11B de 43.7 67.96 38.3 47.53 49.37
es 46.2 73.63 37.9 46.43 51.06
fr 45.8 72.41 39.53 47.30 51.27
it 45.6 70.83 38.05 47.14 50.42
nl 41.7 69.05 38.29 48.81 49.47
ro 42.4 66.24 38.01 45.53 48.04
Falcon-40B de 45.1 68.3 36.2 39.8 47.4
es 48.5 73.9 37.2 39.0 49.6
fr 47.6 72.9 37.3 38.5 49.1
it 46.3 70.2 36.4 40.7 48.4
nl 42.9 68.4 36.5 40.9 47.1
ro 43.2 66.0 35.7 39.8 46.2
Falcon-7B fr 37.3 64.1 28.4 34.0 40.9
Mistral-7B de 41.2 58.7 40.5 44.9 46.3
es 44.2 65.3 42.4 43.1 48.7
fr 44.9 64.4 41.9 43.0 48.6
it 43.2 60.9 39.7 43.1 46.7
nl 40.0 57.9 41.4 43.3 45.7
ro 40.7 53.6 39.3 43.6 44.3
Llama-7B de 35.1 49.9 29.9 38.3 38.3
es 36.8 56.4 30.3 37.0 40.1
fr 37.3 55.7 30.5 39.9 40.9
it 35.8 52.0 29.9 39.6 39.3
nl 33.6 48.7 29.8 40.0 38.0
ro 32.4 44.9 29.7 37.0 36.0
Bloom-7B de 26.3 32.4 28.1 43.7 32.6
es 38.1 56.7 28.9 40.4 41.0
fr 36.7 56.6 29.9 40.9 41.0
it 29.0 40.8 27.6 43.7 35.3
nl 23.1 31.7 27.5 42.7 31.3
ro 26.9 31.8 27.4 46.1 33.1

In the spirit of the original Falcon models, Falcon2-11B was trained not only on English data but also on ten other languages. Our multilingual evaluation results show that the model presents good capabilities in the six languages (de, es, fr, it, nl, ro) featured on the Multilingual LLM Leaderboard, and in fact shows better performance than Falcon-40B and several other multilingual models on all of the cited languages.

We will soon release more extensive evaluation results for multilingual capabilities in the Falcon2-11B model card!



Code generation capabilities

We evaluate the model's code generation performance against the BigCode Leaderboard on the HumanEval benchmark for the Python language, obtaining a pass@1 of 29.59%.
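
For context, pass@k on HumanEval is computed with the unbiased estimator from the HumanEval paper (Chen et al., 2021); a minimal sketch, with n samples per problem of which c pass the unit tests:

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimator: 1 - C(n - c, k) / C(n, k).
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single sample per problem, pass@1 is just the fraction of problems
# solved, which is how the 29.59% figure above should be read.
print(pass_at_k(n=10, c=3, k=1))  # example: 3 of 10 samples pass -> 0.3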



Using Falcon2-11B

from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-11B"

# Load the tokenizer and build a text-generation pipeline running in bfloat16,
# with the model weights spread across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

Then, you can run text generation using code like the following:

sequences = pipeline(
    "Can you explain the concept of Quantum Computing?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")



Falcon2-11B VLM

Falcon2-11B VLM is a vision-language model (VLM) built on top of the LLM that additionally handles image inputs and is capable of answering queries about images. To achieve this, we integrate the pretrained CLIP ViT-L/14 vision encoder with our Falcon2-11B chat-finetuned model and train with image-text data.

To enhance the VLM's perception of fine-grained details with respect to small objects in images, we employ a dynamic high-resolution encoding mechanism for image inputs, similar to LLaVA-Next.
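
Conceptually, dynamic high-resolution encoding splits the input image into a grid of vision-encoder-sized tiles plus a downscaled global view, and encodes each separately. The sketch below only illustrates that idea; the tile size of 336 and the fixed 2x2 grid are assumptions, not the actual Falcon2-11B VLM implementation:

from PIL import Image

TILE = 336  # assumed vision-encoder input resolution

def split_into_tiles(image: Image.Image, grid=(2, 2)):
    # Resize to the target grid, crop it into tiles, then append a
    # downscaled global view of the full image.
    resized = image.resize((TILE * grid[0], TILE * grid[1]))
    tiles = []
    for gy in range(grid[1]):
        for gx in range(grid[0]):
            box = (gx * TILE, gy * TILE, (gx + 1) * TILE, (gy + 1) * TILE)
            tiles.append(resized.crop(box))
    tiles.append(image.resize((TILE, TILE)))
    return tiles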



Training

The training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is kept frozen, and only the multimodal projector is trained on 558K image-caption pairs.
This allows the multimodal projector to learn a mapping from the visual to the text embedding space. During finetuning, both the projector and the LLM weights are trained on a corpus of 1.2M image-text instruction examples from public datasets, which also includes multi-round conversations.
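
A minimal sketch of that freezing schedule in PyTorch (illustrative only; the submodule names follow the LLaVA-style layout in transformers and are assumptions about how the released checkpoint is organized):

import torch
from transformers import LlavaNextForConditionalGeneration

model = LlavaNextForConditionalGeneration.from_pretrained(
    "tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16
)

def set_trainable(module, trainable: bool):
    for p in module.parameters():
        p.requires_grad = trainable

# Stage 1 (pretraining): freeze the vision encoder and the LLM,
# train only the multimodal projector on image-caption pairs.
set_trainable(model.vision_tower, False)
set_trainable(model.language_model, False)
set_trainable(model.multi_modal_projector, True)

# Stage 2 (finetuning): keep the vision encoder frozen,
# unfreeze the LLM together with the projector.
set_trainable(model.language_model, True)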



Falcon2-11B VLM Evaluation

Model MME GQA SQA POPE VQAv2 TextVQA MM-Bench SEED-IMG Average
Falcon2-11B VLM 1589/343 64.5 74.9 88.4 82.1 66.7 72.0 72.3 74.4
LLaVA-1.6 (Vicuna-7B) 1519/332 64.2 70.1 86.5 81.8 64.9 67.4 70.2 72.1
LLaVA-1.6 (Vicuna-13B) 1575/326 65.4 73.6 86.2 82.8 67.1 70.0 71.9 73.8
LLaVA-1.6 (Mistral-7B) 1498/321 64.8 72.8 86.7 82.2 65.7 68.7 72.2 73.3



Using Falcon2-11B-FalconVLM

from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
from PIL import Image
import requests
import torch

processor = LlavaNextProcessor.from_pretrained("tiiuae/falcon-11B-vlm")
model = LlavaNextForConditionalGeneration.from_pretrained("tiiuae/falcon-11B-vlm", torch_dtype=torch.bfloat16)

url = "https://merzougabirding.com/wp-content/uploads/2023/09/falcon-size.jpg"
falcon_image = Image.open(requests.get(url, stream=True).raw)
prompt = "User: nWhat's special about this bird's vision?"

inputs = processor(prompt, images=falcon_image, return_tensors="pt", padding=True).to('cuda:0')

model.to('cuda:0')
output = model.generate(**inputs, max_new_tokens=256)


# Decode the generated sequence; prompt_length can be used to strip the prompt tokens from the output if desired.
prompt_length = inputs['input_ids'].shape[1]
generated_captions = processor.decode(output[0], skip_special_tokens=True).strip()

print(generated_captions)



License information

The Falcon 2 models are made available under the TII Falcon 2 License, a permissive Apache 2.0-based software license which includes an acceptable use policy that promotes the responsible use of AI. This license was crafted in the spirit of TII's commitment to the open-source community.


