Today, we’re excited to welcome TII’s Falcon 180B to Hugging Face! Falcon 180B sets a new state-of-the-art for open models. It is the largest openly available language model, with 180 billion parameters, and was trained on a massive 3.5 trillion tokens using TII’s RefinedWeb dataset. This represents the longest single-epoch pretraining for an open model.
You can find the model on the Hugging Face Hub (base and chat model) and interact with it on the Falcon Chat Demo Space.
In terms of capabilities, Falcon 180B achieves state-of-the-art results across natural language tasks. It topped the leaderboard for (pre-trained) open-access models (at the time of its release) and rivals proprietary models like PaLM-2. While difficult to rank definitively yet, it is considered on par with PaLM 2 Large, making Falcon 180B one of the most capable LLMs publicly known.
In this blog post, we explore what makes Falcon 180B so good through some evaluation results and show how you can use the model.
What is Falcon 180B?
Falcon 180B is a model released by TII that follows previous releases in the Falcon family.
Architecture-wise, Falcon 180B is a scaled-up version of Falcon 40B and builds on its innovations such as multiquery attention for improved scalability. We recommend reviewing the initial blog post introducing Falcon to dive into the architecture. Falcon 180B was trained on 3.5 trillion tokens on up to 4096 GPUs simultaneously, using Amazon SageMaker, for a total of ~7,000,000 GPU hours. This means Falcon 180B is 2.5 times larger than Llama 2 and was trained with 4x more compute.
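As a rough back-of-the-envelope check of those ratios (a sketch that assumes training compute is simply proportional to parameters times tokens, ignoring architectural differences):

# Rough sanity check of the size and compute ratios quoted above.
falcon_params, falcon_tokens = 180e9, 3.5e12
llama2_params, llama2_tokens = 70e9, 2.0e12

size_ratio = falcon_params / llama2_params  # ~2.6x larger
compute_ratio = (falcon_params * falcon_tokens) / (llama2_params * llama2_tokens)  # ~4.5x

# Roughly matches the "2.5x larger, 4x more compute" figures quoted above.
print(f"{size_ratio:.1f}x larger, {compute_ratio:.1f}x more training compute")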
The dataset for Falcon 180B consists predominantly of web data from RefinedWeb (~85%). In addition, it has been trained on a mix of curated data such as conversations, technical papers, and a small fraction of code (~3%). This pretraining dataset is big enough that even 3.5 trillion tokens constitute less than an epoch.
The released chat model is fine-tuned on chat and instruction datasets with a combination of several large-scale conversational datasets.
‼️ Commercial use:
Falcon 180B can be used commercially, but under very restrictive conditions, excluding any “hosting use”. We recommend checking the license and consulting your legal team if you are interested in using it for commercial purposes.
How good is Falcon 180B?
Falcon 180B was the best openly released LLM at the time of its release, outperforming Llama 2 70B and OpenAI’s GPT-3.5 on MMLU, and is on par with Google’s PaLM 2-Large on HellaSwag, LAMBADA, WebQuestions, Winogrande, PIQA, ARC, BoolQ, CB, COPA, RTE, WiC, WSC, and ReCoRD. Falcon 180B typically sits somewhere between GPT 3.5 and GPT-4 depending on the evaluation benchmark, and further finetuning from the community will be very interesting to follow now that it’s openly released.
With 68.74 on the Hugging Face Leaderboard at the time of release, Falcon 180B was the highest-scoring openly released pre-trained LLM, surpassing Meta’s Llama 2.*
| Model | Size | Leaderboard score | Commercial use or license | Pretraining length [tokens] |
|---|---|---|---|---|
| Falcon | 180B | 67.85 | 🟠 | 3,500B |
| Llama 2 | 70B | 67.87 | 🟠 | 2,000B |
| LLaMA | 65B | 61.19 | 🔴 | 1,400B |
| Falcon | 40B | 58.07 | 🟢 | 1,000B |
| MPT | 30B | 52.77 | 🟢 | 1,000B |
*The Open LLM Leaderboard added two new benchmarks in November 2023, and we updated the table above to reflect the latest score (67.85). Falcon is on par with Llama 2 70B according to the new methodology.
The quantized Falcon models preserve similar metrics across benchmarks. The results were similar when evaluating torch.float16, 8-bit, and 4-bit. See the results in the Open LLM Leaderboard.
How to use Falcon 180B?
Falcon 180B is available in the Hugging Face ecosystem, starting with Transformers version 4.33.
Demo
You can easily try the Big Falcon Model (180 billion parameters!) in this Space or in the playground embedded below:
Hardware requirements
We ran several tests on the hardware needed to run the model for different use cases. Those are not the minimum numbers, but the minimum numbers for the configurations we had access to.
| Model | Type | Kind | Memory | Example |
|---|---|---|---|---|
| Falcon 180B | Training | Full fine-tuning | 5120GB | 8x 8x A100 80GB |
| Falcon 180B | Training | LoRA with ZeRO-3 | 1280GB | 2x 8x A100 80GB |
| Falcon 180B | Training | QLoRA | 160GB | 2x A100 80GB |
| Falcon 180B | Inference | BF16/FP16 | 640GB | 8x A100 80GB |
| Falcon 180B | Inference | GPTQ/int4 | 320GB | 8x A100 40GB |
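As a rough intuition for these figures, the weights alone take the parameter count times the bytes per parameter; the table’s numbers are higher because they reflect full multi-GPU configurations with room for activations, KV cache, optimizer state, and framework overhead (a back-of-the-envelope sketch, not an exact accounting):

# Back-of-the-envelope memory for the 180B weights alone at different precisions.
# Real requirements are higher: optimizer states for training, KV cache and
# activation buffers for inference, plus framework overhead.
params = 180e9
for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")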
Prompt format
The base model has no prompt format. Keep in mind that it’s not a conversational model and wasn’t trained with instructions, so don’t expect it to generate conversational responses. The pretrained model is a great platform for further finetuning, but you probably shouldn’t use it directly out of the box. The Chat model has a very simple conversation structure.
System: Add an optional system prompt here
User: This is the user input
Falcon: This is what the model generates
User: This might be a second turn input
Falcon: and so on
Transformers
With the release of Transformers 4.33, you can use Falcon 180B and leverage all the tools in the HF ecosystem, such as:
- training and inference scripts and examples
- safe file format (safetensors)
- integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and GPTQ
- assisted generation (also known as “speculative decoding”; see the sketch after this list)
- RoPE scaling support for larger context lengths
- rich and powerful generation parameters
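As an illustration of assisted generation, here is a minimal sketch that assumes a smaller Falcon checkpoint sharing the same tokenizer (tiiuae/falcon-7b is used here as an example draft model; verify compatibility for your setup):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Assumption: the assistant (draft) model shares Falcon 180B's tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-180B")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", torch_dtype=torch.bfloat16, device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The three main rules of open source are", return_tensors="pt").to("cuda")
# The small model drafts tokens; the large model verifies them in parallel.
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    assistant_model=assistant,
    max_new_tokens=50,
)
print(tokenizer.decode(output[0]))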
Use of the model requires you to accept its license and terms of use. Please make sure you are logged into your Hugging Face account and ensure you have the latest version of transformers:
pip install --upgrade transformers
huggingface-cli login
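If you’d rather authenticate from a notebook or script instead of the CLI, a small sketch using the huggingface_hub login helper works as well (the token placeholder is yours to replace):

from huggingface_hub import login

# Prompts interactively for a token; alternatively pass login(token="hf_...")
login()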
bfloat16
This is how you’d use the base model in bfloat16. Falcon 180B is a big model, so please take into account the hardware requirements summarized in the table above.
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "tiiuae/falcon-180B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in bfloat16 and shard them across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "My name is Pedro, I live in"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 50 new tokens from the base model.
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
This could produce an output such as:
My name is Pedro, I live in Portugal and I'm 25 years old. I'm a graphic designer, but I'm also passionate about photography and video.
I love to travel and I'm always looking for new adventures. I love to meet new people and explore new places.
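If you prefer the higher-level pipeline API, the same generation can be expressed more compactly (a minimal sketch; the sampling parameters mirror the example above):

import torch
import transformers

# Build a text-generation pipeline; device_map="auto" shards the model across GPUs.
pipe = transformers.pipeline(
    "text-generation",
    model="tiiuae/falcon-180B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = pipe(
    "My name is Pedro, I live in",
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
print(result[0]["generated_text"])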
8-bit and 4-bit with bitsandbytes
The 8-bit and 4-bit quantized versions of Falcon 180B show almost no difference in evaluation with respect to the bfloat16 reference! This is great news for inference, as you can confidently use a quantized version to reduce hardware requirements. Keep in mind, though, that 8-bit inference is much faster than running the model in 4-bit.
To use quantization, you need to install the bitsandbytes library and simply enable the corresponding flag when loading the model:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    device_map="auto",
)
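For 4-bit, one possible approach (a minimal sketch using BitsAndBytesConfig, keeping the 4-bit computation in bfloat16) looks like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit weight quantization; matmuls are computed in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)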
Chat Model
As mentioned above, the version of the model fine-tuned to follow conversations used a very straightforward training template. We have to follow the same pattern in order to run chat-style inference. For reference, you can take a look at the format_prompt function in the Chat demo, which looks like this:
def format_prompt(message, history, system_prompt):
    prompt = ""
    if system_prompt:
        prompt += f"System: {system_prompt}\n"
    for user_prompt, bot_response in history:
        prompt += f"User: {user_prompt}\n"
        prompt += f"Falcon: {bot_response}\n"
    prompt += f"User: {message}\nFalcon:"
    return prompt
As you can see, interactions from the user and responses by the model are preceded by User: and Falcon: separators. We concatenate them together to form a prompt containing the conversation’s whole history. We can provide a system prompt to tweak the generation style.
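Putting it together, here is a minimal sketch of chat-style generation with the fine-tuned checkpoint (assumed here to be tiiuae/falcon-180B-chat), reusing format_prompt from above; the question and system prompt are just illustrations:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

chat_model_id = "tiiuae/falcon-180B-chat"

tokenizer = AutoTokenizer.from_pretrained(chat_model_id)
model = AutoModelForCausalLM.from_pretrained(
    chat_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# No previous turns yet; optionally steer the style with a system prompt.
prompt = format_prompt(
    "Can you suggest three places to visit in Portugal?",
    history=[],
    system_prompt="You are a helpful assistant.",
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=100,
)
# Strip the prompt tokens so only Falcon's reply is printed.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))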
Additional Resources
Acknowledgments
Releasing such a model with support and evaluations in the ecosystem would not be possible without the contributions of many community members, including Clémentine and Eleuther Evaluation Harness for LLM evaluations; Loubna and BigCode for code evaluations; Nicolas for Inference support; Lysandre, Matt, Daniel, Amy, Joao, and Arthur for integrating Falcon into transformers. Thanks to Baptiste and Patrick for the open-source demo. Thanks to Thom, Lewis, TheBloke, Nouamane, and Tim Dettmers for multiple contributions enabling this to get out. Finally, thanks to the HF Cluster for enabling running LLM evaluations as well as providing inference for a free, open-source demo of the model.


