Today, we’re excited to welcome TII’s Falcon 180B to Hugging Face! Falcon 180B sets a new state-of-the-art for open models. It is the largest openly available language model, with 180 billion parameters, and was trained on a massive 3.5 trillion tokens using TII’s RefinedWeb dataset. This represents the longest single-epoch pretraining for an open model.
You can find the model on the Hugging Face Hub (base and chat model) and interact with it on the Falcon Chat Demo Space.
In terms of capabilities, Falcon 180B achieves state-of-the-art results across natural language tasks. It topped the leaderboard for (pre-trained) open-access models (at the time of its release) and rivals proprietary models like PaLM-2. While difficult to rank definitively yet, it is considered on par with PaLM 2 Large, making Falcon 180B one of the most capable LLMs publicly known.
In this blog post, we explore what makes Falcon 180B so good through some evaluation results and show how you can use the model.
What is Falcon 180B?
Falcon 180B is a model released by TII that follows previous releases in the Falcon family.
Architecture-wise, Falcon 180B is a scaled-up version of Falcon 40B and builds on its innovations such as multiquery attention for improved scalability. We recommend reviewing the initial blog post introducing Falcon to dive into the architecture. Falcon 180B was trained on 3.5 trillion tokens on up to 4096 GPUs simultaneously, using Amazon SageMaker, for a total of ~7,000,000 GPU hours. This means Falcon 180B is 2.5 times larger than Llama 2 and was trained with 4x more compute.
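As a rough back-of-the-envelope check of those ratios (a sketch that assumes training compute is simply proportional to parameters times tokens, ignoring architectural differences):

# Rough sanity check of the size and compute ratios quoted above.
falcon_params, falcon_tokens = 180e9, 3.5e12
llama2_params, llama2_tokens = 70e9, 2.0e12

size_ratio = falcon_params / llama2_params  # ~2.6x larger
compute_ratio = (falcon_params * falcon_tokens) / (llama2_params * llama2_tokens)  # ~4.5x

# Roughly matches the "2.5x larger, 4x more compute" figures quoted above.
print(f"{size_ratio:.1f}x larger, {compute_ratio:.1f}x more training compute")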
The dataset for Falcon 180B consists predominantly of web data from RefinedWeb (~85%). In addition, it has been trained on a mix of curated data such as conversations, technical papers, and a small fraction of code (~3%). This pretraining dataset is big enough that even 3.5 trillion tokens constitute less than an epoch.
The released chat model is fine-tuned on chat and instruction datasets with a combination of several large-scale conversational datasets.
‼️ Commercial use:
Falcon 180B can be used commercially, but under very restrictive conditions, excluding any “hosting use”. We recommend checking the license and consulting your legal team if you are interested in using it for commercial purposes.
How good is Falcon 180B?
Falcon 180B was the best openly released LLM at the time of its release, outperforming Llama 2 70B and OpenAI’s GPT-3.5 on MMLU, and is on par with Google’s PaLM 2-Large on HellaSwag, LAMBADA, WebQuestions, Winogrande, PIQA, ARC, BoolQ, CB, COPA, RTE, WiC, WSC, and ReCoRD. Falcon 180B typically sits somewhere between GPT 3.5 and GPT-4 depending on the evaluation benchmark, and further finetuning from the community will be very interesting to follow now that it’s openly released.
With 68.74 on the Hugging Face Leaderboard at the time of release, Falcon 180B was the highest-scoring openly released pre-trained LLM, surpassing Meta’s Llama 2.*
| Model | Size | Leaderboard score | Commercial use or license | Pretraining length [tokens] |
|---|---|---|---|---|
| Falcon | 180B | 67.85 | 🟠 | 3,500B |
| Llama 2 | 70B | 67.87 | 🟠 | 2,000B |
| LLaMA | 65B | 61.19 | 🔴 | 1,400B |
| Falcon | 40B | 58.07 | 🟢 | 1,000B |
| MPT | 30B | 52.77 | 🟢 | 1,000B |
*The Open LLM Leaderboard added two new benchmarks in November 2023, and we updated the table above to reflect the latest score (67.85). Falcon is on par with Llama 2 70B according to the new methodology.
The quantized Falcon models preserve similar metrics across benchmarks. The results were similar when evaluating torch.float16, 8-bit, and 4-bit. See the results in the Open LLM Leaderboard.
How to use Falcon 180B?
Falcon 180B is available in the Hugging Face ecosystem, starting with Transformers version 4.33.
Demo
You can easily try the Big Falcon Model (180 billion parameters!) in this Space or in the playground embedded below:
Hardware requirements
We ran several tests on the hardware needed to run the model for different use cases. Those are not the minimum numbers, but the minimum numbers for the configurations we had access to.
| Model | Type | Kind | Memory | Example |
|---|---|---|---|---|
| Falcon 180B | Training | Full fine-tuning | 5120GB | 8x 8x A100 80GB |
| Falcon 180B | Training | LoRA with ZeRO-3 | 1280GB | 2x 8x A100 80GB |
| Falcon 180B | Training | QLoRA | 160GB | 2x A100 80GB |
| Falcon 180B | Inference | BF16/FP16 | 640GB | 8x A100 80GB |
| Falcon 180B | Inference | GPTQ/int4 | 320GB | 8x A100 40GB |
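As a rough intuition for these figures, the weights alone take the parameter count times the bytes per parameter; the table’s numbers are higher because they reflect full multi-GPU configurations with room for activations, KV cache, optimizer state, and framework overhead (a back-of-the-envelope sketch, not an exact accounting):

# Back-of-the-envelope memory for the 180B weights alone at different precisions.
# Real requirements are higher: optimizer states for training, KV cache and
# activation buffers for inference, plus framework overhead.
params = 180e9
for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")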
Prompt format
The base model has no prompt format. Keep in mind that it’s not a conversational model and wasn’t trained with instructions, so don’t expect it to generate conversational responses. The pretrained model is a great platform for further finetuning, but you probably shouldn’t use it directly out of the box. The Chat model has a very simple conversation structure.
System: Add an optional system prompt here
User: This is the user input
Falcon: This is what the model generates
User: This might be a second turn input
Falcon: and so on
Transformers
With the release of Transformers 4.33, you can use Falcon 180B and leverage all the tools in the HF ecosystem, such as:
- training and inference scripts and examples
- safe file format (safetensors)
- integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and GPTQ
- assisted generation (also known as “speculative decoding”; see the sketch after this list)
- RoPE scaling support for larger context lengths
- rich and powerful generation parameters
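As an illustration of assisted generation, here is a minimal sketch that assumes a smaller Falcon checkpoint sharing the same tokenizer (tiiuae/falcon-7b is used here as an example draft model; verify compatibility for your setup):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Assumption: the assistant (draft) model shares Falcon 180B's tokenizer/vocabulary.
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-180B")
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", torch_dtype=torch.bfloat16, device_map="auto"
)
assistant = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("The three main rules of open source are", return_tensors="pt").to("cuda")
# The small model drafts tokens; the large model verifies them in parallel.
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    assistant_model=assistant,
    max_new_tokens=50,
)
print(tokenizer.decode(output[0]))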
Use of the model requires you to accept its license and terms of use. Please make sure you are logged into your Hugging Face account and ensure you have the latest version of transformers:
pip install --upgrade transformers
huggingface-cli login
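If you’d rather authenticate from a notebook or script instead of the CLI, a small sketch using the huggingface_hub login helper works as well (the token placeholder is yours to replace):

from huggingface_hub import login

# Prompts interactively for a token; alternatively pass login(token="hf_...")
login()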
bfloat16
This is how you’d use the base model in bfloat16. Falcon 180B is a big model, so please take into account the hardware requirements summarized in the table above.
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "tiiuae/falcon-180B"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the weights in bfloat16 and shard them across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "My name is Pedro, I live in"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 50 new tokens from the base model.
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))
This could produce an output such as:
My name is Pedro, I live in Portugal and I'm 25 years old. I'm a graphic designer, but I'm also passionate about photography and video.
I love to travel and I'm always looking for new adventures. I love to meet new people and explore new places.
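If you prefer the higher-level pipeline API, the same generation can be expressed more compactly (a minimal sketch; the sampling parameters mirror the example above):

import torch
import transformers

# Build a text-generation pipeline; device_map="auto" shards the model across GPUs.
pipe = transformers.pipeline(
    "text-generation",
    model="tiiuae/falcon-180B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
result = pipe(
    "My name is Pedro, I live in",
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=50,
)
print(result[0]["generated_text"])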
8-bit and 4-bit with bitsandbytes
The 8-bit and 4-bit quantized versions of Falcon 180B show almost no difference in evaluation with respect to the bfloat16 reference! This is great news for inference, as you can confidently use a quantized version to reduce hardware requirements. Keep in mind, though, that 8-bit inference is much faster than running the model in 4-bit.
To use quantization, you need to install the bitsandbytes library and simply enable the corresponding flag when loading the model:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    load_in_8bit=True,
    device_map="auto",
)
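For 4-bit, one possible approach (a minimal sketch using BitsAndBytesConfig, keeping the 4-bit computation in bfloat16) looks like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit weight quantization; matmuls are computed in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)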
Chat Model
As mentioned above, the version of the model fine-tuned to follow conversations used a very straightforward training template. We have to follow the same pattern in order to run chat-style inference. For reference, you can take a look at the format_prompt function in the Chat demo, which looks like this:
def format_prompt(message, history, system_prompt):
    prompt = ""
    if system_prompt:
        prompt += f"System: {system_prompt}\n"
    for user_prompt, bot_response in history:
        prompt += f"User: {user_prompt}\n"
        prompt += f"Falcon: {bot_response}\n"
    prompt += f"User: {message}\nFalcon:"
    return prompt
As you can see, interactions from the user and responses by the model are preceded by User: and Falcon: separators. We concatenate them together to form a prompt containing the conversation’s whole history. We can provide a system prompt to tweak the generation style.
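Putting it together, here is a minimal sketch of chat-style generation with the fine-tuned checkpoint (assumed here to be tiiuae/falcon-180B-chat), reusing format_prompt from above; the question and system prompt are just illustrations:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

chat_model_id = "tiiuae/falcon-180B-chat"

tokenizer = AutoTokenizer.from_pretrained(chat_model_id)
model = AutoModelForCausalLM.from_pretrained(
    chat_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# No previous turns yet; optionally steer the style with a system prompt.
prompt = format_prompt(
    "Can you suggest three places to visit in Portugal?",
    history=[],
    system_prompt="You are a helpful assistant.",
)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    max_new_tokens=100,
)
# Strip the prompt tokens so only Falcon's reply is printed.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))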
Additional Resources
Acknowledgments
Releasing such a model with support and evaluations in the ecosystem would not be possible without the contributions of many community members, including Clémentine and Eleuther Evaluation Harness for LLM evaluations; Loubna and BigCode for code evaluations; Nicolas for Inference support; Lysandre, Matt, Daniel, Amy, Joao, and Arthur for integrating Falcon into transformers. Thanks to Baptiste and Patrick for the open-source demo. Thanks to Thom, Lewis, TheBloke, Nouamane, and Tim Dettmers for multiple contributions enabling this to get out. Finally, thanks to the HF Cluster for enabling running LLM evaluations as well as providing inference for a free, open-source demo of the model.


