Google released Gemma 2, the latest addition to its family of state-of-the-art open LLMs, and we’re excited to collaborate with Google to ensure the best integration within the Hugging Face ecosystem. You can find the 4 open-weight models (2 base models & 2 fine-tuned ones) on the Hub. Among the features and integrations being released, we have:
Table of contents
What’s Gemma 2?
Gemma 2 is Google’s latest iteration of open LLMs. It comes in two sizes, 9 billion and 27 billion parameters, with base (pre-trained) and instruction-tuned versions. Gemma is based on Google DeepMind Gemini and has a context length of 8K tokens:
The Gemma 2 models were trained on ~2x more data than their first iteration, totaling 13 trillion tokens for the 27B version and 8 trillion tokens for the 9B version of web data (primarily English), code, and math. We don’t know the precise details of the training mix, and we can only guess that bigger and more careful data curation was a big factor in the improved performance.
Gemma 2 comes with the same license as the first iteration, which is a permissive license that allows redistribution, fine-tuning, commercial use, and derivative works.
Technical advances in Gemma 2
Gemma 2 has many similarities with the first iteration. It has a context length of 8192 tokens and uses Rotary Position Embedding (RoPE). There are four main advances in Gemma 2 compared to the original Gemma:
- Sliding window attention: Interleaves sliding window and full-quadratic attention for quality generation.
- Logit soft-capping: Prevents logits from growing excessively by scaling them to a fixed range, improving training.
- Knowledge Distillation: Leverages a larger teacher model to train a smaller model (for the 9B model).
- Model Merging: Combines two or more LLMs into a single new model.
Gemma 2 was trained on Google Cloud TPU (27B on v5p, 9B on TPU v4) using JAX and ML Pathways. Gemma 2 Instruct has been optimized for dialogue applications and trained on a mix of synthetic and human-generated prompt-response pairs using Supervised Fine-Tuning (SFT), distillation from a larger model, Reinforcement Learning from Human Feedback (RLHF) using a reward model oriented more towards conversational capabilities, and model merging using WARP to improve overall performance.
As with the pre-training mix, no details about the fine-tuning datasets or the hyperparameters used for SFT and RLHF have been shared.
Sliding window attention
Sliding window attention is a technique to reduce the memory and time requirements of the attention computations in transformer models and has been used in models such as Mistral. The novelty of Gemma 2 is that a sliding window is applied to every other layer (local – 4096 tokens), while the layers in between still use full quadratic global attention (8192 tokens). We suppose this is a way to increase quality in long-context situations (half of the layers still attend to all tokens) while partially benefiting from the advantages of sliding attention.
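To make the idea concrete, here is a minimal sketch (our own illustration, not Gemma 2’s actual implementation) of how an interleaved local/global causal mask could be built in PyTorch; the `window_size` value and the choice of which layers are local are assumptions for the example:

```python
import torch

def build_attention_mask(seq_len: int, layer_idx: int, window_size: int = 4096) -> torch.Tensor:
    """Boolean causal mask: even layers attend locally (sliding window), odd layers globally."""
    positions = torch.arange(seq_len)
    # Standard causal constraint: token i may attend to tokens j <= i
    causal = positions[None, :] <= positions[:, None]
    if layer_idx % 2 == 0:
        # Local layer: additionally require that j lies within the last `window_size` tokens
        local = (positions[:, None] - positions[None, :]) < window_size
        return causal & local
    return causal  # global layer: full quadratic causal attention

# A local layer at 8K context cannot see tokens more than 4096 positions back
mask = build_attention_mask(seq_len=8192, layer_idx=0)
print(mask[8000, 3000].item())  # False: position 8000 cannot attend to position 3000
```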
Soft-capping and attention implementations
Soft-capping is a technique that prevents logits from growing excessively large without truncating them. It works by dividing the logits by a maximum value threshold (soft_cap), then passing them through a tanh layer (ensuring they are in the (-1, 1) range), and finally multiplying by the threshold again. This guarantees that the final values will be in the (-soft_cap, +soft_cap) interval without losing much information while stabilizing the training.
Putting it all together, the logits are calculated by: logits ← soft_cap ∗ tanh(logits / soft_cap)
Gemma 2 employs soft-capping for the final layer and for every attention layer. The attention logits are capped at 50.0, and the final logits at 30.0.
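For reference, here is a minimal sketch of the capping formula above in PyTorch (the `soft_cap` helper and the tensor shapes are ours, for illustration only):

```python
import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    """Squash logits into the (-cap, +cap) interval: cap * tanh(logits / cap)."""
    return cap * torch.tanh(logits / cap)

attn_logits = torch.randn(4, 128) * 100                        # pretend attention logits
capped_attn = soft_cap(attn_logits, cap=50.0)                  # attention logits capped at 50.0
final_logits = soft_cap(torch.randn(4, 256) * 100, cap=30.0)   # final logits capped at 30.0
print(capped_attn.abs().max() < 50.0, final_logits.abs().max() < 30.0)  # tensor(True) tensor(True)
```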
At the time of release, soft-capping is incompatible with Flash Attention / SDPA, but they can still be used in inference for maximum efficiency. The Gemma 2 team observed very minor differences when soft-capping is removed during inference.
Note: For stable fine-tuning runs, you still need to enable soft-capping, and hence we recommend fine-tuning with eager attention instead of SDPA.
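For example, when loading the model for fine-tuning you can request eager attention explicitly; a minimal sketch (the model id and dtype are just illustrative choices):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-9b-it",
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # keeps soft-capping active during training; avoid SDPA / Flash Attention here
)
```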
Knowledge Distillation
Knowledge distillation is a popular technique for training a smaller student model to mimic the behavior of a larger but better-performing teacher. This works by augmenting the next-token prediction task of LLMs with a distribution of token probabilities from the teacher (e.g., GPT-4, Claude, or Gemini), which provides a richer signal for the student to learn from.
According to the Gemma 2 tech report, knowledge distillation was used to pre-train the 9B model, while the 27B model was pre-trained from scratch.
For post-training, the Gemma 2 team generated a diverse set of completions from a teacher (unspecified in the report, but presumably Gemini Ultra), and then trained the student models on this synthetic data with SFT. This is the basis of many open models, such as Zephyr and OpenHermes, which are trained entirely on synthetic data from larger LLMs.
Although effective, this method has drawbacks, since the capability mismatch between student and teacher can lead to a train-inference mismatch, where the text generated by the student during inference is out-of-distribution compared to that seen during training.
To handle this issue, the Gemma 2 team used “on-policy distillation”, where the student generates completions from the SFT prompts. These completions are then used to compute the KL divergence between the teacher’s and student’s logits. By minimizing the KL divergence throughout training, the student learns to model the behavior of the teacher accurately while also minimizing the train-inference mismatch.
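To illustrate the objective (a simplified sketch of our own, not the Gemma 2 training code; the temperature and shapes are assumptions), the distillation loss is essentially a KL divergence between teacher and student distributions over tokens that the student itself generated:

```python
import torch
import torch.nn.functional as F

def on_policy_distillation_loss(
    student_logits: torch.Tensor,  # (batch, seq_len, vocab) on student-generated completions
    teacher_logits: torch.Tensor,  # (batch, seq_len, vocab) on the same completions
    temperature: float = 1.0,
) -> torch.Tensor:
    """KL(teacher || student), averaged over the batch."""
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean")

# Toy example: 2 sequences of 8 tokens with a 32-token vocabulary
loss = on_policy_distillation_loss(torch.randn(2, 8, 32), torch.randn(2, 8, 32))
print(loss)  # in training, only the student receives gradients
```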
This approach is quite interesting, as we’ve seen in the community that on-policy methods like online DPO produce stronger models, and one advantage of on-policy distillation is that you only need the logits from the teacher, so you don’t have to rely on reward models or LLM-as-a-judge to improve the model. It will be exciting to see if this method becomes more popular among fine-tuners in the coming months!
Model Merging
Model merging is a technique that combines two or more LLMs into a single new model. It is relatively new and experimental and can be used without accelerators. Mergekit is a popular open-source toolkit for merging LLMs. It implements linear, SLERP, TIES, DARE, and other merging techniques.
According to the Technical Report, Gemma 2 used WARP, a new merging technique that merges models in three distinct stages:
- Exponential Moving Average (EMA): This is applied during the reinforcement learning (RL) fine-tuning process.
- Spherical Linear intERPolation (SLERP): This is applied after the RL fine-tuning of multiple policies (see the sketch after this list).
- Linear Interpolation Towards Initialization (LITI): This stage is applied after the SLERP stage.
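As an illustration of the SLERP stage, here is a toy sketch of spherical linear interpolation between two weight tensors; real merging tools (e.g., Mergekit or the WARP procedure itself) work per-parameter across entire checkpoints and handle edge cases this simplified version ignores:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors, treated as flat vectors."""
    a, b = w_a.flatten(), w_b.flatten()
    # Angle between the two (normalized) weight vectors
    omega = torch.acos(torch.clamp(torch.dot(a / a.norm(), b / b.norm()), -1.0, 1.0))
    if omega.abs() < eps:
        return (1 - t) * w_a + t * w_b  # nearly parallel: fall back to linear interpolation
    so = torch.sin(omega)
    merged = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return merged.reshape(w_a.shape)

# Toy example: merge the same 4x4 weight from two "checkpoints" at the midpoint
merged_weight = slerp(torch.randn(4, 4), torch.randn(4, 4), t=0.5)
print(merged_weight.shape)  # torch.Size([4, 4])
```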
Gemma 2 evaluation
How good are the Gemma models? Below are performance comparisons to other open models based on the Technical Report and the new version of the Open LLM Leaderboard.
Technical Report results
The Gemma 2 Technical Report compares the performance of different open LLMs on the previous Open LLM Leaderboard benchmarks.
| Benchmark | Llama 3 (70B) | Qwen 1.5 (32B) | Gemma 2 (27B) |
|---|---|---|---|
| MMLU | 79.2 | 74.3 | 75.2 |
| GSM8K | 76.9 | 61.1 | 75.1 |
| ARC-c | 68.8 | 63.6 | 71.4 |
| HellaSwag | 88.0 | 85.0 | 86.4 |
| Winogrande | 85.3 | 81.5 | 83.7 |
The Report also compares the performance of Small Language Models.
| Benchmark | Mistral (7B) | Llama 3 (8B) | Gemma (8B) | Gemma 2 (9B) |
|---|---|---|---|---|
| MMLU | 62.5 | 66.6 | 64.4 | 71.3 |
| GSM8K | 34.5 | 45.7 | 50.9 | 62.3 |
| ARC-C | 60.5 | 59.2 | 61.1 | 68.4 |
| HellaSwag | 83.0 | 82.0 | 82.3 | 81.9 |
| Winogrande | 78.5 | 78.5 | 79.0 | 80.6 |
Open LLM Leaderboard results
Note: We are currently evaluating Google Gemma 2 individually on the new Open LLM Leaderboard benchmark and will update this section later today.
How to prompt Gemma 2
The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. The Instruct versions have a very simple conversation structure:
```
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>
```
This format must be exactly reproduced for effective use. We’ll later show how easy it is to reproduce the instruct prompt with the chat template available in transformers.
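For reference, here is a quick sketch of rendering that format with the tokenizer’s chat template (the messages reuse the example above; the control tokens come from the tokenizer itself):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
messages = [
    {"role": "user", "content": "knock knock"},
    {"role": "assistant", "content": "who is there"},
    {"role": "user", "content": "LaMDA"},
]
# add_generation_prompt appends the opening of the next model turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```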
Demo
You can chat with the Gemma 2 27B Instruct model on Hugging Chat! Check out the link here: https://huggingface.co/chat/models/google/gemma-2-27b-it.
Using Hugging Face Transformers
With Transformers release 4.42, you can use Gemma 2 and leverage all the tools within the Hugging Face ecosystem. To use Gemma models with transformers, make sure to use the latest transformers release:
```bash
pip install "transformers>=4.42.3" --upgrade
```
The following snippet shows how to use gemma-2-9b-it with transformers. It requires about 18 GB of RAM, which fits many consumer GPUs. The same snippet works for gemma-2-27b-it, which, at 56 GB of RAM, makes it a very interesting model for production use cases. Memory consumption can be further reduced by loading in 8-bit or 4-bit mode.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="google/gemma-2-9b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(
    messages,
    max_new_tokens=256,
    do_sample=False,
)
assistant_response = outputs[0]["generated_text"][-1]["content"]
print(assistant_response)
```
Ahoy, matey! I be a humble ship o’ words, sailin’ the digital seas. They call me Gemma, a creation o’ the fine folks at Google DeepMind. I be trained on a treasure trove o’ texts, learnin’ to speak and write like a true scallywag.
Ask me yer questions, and I’ll do me best to answer ’em, aye! 🦜📚
We used bfloat16 because that’s the reference precision for the instruction-tuned model. Running in float16 may be faster on your hardware, and results should be similar on the 9B model. Do note, however, that the 27B instruction-tuned model produces erratic outputs when using float16: you must use bfloat16 for that model.
You can also automatically quantize the model, loading it in 8-bit or even 4-bit mode. 4-bit loading of the big 27B version takes about 18 GB of memory to run, making it compatible with a lot of consumer cards and GPUs in Google Colab. This is how you’d load the generation pipeline in 4-bit:
```python
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-27b-it",
    model_kwargs={
        "torch_dtype": torch.bfloat16,
        "quantization_config": {"load_in_4bit": True},
    },
)
```
For more details on using the models with transformers, please check the model cards.
Integration with Google Cloud
Note: We are currently working on adding new containers to GKE and Vertex AI to run Google Gemma 2 efficiently. We will update this section as soon as the containers are available.
Fine-tuning with 🤗 TRL
Training LLMs can be technically and computationally challenging. In this section, we look at the tools available in the Hugging Face ecosystem to efficiently train Gemma 2 on consumer-size GPUs.
An example command to fine-tune Gemma 2 on OpenAssistant’s chat dataset can be found below. We use 4-bit quantization and QLoRA to conserve memory and target all the attention blocks’ linear layers. Note that, unlike dense transformers, one should not target the MLP layers, as they are sparse and don’t interact well with PEFT.
First, install the nightly version of 🤗 TRL and clone the repo to access the training script:
```bash
pip install "transformers>=4.42.3" --upgrade
pip install --upgrade bitsandbytes
pip install --upgrade peft
pip install git+https://github.com/huggingface/trl
git clone https://github.com/huggingface/trl
cd trl
```
Then you can run the script:
```bash
python \
    examples/scripts/sft.py \
    --model_name google/gemma-2-27b \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-4 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --lora_r 16 --lora_alpha 32 \
    --lora_target_modules q_proj k_proj v_proj o_proj \
    --load_in_4bit \
    --use_peft \
    --attn_implementation eager \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir models/gemma2
```
If you have more GPUs to spare, you can run training with DeepSpeed and ZeRO Stage 3:
```bash
accelerate launch --config_file=examples/accelerate_configs/deepspeed_zero3.yaml \
    examples/scripts/sft.py \
    --model_name google/gemma-2-27b \
    --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
    --dataset_text_field="text" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --learning_rate 2e-5 \
    --report_to wandb \
    --bf16 \
    --max_seq_length 1024 \
    --attn_implementation eager \
    --logging_steps=10 \
    --gradient_checkpointing \
    --output_dir models/gemma2
```
Integration with Inference Endpoints
You can deploy Gemma 2 on Hugging Face’s Inference Endpoints using Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.
To deploy a Gemma 2 model, go to the model page and click on the Deploy -> Inference Endpoints widget. Inference Endpoints supports the OpenAI-compatible Messages API, which allows you to switch from another closed model to an open one by simply changing the URL.
```python
from openai import OpenAI

# initialize the client, pointing it at your Inference Endpoint
client = OpenAI(
    base_url="<ENDPOINT_URL>" + "/v1/",  # replace with your endpoint URL
    api_key="<HF_API_TOKEN>",            # replace with your token
)

chat_completion = client.chat.completions.create(
    model="tgi",
    messages=[
        {"role": "user", "content": "Why is open-source software important?"},
    ],
    stream=True,
    max_tokens=500,
)

for message in chat_completion:
    print(message.choices[0].delta.content, end="")
```
Additional Resources
Acknowledgments
Releasing such models with support and evaluations in the ecosystem would not be possible without the contributions of many community members, including Clémentine and Nathan for LLM evaluations; Nicolas for Text Generation Inference support; Arthur, Sanchit, Joao, and Lysandre for integrating Gemma 2 into transformers; and Nathan and Victor for making Gemma 2 available in Hugging Chat.
And thanks to the Google team for releasing Gemma 2 and making it available to the open-source AI community!
