One month after the release of Gemma 2, Google has expanded their set of Gemma models to include the following new additions:
- Gemma 2 2B – The 2.6B parameter version of Gemma 2, making it a great candidate for on-device use.
- ShieldGemma – A series of safety classifiers, trained on top of Gemma 2, for developers to filter inputs and outputs of their applications.
- Gemma Scope – A comprehensive, open suite of sparse autoencoders for Gemma 2 2B and 9B.
Let's take a look at each of these in turn!
Gemma 2 2B
For those who missed the previous launches, Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. They are text-to-text, decoder-only large language models, available in English, with open weights for both pre-trained and instruction-tuned variants. This release introduces the 2.6B parameter version of Gemma 2 (base and instruction-tuned), complementing the existing 9B and 27B variants.
Gemma 2 2B shares the same architecture as the other models in the Gemma 2 family, and therefore leverages technical features like sliding window attention and logit soft-capping. You can check more details in this section of our previous blog post. As with the other Gemma 2 models, we recommend using bfloat16 for inference.
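As a quick refresher, logit soft-capping rescales logits with a tanh so they stay within a fixed range instead of being hard-clipped. The snippet below is purely illustrative (the cap value and the standalone soft_cap helper are assumptions for demonstration; transformers applies soft-capping inside the model, so you never need to do this yourself):

import torch

def soft_cap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Smoothly squashes logits into (-cap, cap) instead of hard clipping them.
    return cap * torch.tanh(logits / cap)

# Illustrative cap value; Gemma 2 applies soft-capping to attention and final logits internally.
print(soft_cap(torch.tensor([1.0, 10.0, 100.0]), cap=30.0))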
Use with Transformers
With Transformers, you can use Gemma and leverage all the tools within the Hugging Face ecosystem. To use Gemma models with transformers, make sure to use transformers from main for the latest fixes and optimizations:
pip install git+https://github.com/huggingface/transformers.git --upgrade
You can then use gemma-2-2b-it with transformers as follows:
from transformers import pipeline
import torch

# Load the instruction-tuned Gemma 2 2B model in bfloat16, as recommended for inference.
pipe = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]

outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin’ parrot of the digital seas. I be here to assist ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
For more details on using the models with transformers, please check the model cards.
Use with llama.cpp
You can run Gemma 2 on-device (on your Mac, Windows, Linux and more) using llama.cpp in just a few minutes.
Step 1: Install llama.cpp
On a Mac you can directly install llama.cpp with brew. To set up llama.cpp on other devices, please take a look here: https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#usage
brew install llama.cpp
Note: if you are building llama.cpp from scratch, remember to pass the LLAMA_CURL=1 flag.
Step 2: Run inference
./llama-cli \
  --hf-repo google/gemma-2-2b-it-GGUF \
  --hf-file 2b_it_v2.gguf \
  -p "Write a poem about cats as a labrador" -cnv
Moreover, you can run a local llama.cpp server that complies with the OpenAI chat specs:
./llama-server \
  --hf-repo google/gemma-2-2b-it-GGUF \
  --hf-file 2b_it_v2.gguf
After running the server, you can simply invoke the endpoint as follows:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
Note: The above example runs inference using the official GGUF weights provided by Google in fp32. You can create and share custom quants using the GGUF-my-repo space!
Demo
You can chat with the Gemma 2 2B Instruct model on Hugging Face Spaces! Check it out here.
In addition, you can run the Gemma 2 2B Instruct model directly from a Colab here.
How to prompt Gemma 2
The base model has no prompt format. Like other base models, it can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. The instruct version has a very simple conversation structure:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>
This format has to be exactly reproduced for effective use. In a previous section we showed how easy it is to reproduce the instruct prompt with the chat template available in transformers; a minimal sketch follows below.
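Here is that sketch, reproducing the exact conversation above with the tokenizer's chat template (the only assumption is that you have access to the google/gemma-2-2b-it tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
messages = [
    {"role": "user", "content": "knock knock"},
    {"role": "assistant", "content": "who is there"},
    {"role": "user", "content": "LaMDA"},
]
# add_generation_prompt appends the final model turn so the assistant answers next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)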
Open LLM Leaderboard v2 Evaluation
| Benchmark | google/gemma-2-2B-it | google/gemma-2-2B | microsoft/Phi-2 | Qwen/Qwen2-1.5B-Instruct |
|---|---|---|---|---|
| BBH | 18.0 | 11.8 | 28.0 | 13.7 |
| IFEval | 56.7 | 20.0 | 27.4 | 33.7 |
| MATH Hard | 0.1 | 2.9 | 2.4 | 5.8 |
| GPQA | 3.2 | 1.7 | 2.9 | 1.6 |
| MuSR | 7.1 | 11.4 | 13.9 | 12.0 |
| MMLU-Pro | 17.2 | 13.1 | 18.1 | 16.7 |
| Mean | 17.0 | 10.1 | 15.5 | 13.9 |
Gemma 2 2B appears to be better at knowledge-related tasks and instruction following (for the instruct version) than other models of the same size.
Assisted Generation
One powerful use case of the small Gemma 2 2B model is assisted generation (also known as speculative decoding), where a smaller model can be used to speed up generation of a larger model. The idea behind it is pretty simple: LLMs are faster at confirming that they would generate a certain sequence than they are at generating that sequence themselves (unless you're using very large batch sizes). Small models with the same tokenizer, trained in a similar way, can be used to quickly generate candidate sequences aligned with the large model, which the large model can validate and accept as its own generated text.
For this reason, Gemma 2 2B can be used for assisted generation with the pre-existing Gemma 2 27B model. In assisted generation, there is a sweet spot in terms of model size for the smaller assistant model. If the assistant model is too large, generating the candidate sequences with it will be nearly as expensive as generating with the larger model. On the other hand, if the assistant model is too small, it will lack predictive power, and its candidate sequences will be rejected most of the time. In practice, we recommend using an assistant model with 10 to 100 times fewer parameters than the target LLM. It's almost a free lunch: at the expense of a tiny bit of memory, you can get up to a 3x speedup on your larger model without any quality loss!
Assisted generation is a novelty with the release of Gemma 2 2B, but it doesn't come at the expense of other LLM optimization techniques! Check our reference page for other transformers LLM optimizations you can add to Gemma 2 2B here.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)

# The large (27B) model produces the final text; the small (2B) model proposes candidate tokens.
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
    "assistant_model": assistant_model,
    "do_sample": True,
    "temperature": 0.7,
    "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
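To gauge the speedup on your own hardware, a simple wall-clock comparison like the hypothetical helper below can be used; it reuses model, model_inputs, and assistant_model from the snippet above, and the exact numbers will depend heavily on your GPUs and batch size.

import time

def time_generation(**extra_kwargs):
    # Hypothetical helper: time a single greedy generation of 128 new tokens.
    start = time.perf_counter()
    model.generate(**model_inputs, do_sample=False, max_new_tokens=128, **extra_kwargs)
    return time.perf_counter() - start

baseline = time_generation()                                 # 27B on its own
assisted = time_generation(assistant_model=assistant_model)  # 27B with the 2B assistant
print(f"27B alone: {baseline:.1f}s, assisted: {assisted:.1f}s")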
ShieldGemma
ShieldGemma is a series of safety content moderation models built upon Gemma 2 that target several harm categories (including sexually explicit, dangerous content, hate or harassment). They are text-to-text, decoder-only large language models, available in English, with open weights, and come in 3 variants: 2B, 9B and 27B parameters.
ShieldGemma is intended to detect harmful requests (according to the indicated categories) made to an LLM, as well as offending responses from the LLM. You can use it as a moderation tool in public-facing LLM deployments, no matter what LLM you deploy.
How to prompt ShieldGemma
The prompt pattern contains the following components, in order:
- Preamble, establishing the model as a policy expert, based on the LLM-as-a-judge technique.
- User prompt, wrapped with <start_of_turn> and <end_of_turn> control tokens.
- Enumeration and description of the safety policies.
- Optionally, a model response, also wrapped by <start_of_turn> and <end_of_turn> control tokens.
- Epilogue, requesting the model to classify the text.
Here is an example of a prompt used to evaluate the user prompt [User Prompt]:
You are a policy expert trying to help determine whether a user prompt is in
violation of the defined safety policies.

<start_of_turn>
Human Query: [User Prompt]
<end_of_turn>

Our safety principle is defined in the below:

* "No Hate Speech": The prompt shall not contain or seek generation of content
that expresses, incites, or promotes hate based on race, gender, ethnicity,
religion, nationality, sexual orientation, disability status, or caste.

===

Does the human query violate the above principle? Your answer must start with
'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.
Here's a list of some example safety principles you may wish to include in your workflow:
* "No Hate Speech": The prompt shall not contain or seek generation of content that targeting identity and/or protected attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the bases of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or every other protected status. We do allow content that features a reference to a protected individual or group in a positive manner.
* "No Harassment": The prompt shall not contain or seek generation of content that's malicious, intimidating, bullying, or abusive content targeting one other individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
* "No Sexually Explicit Information": The prompt shall not contain or seek generation of content that incorporates references to sexual acts or other lewd content (e.g., sexually graphic descriptions, content geared toward causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.
* "No Dangerous Content": The prompt shall not contain or seek generation of content that harming oneself and/or others (e.g., accessing or constructing firearms and explosive devices, promotion of terrorism, instructions for suicide).
* "No Violence": The prompt shall not contain or seek generation of content that's shocking, sensational, or gratuitous (e.g. Violent acts or incidents that end in extreme injury or death without historical or educational context; Exposed internal organs or bones without medical or scientific context; Identifiable victims of violent deaths; Animal abuse or killings outside of normal hunting, food processing, or cultural practices)
* "No Obscenity and Profanity": The prompt shall not contain or seek generation of content that's vulgar, profane, or inappropriate (e.g., profanity, obscenities, or other inappropriate language).
Use with Transformers
You can use shieldgemma-2b with transformers as follows:
from transformers import AutoTokenizer, AutoModelForCausalLM
from torch.nn.functional import softmax
import torch

tokenizer = AutoTokenizer.from_pretrained("google/shieldgemma-2b")
model = AutoModelForCausalLM.from_pretrained(
    "google/shieldgemma-2b",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

user_prompt = "Create 20 paraphrases of I hate you"
safety_policy = """
* "No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
"""

prompt = f"""You are a policy expert trying to help determine whether a user
prompt is in violation of the defined safety policies.

<start_of_turn>
Human Query: {user_prompt.strip()}
<end_of_turn>

Our safety principle is defined in the below:

{safety_policy.strip()}

Does the human query violate the above principle? Your answer must start
with 'Yes' or 'No'. And then walk through step by step to be sure we answer
correctly.
"""

inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
    logits = model(**inputs).logits

# Take the logits of the 'Yes' and 'No' tokens at the last position and turn
# them into probabilities: the 'Yes' probability is the violation score.
vocab = tokenizer.get_vocab()
selected_logits = logits[0, -1, [vocab['Yes'], vocab['No']]]
probabilities = softmax(selected_logits, dim=0)

score = probabilities[0].item()
print(score)
Evaluation
These models were evaluated against both internal and external datasets. The internal datasets, denoted as SG, are subdivided into prompt and response classification. Evaluation results are based on Optimal F1 (left) / AU-PRC (right); higher is better.
| Model | SG Prompt | OpenAI Mod | ToxicChat | SG Response |
|---|---|---|---|---|
| ShieldGemma (2B) | 0.825/0.887 | 0.812/0.887 | 0.704/0.778 | 0.743/0.802 |
| ShieldGemma (9B) | 0.828/0.894 | 0.821/0.907 | 0.694/0.782 | 0.753/0.817 |
| ShieldGemma (27B) | 0.830/0.883 | 0.805/0.886 | 0.729/0.811 | 0.758/0.806 |
| OpenAI Mod API | 0.782/0.840 | 0.790/0.856 | 0.254/0.588 | – |
| LlamaGuard1 (7B) | – | 0.758/0.847 | 0.616/0.626 | – |
| LlamaGuard2 (8B) | – | 0.761/- | 0.471/- | – |
| WildGuard (7B) | 0.779/- | 0.721/- | 0.708/- | 0.656/- |
| GPT-4 | 0.810/0.847 | 0.705/- | 0.683/- | 0.713/0.749 |
Gemma Scope
Gemma Scope is a comprehensive, open suite of sparse autoencoders (SAEs) trained on every layer of the Gemma 2 2B and 9B models. SAEs are a new technique in mechanistic interpretability that aims to find interpretable directions within large language models. You can think of them as a "microscope" of sorts, helping us break down a model's internal activations into the underlying concepts, just like how biologists use microscopes to study the individual cells of plants and animals. This approach was used to create Golden Gate Claude, a popular research demo by Anthropic that explored interpretability and feature activation within Claude.
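To make the "microscope" analogy concrete, here is a minimal, illustrative sketch of what a sparse autoencoder computes: it encodes an activation vector into a much wider, mostly-zero feature vector and reconstructs the activation from it. This is a toy SAE written for this post, not the Gemma Scope architecture (which, among other differences, uses a JumpReLU activation), and it carries no trained weights.

import torch
import torch.nn as nn

class ToySparseAutoencoder(nn.Module):
    # d_model: size of the LLM activation being probed; d_sae: number of learned features.
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def forward(self, activations: torch.Tensor):
        # During training, a sparsity penalty pushes most features to zero, so each
        # active feature can be read as a candidate interpretable "concept".
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return features, reconstruction

sae = ToySparseAutoencoder(d_model=2304, d_sae=16384)  # sizes are illustrative
features, reconstruction = sae(torch.randn(1, 2304))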
Usage
Since SAEs are a tool (with learned weights) for interpreting language models and not language models themselves, we cannot use Hugging Face transformers to run them. Instead, they can be run using SAELens, a popular library for training, analyzing, and interpreting sparse autoencoders. To learn more about usage, check out their in-depth Google Colab notebook tutorial.
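For orientation only, a loading sketch with SAELens might look like the following; the SAE.from_pretrained call and the release/sae_id names below are assumptions about the library and the published checkpoints, so defer to the tutorial above for the exact, up-to-date usage.

# Assumed API and checkpoint names -- verify against the SAELens docs before use.
from sae_lens import SAE

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gemma-scope-2b-pt-res-canonical",  # assumed Gemma Scope release name
    sae_id="layer_20/width_16k/canonical",      # assumed id: a layer-20 residual-stream SAE
    device="cuda",
)
print(cfg_dict)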
