CodeGemma is a family of open-access versions of Gemma specialized in code, and we’re excited to collaborate with Google on its release to make it as accessible as possible.🤗
CodeGemma is available in three flavors:
- A 2B base model specialized in infilling and open-ended generation.
- A 7B base model trained with both code infilling and natural language.
- A 7B instruct model that a user can chat with about code.
We’ve collaborated with Google to ensure the best integration into the Hugging Face ecosystem. You can find the three open-access models ready to use on the Hub. Among the features and integrations being released, we have:
- Models on the Hub, with their model cards and licenses. There are versions for the transformers library, checkpoints to be used with Google’s original codebases, and full-precision GGUF files that the community can quantize.
- Transformers integration
- Integration with Google Cloud
- Integration with Inference Endpoints
- Code benchmarks
Table of contents
- What’s CodeGemma?
- Evaluation Results
- Prompt format
- Using CodeGemma
- Integration with Google Cloud
- Integration with Inference Endpoints
What’s CodeGemma?
CodeGemma is a family of code-specialized LLMs by Google, based on the pre-trained 2B and 7B Gemma checkpoints. The CodeGemma models are further trained on an additional 500 billion tokens of primarily English language data, mathematics, and code to improve logical and mathematical reasoning, and are suitable for code completion and generation.
CodeGemma 2B was trained exclusively on code infilling and is meant for fast code completion and generation, especially in settings where latency and/or privacy are crucial. The CodeGemma 7B training mix includes code infilling data (80%) and natural language. It can be used for code completion, as well as code and language understanding and generation. CodeGemma 7B Instruct was fine-tuned for instruction following on top of CodeGemma 7B. It’s meant for conversational use, especially around code, programming, or mathematical reasoning topics. All of the models have the same 8K token context size as their predecessors.
This image is from the original report.
Evaluation Results
CodeGemma-7B outperforms similarly-sized 7B models except DeepSeek-Coder-7B on HumanEval, a popular benchmark for evaluating code models on Python. The same goes for the evaluation of other programming languages like Java, JavaScript, and C++ from MultiPL-E, a translation of HumanEval. According to the technical report, the model performs best on GSM8K among 7B models. The instruct version CodeGemma-7B-it improves on the most popular languages on both HumanEval and MBPP (cf. paper table 5). For more details, you can check the BigCode leaderboard or some metrics below.
| Model | Pretraining size [tokens] | Python | JavaScript |
|---|---|---|---|
| 10B+ models | | | |
| StarCoder 2 15B | 4,000B+ | 44.15 | 44.24 |
| Code Llama 13B | 2,500B | 35.07 | 38.26 |
| 7B models | | | |
| DeepSeek Coder 7B | 2,000B | 45.83 | 45.9 |
| CodeGemma 7B | 500B of additional training | 40.13 | 43.06 |
| Code Llama 7B | 2,500B | 29.98 | 31.8 |
| StarCoder 2 7B | 3,500B+ | 34.09 | 35.35 |
| StarCoderBase 7B | 3,000B+ | 28.37 | 27.35 |
| <3B models | | | |
| CodeGemma 2B | 500B of additional training | 27.28 | 29.94 |
| Stable Code 3B | 1,300B | 30.72 | 28.75 |
| StarCoder 2 3B | 3,000B+ | 31.44 | 35.37 |

| Model | Pretraining size [tokens] | Python | JavaScript |
|---|---|---|---|
| 10B+ models | | | |
| Code Llama 13B | 2,620B | 50.6 | 40.92 |
| Code Llama 13B | 2,620B | 42.89 | 40.66 |
| 7B models | | | |
| CodeGemma 7B | 500B | 52.74 | 47.71 |
| Code Llama 7B | 2,620B | 40.48 | 36.34 |
| Code Llama 7B | 2,620B | 25.65 | 33.11 |
Here’s a table from the original report with a breakdown per language.
Prompt format
CodeGemma 2B and CodeGemma 7B use infilling (code, comments, docstrings, import statements) for code completion. CodeGemma was trained for this task using the fill-in-the-middle (FIM) objective, where you provide a prefix and a suffix as context for the completion. The following tokens are used to separate the different parts of the input:
- <|fim_prefix|> precedes the context before the completion we want to run.
- <|fim_suffix|> precedes the suffix. You must put this token exactly where the cursor would be positioned in an editor, as this is the location where the model will complete the code.
- <|fim_middle|> is the prompt that invites the model to run the generation.
In addition to these, there’s also <|file_separator|>, which is used to provide multi-file contexts. We’ll show examples of use in the Using with transformers section.
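As a quick illustration of how these pieces fit together, here is a minimal sketch that assembles an infilling prompt from the code around the cursor (the build_fim_prompt helper is purely illustrative, not part of any library):

```python
# Purely illustrative helper: assemble a fill-in-the-middle prompt
# from the code before and after the cursor.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",  # code before the cursor
    suffix="\n    return result",   # code after the cursor
)
# The model is expected to generate the missing middle, e.g. "result = a + b".
```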
CodeGemma 7B Instruct uses the same prompt format as the base Gemma Instruction-tuned versions, following this conversation structure:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>
As is the case with Gemma, the easiest way to reproduce this format is with the chat template available in transformers.
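For instance, here is a minimal sketch of rendering a conversation with the chat template (assuming the google/codegemma-7b-it checkpoint; the example message is ours):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/codegemma-7b-it")

messages = [
    {"role": "user", "content": "Write a function that checks if a year is a leap year."},
]

# Render the conversation with the model's chat template and append the
# generation prompt so the model replies in the "model" turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # uses the <start_of_turn>/<end_of_turn> markers shown above
```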
Using CodeGemma
Demo
You can easily try the CodeGemma Model (7 billion parameters!) in this Space or in the Chatbot embedded below:
Under the hood, this playground uses the Transformers implementation. You can also duplicate the Space for your own use – it’s self-contained, so you can examine the source code and adapt it as you wish!
Using Transformers
With Transformers release 4.39, you can use CodeGemma and leverage all the tools within the Hugging Face ecosystem, such as:
- training and inference scripts and examples
- safe file format (safetensors)
- integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and Flash Attention 2
- utilities and helpers to run generation with the model
- mechanisms to export the models to deploy
Like the Gemma models, CodeGemma is compatible with torch.compile() for a significant inference speedup.
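As a rough sketch of what that can look like (the static-cache plus torch.compile pattern below follows the general transformers recipe and is an assumption on our part, not taken from the original post; adjust it to your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# A minimal sketch, assuming a CUDA GPU and a recent transformers release;
# the exact speedup depends on your hardware and generation settings.
model_id = "google/codegemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# Use a static KV cache so the forward pass has fixed shapes, then compile it.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
# The first call triggers compilation; later calls with the same shapes are faster.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```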
Bonus: We made a Colab notebook for you to try out the model at the touch of a button here.
To use CodeGemma with transformers, make sure to use the latest release:
pip install --upgrade transformers
The following snippet shows how to use codegemma-2b for code completion with transformers. It requires about 6 GB of RAM using float16 precision, making it perfectly suitable for consumer GPUs and on-device applications.
from transformers import GemmaTokenizer, AutoModelForCausalLM
import torch
model_id = "google/codegemma-2b"
tokenizer = GemmaTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.float16
).to("cuda:0")
prompt = '''\
<|fim_prefix|>import datetime
def calculate_age(birth_year):
    """Calculates a person's age based on their birth year."""
    current_year = datetime.date.today().year
    <|fim_suffix|>
    return age<|fim_middle|>\
'''
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][prompt_len:]))
Observe that the <|fim_suffix|> token appears in the position where the cursor would be placed in an editor, marking the position for the generation. <|fim_prefix|> provides the context that precedes the cursor, and the remaining content until <|fim_middle|> is additional context after the cursor. Either of them can be empty if the cursor is positioned at the beginning or end of the file.
The previous code may return something like the following:
age = current_year - birth_year<|file_separator|>test_calculate_age.py
<|fim_suffix|>
assert calculate_age(1990) == 33
assert calculate_age(1980) == 43
assert calculate_age(1970) == 53
assert calculate_age(1960) == 63
assert calculate_age(1950) == 73
Note the extra content after the correct completion. This is particularly the case for CodeGemma 7B, which is more verbose and tends to provide additional code or comments after the completion. For code infilling, we must ignore everything that appears after the FIM tokens or the EOS token. We can stop generation early with transformers by providing a list of terminators to the generate function, like this:
FIM_PREFIX = '<|fim_prefix|>'
FIM_SUFFIX = '<|fim_suffix|>'
FIM_MIDDLE = '<|fim_middle|>'
FIM_FILE_SEPARATOR = '<|file_separator|>'
terminators = tokenizer.convert_tokens_to_ids(
[FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_FILE_SEPARATOR]
)
terminators += [tokenizer.eos_token_id]
outputs = model.generate(
**inputs,
max_new_tokens=100,
eos_token_id=terminators,
)
In this case, generation stops as soon as the first delimiter is found:
age = current_year - birth_year<|file_separator|>
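If you prefer to post-process the decoded text instead, a small helper like the one below (purely illustrative, not part of transformers) can trim the completion at the first delimiter, reusing the token constants defined above:

```python
def clean_completion(text: str) -> str:
    """Cut the decoded completion at the first FIM, file-separator, or EOS token."""
    stop_strings = [FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_FILE_SEPARATOR, tokenizer.eos_token]
    for stop in stop_strings:
        text = text.split(stop, 1)[0]
    return text.rstrip()

print(clean_completion("age = current_year - birth_year<|file_separator|>test_calculate_age.py"))
# age = current_year - birth_year
```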
A note on precision
The original CodeGemma checkpoints are released in bfloat16 precision. If you load the model without indicating a torch_dtype, PyTorch will upcast them to float32. Casting to float16 is perfectly fine for use, and it can be much faster than bfloat16 on certain hardware. For maximum precision, we recommend you use bfloat16 rather than float32.
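For example, this is how you could load the model in its native bfloat16 precision (a minimal sketch; adjust the model id and device to your setup):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/codegemma-2b",
    torch_dtype=torch.bfloat16,  # keep the checkpoint's native precision
).to("cuda:0")
```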
You can also automatically quantize the model, loading it in 8-bit or 4-bit mode. 4-bit loading of CodeGemma 7B takes about 9 GB of memory to run, making it compatible with many consumer cards and all the GPUs in Google Colab. This is how you’d load the generation pipeline in 4-bit:
from transformers import pipeline
import torch

model_id = "google/codegemma-7b"  # the 7B base checkpoint discussed above
pipeline = pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True}
    },
)
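Once loaded, the quantized pipeline can be called like any other text-generation pipeline; the prompt and generation settings below are just an illustrative example:

```python
prompt = "<|fim_prefix|>def fibonacci(n):\n    <|fim_suffix|>\n<|fim_middle|>"
outputs = pipeline(prompt, max_new_tokens=64)
print(outputs[0]["generated_text"])
```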
Integration with Google Cloud
You can deploy and train Gemma on Google Cloud through Vertex AI or Google Kubernetes Engine (GKE), using Text Generation Inference and Transformers.
To deploy the CodeGemma model from Hugging Face, go to the model page and click on Deploy -> Google Cloud. This will bring you to the Google Cloud Console, where you can 1-click deploy CodeGemma on Vertex AI or GKE, powered by Text Generation Inference.
You can also access CodeGemma directly through the Vertex AI Model Garden.
Integration with Inference Endpoints
You can deploy CodeGemma on Hugging Face’s Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing, and is distributed under the Apache 2 license.
To deploy a CodeGemma model, go to the model page and click on the Deploy -> Inference Endpoints widget. You can learn more about Deploying LLMs with Hugging Face Inference Endpoints in a previous blog post. Note that T4s do not support the bfloat16 format, so you will need to use a different GPU option.
from huggingface_hub import InferenceClient
client = InferenceClient(model=IE_ENDPOINT)
prompt = """
<|fim_prefix|>import <|fim_suffix|>
if __name__ == '__main__':
sys.exit(0)<|fim_middle|>
"""
client.text_generation(prompt=prompt)
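As a follow-up sketch (the parameters below are our own choices, not from the original post), you can also stream tokens and stop generation at the infilling delimiters:

```python
# Stream the completion token by token and stop at the infilling delimiters.
for token in client.text_generation(
    prompt=prompt,
    max_new_tokens=80,
    stop_sequences=["<|file_separator|>", "<|fim_prefix|>"],
    stream=True,
):
    print(token, end="")
```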



