CodeGemma - an official Google release for code LLMs



CodeGemma is a family of open-access versions of Gemma specialized in code, and we’re excited to collaborate with Google on its release to make it as accessible as possible.🤗

CodeGemma is available in three flavors:

  • A 2B base model specialized in infilling and open-ended generation.
  • A 7B base model trained with both code infilling and natural language.
  • A 7B instruct model that a user can chat with about code.

We’ve collaborated with Google to ensure the best integration into the Hugging Face ecosystem. You can find the three open-access models ready to use on the Hub. Among the features and integrations being released, we have:

  • Models on the Hub, with their model cards and licenses. There are versions for the transformers library, checkpoints to be used with Google’s original codebases, and full-precision GGUF files that the community can quantize.
  • Transformers integration
  • Integration with Google Cloud
  • Integration with Inference Endpoints
  • Code benchmarks






What’s CodeGemma?

CodeGemma is a family of code-specialized LLMs by Google, based on the pre-trained 2B and 7B Gemma checkpoints. The CodeGemma models are further trained on an additional 500 billion tokens of primarily English language data, mathematics, and code to enhance logical and mathematical reasoning, and are suitable for code completion and generation.

CodeGemma 2B was trained exclusively on code infilling and is intended for fast code completion and generation, especially in settings where latency and/or privacy are crucial. The CodeGemma 7B training mix includes code infilling data (80%) and natural language. It can be used for code completion, as well as code and language understanding and generation. CodeGemma 7B Instruct was fine-tuned for instruction following on top of CodeGemma 7B. It’s meant for conversational use, especially around code, programming, or mathematical reasoning topics. All the models have the same 8K token context size as their predecessors.

The CodeGemma family

This image is from the original report



Evaluation Results

CodeGemma-7B outperforms similarly-sized 7B models except DeepSeek-Coder-7B on HumanEval, a popular benchmark for evaluating code models on Python. The same goes for the evaluation of other programming languages like Java, JavaScript, and C++ from MultiPL-E, a translation of HumanEval. According to the technical report, the model performs best on GSM8K among 7B models. The instruct version CodeGemma-7B-it improves on the most popular languages on both HumanEval and MBPP (cf. paper table 5). For more details, you can check the BigCode leaderboard or some metrics below.

| Model | Pretraining size [tokens] | Python | JavaScript |
| --- | --- | --- | --- |
| **10B+ models** | | | |
| StarCoder 2 15B | 4,000B+ | 44.15 | 44.24 |
| Code Llama 13B | 2,500B | 35.07 | 38.26 |
| **7B models** | | | |
| DeepSeek Coder 7B | 2,000B | 45.83 | 45.9 |
| CodeGemma 7B | 500B of additional training | 40.13 | 43.06 |
| Code Llama 7B | 2,500B | 29.98 | 31.8 |
| StarCoder 2 7B | 3,500B+ | 34.09 | 35.35 |
| StarCoderBase 7B | 3,000B+ | 28.37 | 27.35 |
| **<3B models** | | | |
| CodeGemma 2B | 500B of additional training | 27.28 | 29.94 |
| Stable Code 3B | 1,300B | 30.72 | 28.75 |
| StarCoder 2 3B | 3,000B+ | 31.44 | 35.37 |

| Model | Pretraining size [tokens] | Python | JavaScript |
| --- | --- | --- | --- |
| **10B+ models** | | | |
| Code Llama 13B | 2,620B | 50.6 | 40.92 |
| Code Llama 13B | 2,620B | 42.89 | 40.66 |
| **7B models** | | | |
| CodeGemma 7B | 500B | 52.74 | 47.71 |
| Code Llama 7B | 2,620B | 40.48 | 36.34 |
| Code Llama 7B | 2,620B | 25.65 | 33.11 |

Here’s a table from the original report with a breakdown per language.

CodeGemma quality across languages



Prompt format

CodeGemma 2B and CodeGemma 7B use infilling (code, comments, docstrings, import statements) for code completion. CodeGemma was trained for this task using the fill-in-the-middle (FIM) objective, where you provide a prefix and a suffix as context for the completion. The following tokens are used to separate the different parts of the input:

  • <|fim_prefix|> precedes the context before the completion we want to run.
  • <|fim_suffix|> precedes the suffix. You must place this token exactly where the cursor would be positioned in an editor, as this is the location where the model will complete the code.
  • <|fim_middle|> is the prompt that invites the model to run the generation.

In addition to these, there’s also <|file_separator|>, which is used to provide multi-file contexts. We’ll show examples of its use in the Using Transformers section.
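As a quick illustration, here is a minimal sketch of how a FIM prompt can be assembled from the code before and after the cursor. The build_fim_prompt helper is our own, introduced only for this example; it is not part of any library.

# Minimal sketch: assembling a fill-in-the-middle prompt from the code
# surrounding the cursor. The helper name is hypothetical.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

before_cursor = "def add(a, b):\n    return "
after_cursor = "\n"
prompt = build_fim_prompt(before_cursor, after_cursor)
# The model is expected to generate the code that belongs where the cursor is,
# i.e. between the prefix and the suffix.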

CodeGemma 7B Instruct uses the same prompt format as the base Gemma instruction-tuned versions, following this conversation structure:

<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>

As is the case with Gemma, the easiest way to reproduce this format is with the chat template available in transformers.
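For instance, here is a minimal sketch of rendering that structure with the chat template. It assumes the instruct model is available on the Hub as google/codegemma-7b-it; adjust the id to the checkpoint you are using.

from transformers import AutoTokenizer

# Checkpoint id assumed for the instruct model.
tokenizer = AutoTokenizer.from_pretrained("google/codegemma-7b-it")

messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]

# Renders the conversation with the Gemma turn markers and appends the
# generation prompt for the model turn.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)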



Using CodeGemma



Demo

You can easily try the CodeGemma Model (7 billion parameters!) in this Space or in the Chatbot embedded below:

Under the hood, this playground uses the Transformers implementation. You can also duplicate the Space for your own use – it’s self-contained, so you can examine the source code and adapt it as you wish!



Using Transformers

With Transformers release 4.39, you can use CodeGemma and leverage all the tools within the Hugging Face ecosystem, such as:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and Flash Attention 2
  • utilities and helpers to run generation with the model
  • mechanisms to export the models to deploy

Like the Gemma models, CodeGemma is compatible with torch.compile() for a significant inference speedup.
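As an illustration, here is a minimal sketch of compiling the 2B model for generation. It assumes a CUDA GPU and a transformers version with static-cache support; the exact pattern and the speedup you get may vary with your PyTorch and transformers versions.

import torch
from transformers import AutoModelForCausalLM, GemmaTokenizer

model_id = "google/codegemma-2b"
tokenizer = GemmaTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda:0")

# Use a static KV cache so the compiled graph has fixed shapes, then compile
# the forward pass. The first generations are slower while kernels compile.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0]))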

Bonus: We made a Colab notebook for you to try out the model at the touch of a button here.

To use CodeGemma with transformers, make sure to use the latest release:

pip install --upgrade transformers

The following snippet shows how to use codegemma-2b for code completion with transformers. It requires about 6 GB of RAM using float16 precision, making it perfectly suitable for consumer GPUs and on-device applications.

from transformers import GemmaTokenizer, AutoModelForCausalLM
import torch

model_id = "google/codegemma-2b"
tokenizer = GemmaTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cuda:0")

prompt = '''
<|fim_prefix|>import datetime
def calculate_age(birth_year):
    """Calculates an individual's age based on their birth yr."""
    current_year = datetime.date.today().yr
    <|fim_suffix|>
    return age<|fim_middle|>
'''

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
prompt_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][prompt_len:]))

Observe that the <|fim_suffix|> token appears in the position where the cursor would be placed in an editor, marking the position for the generation. <|fim_prefix|> provides the context that precedes the cursor, and the remaining text until <|fim_middle|> is additional context after the cursor. Either of them can be empty if the cursor is positioned at the beginning or end of the file.

The previous code may return something like the following:

age = current_year - birth_year<|file_separator|>test_calculate_age.py
<|fim_suffix|>
    assert calculate_age(1990) == 33
    assert calculate_age(1980) == 43
    assert calculate_age(1970) == 53
    assert calculate_age(1960) == 63
    assert calculate_age(1950) == 73

Note the extra content after the correct completion. This is particularly the case for CodeGemma 7B, which is more verbose and tends to provide additional code or comments after completion. We must ignore everything that appears after the FIM tokens or the EOS token for code infilling. We can stop generation early with transformers by providing a list of terminators to the generate function, like this:

FIM_PREFIX = '<|fim_prefix|>'
FIM_SUFFIX = '<|fim_suffix|>'
FIM_MIDDLE = '<|fim_middle|>'
FIM_FILE_SEPARATOR = '<|file_separator|>'

terminators = tokenizer.convert_tokens_to_ids(
    [FIM_PREFIX, FIM_MIDDLE, FIM_SUFFIX, FIM_FILE_SEPARATOR]
)
terminators += [tokenizer.eos_token_id]

outputs = model.generate(
  **inputs,
  max_new_tokens=100,
  eos_token_id=terminators,
)

In this case, generation will stop as soon as the first delimiter is found:

age = current_year - birth_year<|file_separator|>



A note on precision

The original CodeGemma checkpoints are released in bfloat16 precision. If you load the model without indicating a torch_dtype, PyTorch will upcast them to float32. Casting to float16 is perfectly fine for use, and it can be much faster than bfloat16 on certain hardware. For maximum precision, we recommend you use bfloat16 rather than float32.
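For example, a minimal sketch of loading the 2B checkpoint directly in bfloat16, mirroring the earlier snippet:

import torch
from transformers import AutoModelForCausalLM, GemmaTokenizer

model_id = "google/codegemma-2b"
tokenizer = GemmaTokenizer.from_pretrained(model_id)

# Load the weights in their native bfloat16 precision instead of letting
# PyTorch upcast them to float32.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to("cuda:0")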

You can also automatically quantize the model, loading it in 8-bit or 4-bit mode. 4-bit loading of CodeGemma 7B takes about 9 GB of memory to run, making it compatible with many consumer cards and all the GPUs in Google Colab. This is how you’d load the generation pipeline in 4-bit:

import torch
from transformers import pipeline

# Load the 7B checkpoint quantized to 4-bit; computation runs in float16.
# The model id below is assumed; point it at the checkpoint you want to use.
pipe = pipeline(
    "text-generation",
    model="google/codegemma-7b",
    model_kwargs={
        "torch_dtype": torch.float16,
        "quantization_config": {"load_in_4bit": True}
    },
)
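Once the pipeline is created, it can be used for infilling in the same way as the 2B example above; a brief sketch, reusing the earlier prompt:

prompt = '''<|fim_prefix|>import datetime
def calculate_age(birth_year):
    """Calculates a person's age based on their birth year."""
    current_year = datetime.date.today().year
    <|fim_suffix|>
    return age<|fim_middle|>'''

# The pipeline output includes the prompt followed by the generated completion.
outputs = pipe(prompt, max_new_tokens=100)
print(outputs[0]["generated_text"])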



Integration with Google Cloud

You can deploy and train Gemma on Google Cloud through Vertex AI or Google Kubernetes Engine (GKE), using Text Generation Inference and Transformers.

To deploy the CodeGemma model from Hugging Face, go to the model page and click on Deploy -> Google Cloud. This will bring you to the Google Cloud Console, where you can 1-click deploy CodeGemma on Vertex AI or GKE, powered by Text Generation Inference.

You can also access CodeGemma directly through the Vertex AI Model Garden.

GCP Integration



Integration with Inference Endpoints

You can deploy CodeGemma on Hugging Face’s Inference Endpoints, which uses Text Generation Inference as the backend. Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, production-ready logging and tracing, and is distributed under the Apache 2 license.

To deploy a CodeGemma model, go to the model page and click on the Deploy -> Inference Endpoints widget. You can learn more about Deploying LLMs with Hugging Face Inference Endpoints in a previous blog post. Note that T4s don’t support the bfloat16 format, so you’ll need to use a different GPU option.

from huggingface_hub import InferenceClient

client = InferenceClient(model=IE_ENDPOINT)

prompt = """
<|fim_prefix|>import <|fim_suffix|>

if __name__ == '__main__':
  sys.exit(0)<|fim_middle|>
"""

client.text_generation(prompt=prompt)
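The call returns the completion as a string; a brief sketch of a call that also caps the number of generated tokens (the value is illustrative):

# Cap the number of generated tokens and print the completion.
completion = client.text_generation(prompt=prompt, max_new_tokens=64)
print(completion)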





