Llama 2 learns to code



Code Llama is a family of state-of-the-art, open-access versions of Llama 2 specialized for code tasks, and we’re excited to release its integration in the Hugging Face ecosystem! Code Llama has been released under the same permissive community license as Llama 2 and is available for commercial use.

Today, we’re excited to release:

  • Models on the Hub with their model cards and license
  • Transformers integration
  • Integration with Text Generation Inference for fast and efficient production-ready inference
  • Integration with Inference Endpoints
  • Integration with VS Code extension
  • Code benchmarks

Code LLMs are an exciting development for software engineers because they can boost productivity through code completion in IDEs, take care of repetitive or annoying tasks like writing docstrings, or create unit tests.






What’s Code Llama?

The Code Llama release introduces a family of models with 7, 13, and 34 billion parameters. The base models are initialized from Llama 2 and then trained on 500 billion tokens of code data. Meta fine-tuned those base models into two different flavors: a Python specialist (100 billion additional tokens) and an instruction fine-tuned version, which can understand natural language instructions.

The models show state-of-the-art performance in Python, C++, Java, PHP, C#, TypeScript, and Bash. The 7B and 13B base and instruct variants support infilling based on surrounding content, making them ideal for use as code assistants.

Code Llama was trained on a 16k context window. In addition, the three model variants had additional long-context fine-tuning, allowing them to manage a context window of up to 100,000 tokens.

Increasing Llama 2’s 4k context window to Code Llama’s 16k (which can extrapolate up to 100k) was possible thanks to recent developments in RoPE scaling. The community found that Llama’s position embeddings can be interpolated linearly or in the frequency domain, which eases the transition to a larger context window through fine-tuning. In the case of Code Llama, the frequency domain scaling is done with a slack: the fine-tuning length is a fraction of the scaled pretrained length, giving the model powerful extrapolation capabilities.
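As an illustration of the linear variant, transformers exposes RoPE scaling through the rope_scaling config option. The snippet below is a minimal sketch with assumed values; the checkpoint and scaling factor are placeholders, not Meta’s Code Llama recipe:

from transformers import AutoModelForCausalLM

# Minimal sketch: linear RoPE interpolation on a Llama-architecture checkpoint.
# A factor of 4.0 stretches the 4k pretraining window toward ~16k positions;
# quality at the longer lengths still depends on additional fine-tuning.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",                       # placeholder checkpoint
    rope_scaling={"type": "linear", "factor": 4.0},   # illustrative factor
)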

Training Process

All models were initially trained with 500 billion tokens on a near-deduplicated dataset of publicly available code. The dataset also contains some natural language data, such as discussions about code and code snippets. Unfortunately, there isn’t more information about the dataset.

For the instruction model, they used two datasets: the instruction tuning dataset collected for Llama 2 Chat and a self-instruct dataset. The self-instruct dataset was created by using Llama 2 to create interview programming questions and then using Code Llama to generate unit tests and solutions, which are later evaluated by executing the tests.



How to use Code Llama?

Code Llama is available in the Hugging Face ecosystem, starting with transformers version 4.33.



Demo

You can easily try the Code Llama Model (13 billion parameters!) in this Space or in the playground embedded below:

Under the hood, this playground uses Hugging Face’s Text Generation Inference, the same technology that powers HuggingChat, and we’ll share more in the following sections.

If you want to try out the bigger instruct-tuned 34B model, it’s now available on HuggingChat! You can try it out here: hf.co/chat. Make sure to select the Code Llama model. You can also check this chat-based demo and duplicate it for your own use; it’s self-contained, so you can examine the source code and adapt it as you wish!



Transformers

Starting with transformers 4.33, you can use Code Llama and leverage all the tools within the HF ecosystem, such as:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization) and PEFT (parameter efficient fine-tuning)
  • utilities and helpers to run generation with the model
  • mechanisms to export the models to deploy
!pip install --upgrade transformers



A Note on dtypes

When using models like Code Llama, it’s important to take a look at the data types of the models.

  • 32-bit floating point (float32): PyTorch’s convention on model initialization is to load models in float32, no matter which precision the model weights were stored in. transformers also follows this convention for consistency with PyTorch.
  • 16-bit Brain floating point (bfloat16): Code Llama was trained with this precision, so we recommend using it for further training or fine-tuning.
  • 16-bit floating point (float16): We recommend running inference using this precision, as it’s usually faster than bfloat16, and evaluation metrics show no discernible degradation with respect to bfloat16. You can also run inference using bfloat16, and we recommend you check inference results with both float16 and bfloat16 after fine-tuning.

As mentioned above, transformers loads weights using float32 (no matter the precision in which the models are stored), so it’s important to specify the desired dtype when loading the models. If you want to fine-tune Code Llama, it’s recommended to use bfloat16, as using float16 can lead to overflows and NaNs. If you run inference, we recommend using float16 because bfloat16 can be slower.
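For example, a minimal sketch of loading one of the checkpoints in float16 for inference looks like this (use torch.bfloat16 instead if you plan to fine-tune):

import torch
from transformers import AutoModelForCausalLM

# Load the weights directly in 16-bit precision instead of the float32 default.
model = AutoModelForCausalLM.from_pretrained(
    "codellama/CodeLlama-7b-hf",
    torch_dtype=torch.float16,   # swap for torch.bfloat16 when fine-tuning
    device_map="auto",
)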



Code Completion

The 7B and 13B models can be used for text/code completion or infilling. The following code snippet uses the pipeline interface to demonstrate text completion. It runs on the free tier of Colab, as long as you select a GPU runtime.

from transformers import AutoTokenizer
import transformers
import torch

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")
pipeline = transformers.pipeline(
    "text-generation",
    model="codellama/CodeLlama-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)

sequences = pipeline(
    'def fibonacci(',
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=100,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

This may produce output like the following:

Result: def fibonacci(n):
    if n == 0:
        return 0
    elif n == 1:
        return 1
    else:
        return fibonacci(n-1) + fibonacci(n-2)

def fibonacci_memo(n, memo={}):
    if n == 0:
        return 0
    elif n == 1:
        return

Code Llama is specialized in code understanding, but it’s a language model in its own right. You can use the same generation strategy to autocomplete comments or general text.



Code Infilling

This is a specialized task particular to code models. The model is trained to generate the code (including comments) that best matches an existing prefix and suffix. This is the strategy typically used by code assistants: they are asked to fill the current cursor position, considering the contents that appear before and after it.

This task is available in the base and instruction variants of the 7B and 13B models. It’s not available for any of the 34B models or the Python versions.

To use this feature successfully, you need to pay close attention to the format used to train the model for this task, as it uses special separators to identify the different parts of the prompt. Fortunately, transformers’ CodeLlamaTokenizer makes this very easy, as demonstrated below:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model_id = "codellama/CodeLlama-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16
).to("cuda")

prompt = '''def remove_non_ascii(s: str) -> str:
    """ <FILL_ME>
    return result
'''

input_ids = tokenizer(prompt, return_tensors="pt")["input_ids"].to("cuda")
output = model.generate(
    input_ids,
    max_new_tokens=200,
)
output = output[0].to("cpu")

filling = tokenizer.decode(output[input_ids.shape[1]:], skip_special_tokens=True)
print(prompt.replace("<FILL_ME>", filling))
def remove_non_ascii(s: str) -> str:
    """ Remove non-ASCII characters from a string.

    Args:
        s: The string to remove non-ASCII characters from.

    Returns:
        The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c
    return result

Under the hood, the tokenizer automatically splits by <FILL_ME> to create a formatted input string that follows the original training pattern. This is more robust than preparing the pattern yourself: it avoids pitfalls, such as token glueing, that are very hard to debug.



Conversational Instructions

The base model can be used for both completion and infilling, as described. The Code Llama release also includes an instruction fine-tuned model that can be used in conversational interfaces.

To prepare inputs for this task we have to use a prompt template like the one described in our Llama 2 blog post, which we reproduce here:

<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>

{{ user_msg_1 }} [/INST] {{ model_answer_1 }} </s><s>[INST] {{ user_msg_2 }} [/INST]

Note that the system prompt is optional: the model will work without it, but you can use it to further configure its behavior or style. For example, if you always want to get answers in JavaScript, you can state that here. After the system prompt, you need to provide all the previous interactions in the conversation: what was asked by the user and what the model answered. As in the infilling case, you need to pay attention to the delimiters used. The final component of the input must always be a new user instruction, which will be the signal for the model to provide an answer.

The following code snippets show how the template works in practice.

  1. First user query, no system prompt
user = 'In Bash, how do I list all text files in the current directory (excluding subdirectories) that have been modified in the last month?'

prompt = f"<s>[INST] {user.strip()} [/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
  2. First user query with system prompt
system = "Provide answers in JavaScript"
user = "Write a function that computes the set of sums of all contiguous sublists of a given list."

prompt = f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user}[/INST]"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
  3. Ongoing conversation with previous answers

The process is the same as in Llama 2. We haven’t used loops or generalized this example code for maximum clarity:

system = "System prompt"
user_1 = "user_prompt_1"
answer_1 = "answer_1"
user_2 = "user_prompt_2"
answer_2 = "answer_2"
user_3 = "user_prompt_3"

prompt  = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user_1}"
prompt  = f"<s>[INST] {prompt.strip()} [/INST] {answer_1.strip()} </s>"
prompt += f"<s>[INST] {user_2.strip()} [/INST] {answer_2.strip()} </s>"
prompt += f"<s>[INST] {user_3.strip()} [/INST]"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")
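Generation then works exactly as in the previous examples; the following is a minimal sketch assuming a model and tokenizer loaded from one of the Instruct checkpoints:

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
# Decode only the newly generated tokens, i.e. the model's latest answer.
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)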



4-bit Loading

Integration of Code Llama in Transformers means that you get immediate support for advanced features like 4-bit loading. This allows you to run the big 34B parameter models on consumer GPUs like the NVIDIA 3090!

Here’s how you can run inference in 4-bit mode:

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_id = "codellama/CodeLlama-34b-hf"
quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)

prompt = 'def remove_non_ascii(s: str) -> str:\n    """ '
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

output = model.generate(
    inputs["input_ids"],
    max_new_tokens=200,
    do_sample=True,
    top_p=0.9,
    temperature=0.1,
)
output = output[0].to("cpu")
print(tokenizer.decode(output))



Using text-generation-inference and Inference Endpoints

Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

You can try out Text Generation Inference on your own infrastructure, or you can use Hugging Face’s Inference Endpoints. To deploy a Code Llama model, go to the model page and click on the Deploy -> Inference Endpoints widget.

  • For 7B models, we advise you to select “GPU [medium] – 1x Nvidia A10G”.
  • For 13B models, we advise you to select “GPU [xlarge] – 1x Nvidia A100”.
  • For 34B models, we advise you to select “GPU [1xlarge] – 1x Nvidia A100” with bitsandbytes quantization enabled, or “GPU [2xlarge] – 2x Nvidia A100”.

Note: You might need to request a quota upgrade via email to api-enterprise@huggingface.co to access A100s.

You can learn more about how to Deploy LLMs with Hugging Face Inference Endpoints in our blog. The blog includes information about supported hyperparameters and how to stream your response using Python and JavaScript.
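Once an endpoint is up, you can also query it programmatically. Here’s a minimal sketch using the huggingface_hub client; the endpoint URL is a placeholder for your own deployment:

from huggingface_hub import InferenceClient

# Point the client at your deployed endpoint (placeholder URL).
client = InferenceClient("https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

# Stream generated tokens as they arrive.
for token in client.text_generation("def fibonacci(", max_new_tokens=100, temperature=0.2, stream=True):
    print(token, end="")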



Using VS Code extension

HF Code Autocomplete is a VS Code extension for testing open source code completion models. The extension was developed as part of the StarCoder project and was updated to support the medium-sized base model, Code Llama 13B. Find more here on how to install and run the extension with Code Llama.

VS Code extension



Evaluation

Language models for code are typically benchmarked on datasets such as HumanEval. It consists of programming challenges where the model is presented with a function signature and a docstring and is tasked with completing the function body. The proposed solution is then verified by running a set of predefined unit tests. Finally, a pass rate is reported that describes how many solutions passed all tests. The pass@1 rate describes how often the model generates a passing solution in a single attempt, while pass@10 describes how often at least one solution passes out of 10 proposed candidates.
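For reference, pass@k is usually computed with the unbiased estimator from the Codex paper rather than by literally drawing k samples; a minimal sketch:

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n generated samples, c of which pass all tests."""
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failing ones
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=10, c=3, k=1))  # 0.3: with 3 of 10 samples passing, pass@1 is 0.3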

While HumanEval is a Python benchmark, there have been significant efforts to translate it to more programming languages and thus enable a more holistic evaluation. One such approach is MultiPL-E, which translates HumanEval to over a dozen languages. We’re hosting a multilingual code leaderboard based on it so the community can compare models across different languages and evaluate which model fits their use case best.

| Model | License | Dataset known | Commercial use? | Pretraining length [tokens] | Python | JavaScript | Leaderboard Avg Score |
|---|---|---|---|---|---|---|---|
| CodeLlaMa-34B | Llama 2 license | | | 2,500B | 45.11 | 41.66 | 33.89 |
| CodeLlaMa-13B | Llama 2 license | | | 2,500B | 35.07 | 38.26 | 28.35 |
| CodeLlaMa-7B | Llama 2 license | | | 2,500B | 29.98 | 31.8 | 24.36 |
| CodeLlaMa-34B-Python | Llama 2 license | | | 2,620B | 53.29 | 44.72 | 33.87 |
| CodeLlaMa-13B-Python | Llama 2 license | | | 2,620B | 42.89 | 40.66 | 28.67 |
| CodeLlaMa-7B-Python | Llama 2 license | | | 2,620B | 40.48 | 36.34 | 23.5 |
| CodeLlaMa-34B-Instruct | Llama 2 license | | | 2,620B | 50.79 | 45.85 | 35.09 |
| CodeLlaMa-13B-Instruct | Llama 2 license | | | 2,620B | 50.6 | 40.91 | 31.29 |
| CodeLlaMa-7B-Instruct | Llama 2 license | | | 2,620B | 45.65 | 33.11 | 26.45 |
| StarCoder-15B | BigCode-OpenRail-M | | | 1,035B | 33.57 | 30.79 | 22.74 |
| StarCoderBase-15B | BigCode-OpenRail-M | | | 1,000B | 30.35 | 31.7 | 22.4 |
| WizardCoder-15B | BigCode-OpenRail-M | | | 1,035B | 58.12 | 41.91 | 32.07 |
| OctoCoder-15B | BigCode-OpenRail-M | | | 1,000B | 45.3 | 32.8 | 24.01 |
| CodeGeeX-2-6B | CodeGeeX License | | | 2,000B | 33.49 | 29.9 | 21.23 |
| CodeGen-2.5-7B-Mono | Apache-2.0 | | | 1,400B | 45.65 | 23.22 | 12.1 |
| CodeGen-2.5-7B-Multi | Apache-2.0 | | | 1,400B | 28.7 | 26.27 | 20.04 |

Note: The scores presented in the table above were sourced from our code leaderboard at the time of publication. Scores change as new models are released, because models are compared against one another. For more details, please refer to the leaderboard.



Additional Resources


