StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model on 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder.
We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language.
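Autocompletion is not limited to left-to-right continuation: the models can also fill in a span given the code before and after it. A minimal sketch of assembling such a prompt, assuming the `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` special tokens used by the BigCode models:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model is asked to
    generate the span that belongs between prefix and suffix."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# The model's completion would be inserted at the cursor position,
# i.e. between the two arguments of print(... ).
prompt = build_fim_prompt(
    prefix="def hello():\n    print(",
    suffix=")\n",
)
print(prompt)
```

The assembled string is then passed to the tokenizer and model like any other prompt; generation stops when the model emits its end-of-middle token.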
We take several important steps towards a safe open model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make StarCoder publicly available
under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. We believe that with their strong performance, the StarCoder models will serve as a solid foundation for the community to use and adapt to their use cases and products.
Evaluation
We thoroughly evaluated StarCoder and several similar models on a variety of benchmarks. A popular Python benchmark is HumanEval, which tests whether the model can complete functions based on their signature and docstring. We found that both StarCoder and StarCoderBase outperform the largest models, including PaLM, LaMDA, and LLaMA, despite being significantly smaller. They also outperform CodeGen-16B-Mono and OpenAI’s code-cushman-001 (12B) model. One failure case we noticed was that the model would produce `# Solution here` code, probably because that type of code is usually part of exercises. To force the model to generate an actual solution, we added a prompt. This significantly increased the HumanEval score of StarCoder from 34% to over 40%, setting a new state-of-the-art result for open models. We also tried this prompt for CodeGen and StarCoderBase but didn’t observe much difference.
| Model | HumanEval | MBPP |
|---|---|---|
| LLaMA-7B | 10.5 | 17.7 |
| LaMDA-137B | 14.0 | 14.8 |
| LLaMA-13B | 15.8 | 22.0 |
| CodeGen-16B-Multi | 18.3 | 20.9 |
| LLaMA-33B | 21.7 | 30.2 |
| CodeGeeX | 22.9 | 24.4 |
| LLaMA-65B | 23.7 | 37.7 |
| PaLM-540B | 26.2 | 36.8 |
| CodeGen-16B-Mono | 29.3 | 35.3 |
| StarCoderBase | 30.4 | 49.0 |
| code-cushman-001 | 33.5 | 45.9 |
| StarCoder | 33.6 | 52.7 |
| StarCoder-Prompted | 40.8 | 49.5 |
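The HumanEval and MBPP numbers in the table are pass@1 percentages. For reference, the standard unbiased pass@k estimator introduced with Codex can be sketched as:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n-c, k) / C(n, k), where n samples were drawn per
    problem and c of them passed the unit tests."""
    if n - c < k:
        # Fewer failures than k: any k-subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per problem, 67 of which pass the tests.
print(round(pass_at_k(n=200, c=67, k=1), 3))  # → 0.335
```

The per-problem estimates are then averaged over the benchmark to obtain the reported score.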
An interesting aspect of StarCoder is that it is multilingual, so we evaluated it on MultiPL-E, which extends HumanEval to many other languages. We observed that StarCoder matches or outperforms code-cushman-001 on many languages. On the data science benchmark DS-1000, it clearly beats code-cushman-001 as well as all other open-access models. But let’s see what else the model can do besides code completion!
Tech Assistant
Through these exhaustive evaluations, we found that StarCoder is very capable at writing code. But we also wanted to test whether it can be used as a tech assistant; after all, it was trained on a lot of documentation and GitHub issues. Inspired by Anthropic’s HHH prompt, we built a Tech Assistant Prompt. Surprisingly, with just the prompt the model is able to act as a tech assistant and answer programming-related requests!
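A minimal sketch of such a few-shot dialogue prompt; the wording and separators below are hypothetical placeholders, not the actual Tech Assistant Prompt:

```python
def build_assistant_prompt(system, turns, question):
    """Assemble an HHH-style dialogue prompt: a system description,
    example Human/Assistant exchanges, then the new question,
    leaving the prompt open for the model to continue as Assistant."""
    parts = [system]
    for human, assistant in turns:
        parts.append(f"Human: {human}\n\nAssistant: {assistant}")
    parts.append(f"Human: {question}\n\nAssistant:")
    return "\n\n-----\n\n".join(parts)


prompt = build_assistant_prompt(
    system="Below is a conversation between a human and a helpful technical assistant.",
    turns=[("What is a Python list?",
            "A list is a mutable sequence type, e.g. [1, 2, 3].")],
    question="How do I reverse a string in Python?",
)
print(prompt.endswith("Assistant:"))  # → True
```

Because the prompt ends mid-dialogue at `Assistant:`, a plain completion model naturally continues in the assistant role.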
Training data
The model was trained on a subset of The Stack 1.2. The dataset only consists of permissively licensed code and includes an opt-out process so that code contributors can remove their data from the dataset (see Am I in The Stack). In collaboration with Toloka, we removed Personally Identifiable Information from the training data, such as names, passwords, and email addresses.
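The actual redaction pipeline relies on a trained detector (StarPII, listed below); purely for illustration, a simplified regex-based sketch that handles only email addresses might look like:

```python
import re

# Illustrative only: the real pipeline uses a trained NER model,
# which also covers names, passwords, and harder-to-match PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


print(redact_emails("Contact jane.doe@example.com for access."))
# → Contact <EMAIL> for access.
```

Replacing matches with a fixed placeholder, rather than deleting them, keeps the surrounding code syntactically intact for training.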
About BigCode
BigCode is an open scientific collaboration led jointly by Hugging Face and ServiceNow that works on the responsible development of large language models for code.
Additional releases
Alongside the model, we are releasing the following resources and demos:
- the model weights, including intermediate checkpoints, under the OpenRAIL license
- all code for data preprocessing and training under the Apache 2.0 license
- a comprehensive evaluation harness for code models
- a new PII dataset for training and evaluating PII removal
- the fully preprocessed dataset used for training
- a code attribution tool for locating generated code in the dataset
Links
Models
- Paper: A technical report about StarCoder.
- GitHub: All you need to know about using or fine-tuning StarCoder.
- StarCoder: StarCoderBase further trained on Python.
- StarCoderBase: Trained on 80+ languages from The Stack.
- StarEncoder: Encoder model trained on The Stack.
- StarPii: StarEncoder-based PII detector.
Tools & Demos
Data & Governance
You can find all the resources and links at huggingface.co/bigcode!