StarCoder and StarCoderBase are Large Language Models for Code (Code LLMs) trained on permissively licensed data from GitHub, including 80+ programming languages, Git commits, GitHub issues, and Jupyter notebooks. Similar to LLaMA, we trained a ~15B parameter model on 1 trillion tokens. We fine-tuned the StarCoderBase model on 35B Python tokens, resulting in a new model that we call StarCoder.
We found that StarCoderBase outperforms existing open Code LLMs on popular programming benchmarks and matches or surpasses closed models such as code-cushman-001 from OpenAI (the original Codex model that powered early versions of GitHub Copilot). With a context length of over 8,000 tokens, the StarCoder models can process more input than any other open LLM, enabling a wide range of interesting applications. For example, by prompting the StarCoder models with a series of dialogues, we enabled them to act as a technical assistant. In addition, the models can be used to autocomplete code, make modifications to code via instructions, and explain a code snippet in natural language.
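Autocompletion is not limited to left-to-right continuation: the models can also fill in a span given the code before and after it. A minimal sketch of assembling such a prompt, assuming the `<fim_prefix>`/`<fim_suffix>`/`<fim_middle>` special tokens used by the BigCode models:

```python
def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt: the model is asked to
    generate the span that belongs between prefix and suffix."""
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"


# The model's completion would be inserted at the cursor position,
# i.e. between the two arguments of print(... ).
prompt = build_fim_prompt(
    prefix="def hello():\n    print(",
    suffix=")\n",
)
print(prompt)
```

The assembled string is then passed to the tokenizer and model like any other prompt; generation stops when the model emits its end-of-middle token.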
We take several important steps towards a safe open model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make StarCoder publicly available
under an improved version of the OpenRAIL license. The updated license simplifies the process for companies to integrate the model into their products. We believe that with their strong performance, the StarCoder models will serve as a solid foundation for the community to use and adapt to their use cases and products.
Evaluation
We thoroughly evaluated StarCoder and several similar models on a variety of benchmarks. A popular Python benchmark is HumanEval, which tests whether the model can complete functions based on their signature and docstring. We found that both StarCoder and StarCoderBase outperform the largest models, including PaLM, LaMDA, and LLaMA, despite being significantly smaller. They also outperform CodeGen-16B-Mono and OpenAI’s code-cushman-001 (12B) model. One failure case we noticed was that the model would produce `# Solution here` code, probably because that type of code is usually part of exercises. To force the model to generate an actual solution, we added a prompt. This significantly increased the HumanEval score of StarCoder from 34% to over 40%, setting a new state-of-the-art result for open models. We also tried this prompt for CodeGen and StarCoderBase but didn’t observe much difference.
| Model | HumanEval | MBPP |
|---|---|---|
| LLaMA-7B | 10.5 | 17.7 |
| LaMDA-137B | 14.0 | 14.8 |
| LLaMA-13B | 15.8 | 22.0 |
| CodeGen-16B-Multi | 18.3 | 20.9 |
| LLaMA-33B | 21.7 | 30.2 |
| CodeGeeX | 22.9 | 24.4 |
| LLaMA-65B | 23.7 | 37.7 |
| PaLM-540B | 26.2 | 36.8 |
| CodeGen-16B-Mono | 29.3 | 35.3 |
| StarCoderBase | 30.4 | 49.0 |
| code-cushman-001 | 33.5 | 45.9 |
| StarCoder | 33.6 | 52.7 |
| StarCoder-Prompted | 40.8 | 49.5 |
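The HumanEval and MBPP numbers in the table are pass@1 percentages. For reference, the standard unbiased pass@k estimator introduced with Codex can be sketched as:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the Codex paper:
    1 - C(n-c, k) / C(n, k), where n samples were drawn per
    problem and c of them passed the unit tests."""
    if n - c < k:
        # Fewer failures than k: any k-subset contains a passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 200 samples per problem, 67 of which pass the tests.
print(round(pass_at_k(n=200, c=67, k=1), 3))  # → 0.335
```

The per-problem estimates are then averaged over the benchmark to obtain the reported score.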
An interesting aspect of StarCoder is that it is multilingual, so we evaluated it on MultiPL-E, which extends HumanEval to many other languages. We observed that StarCoder matches or outperforms code-cushman-001 on many languages. On the data science benchmark DS-1000, it clearly beats code-cushman-001 as well as all other open-access models. But let’s see what else the model can do besides code completion!
Tech Assistant
Through these exhaustive evaluations, we found that StarCoder is very capable at writing code. But we also wanted to test whether it can be used as a tech assistant; after all, it was trained on a lot of documentation and GitHub issues. Inspired by Anthropic’s HHH prompt, we built a Tech Assistant Prompt. Surprisingly, with just the prompt the model is able to act as a tech assistant and answer programming-related requests!
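A minimal sketch of such a few-shot dialogue prompt; the wording and separators below are hypothetical placeholders, not the actual Tech Assistant Prompt:

```python
def build_assistant_prompt(system, turns, question):
    """Assemble an HHH-style dialogue prompt: a system description,
    example Human/Assistant exchanges, then the new question,
    leaving the prompt open for the model to continue as Assistant."""
    parts = [system]
    for human, assistant in turns:
        parts.append(f"Human: {human}\n\nAssistant: {assistant}")
    parts.append(f"Human: {question}\n\nAssistant:")
    return "\n\n-----\n\n".join(parts)


prompt = build_assistant_prompt(
    system="Below is a conversation between a human and a helpful technical assistant.",
    turns=[("What is a Python list?",
            "A list is a mutable sequence type, e.g. [1, 2, 3].")],
    question="How do I reverse a string in Python?",
)
print(prompt.endswith("Assistant:"))  # → True
```

Because the prompt ends mid-dialogue at `Assistant:`, a plain completion model naturally continues in the assistant role.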
Training data
The model was trained on a subset of The Stack 1.2. The dataset only consists of permissively licensed code and includes an opt-out process so that code contributors can remove their data from the dataset (see Am I in The Stack). In collaboration with Toloka, we removed Personally Identifiable Information from the training data, such as names, passwords, and email addresses.
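The actual redaction pipeline relies on a trained detector (StarPII, listed below); purely for illustration, a simplified regex-based sketch that handles only email addresses might look like:

```python
import re

# Illustrative only: the real pipeline uses a trained NER model,
# which also covers names, passwords, and harder-to-match PII.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Replace email addresses with a placeholder token."""
    return EMAIL_RE.sub("<EMAIL>", text)


print(redact_emails("Contact jane.doe@example.com for access."))
# → Contact <EMAIL> for access.
```

Replacing matches with a fixed placeholder, rather than deleting them, keeps the surrounding code syntactically intact for training.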
About BigCode
BigCode is an open scientific collaboration led jointly by Hugging Face and ServiceNow that works on the responsible development of large language models for code.
Additional releases
Alongside the model, we are releasing the following resources and demos:
- the model weights, including intermediate checkpoints, under the OpenRAIL license
- all code for data preprocessing and training under the Apache 2.0 license
- a comprehensive evaluation harness for code models
- a new PII dataset for training and evaluating PII removal
- the fully preprocessed dataset used for training
- a code attribution tool for locating generated code in the dataset
Links
Models
- Paper: A technical report about StarCoder.
- GitHub: All you need to know about using or fine-tuning StarCoder.
- StarCoder: StarCoderBase further trained on Python.
- StarCoderBase: Trained on 80+ languages from The Stack.
- StarEncoder: Encoder model trained on The Stack.
- StarPii: StarEncoder-based PII detector.
Tools & Demos
Data & Governance
You can find all the resources and links at huggingface.co/bigcode!