Inside StarCoder: The Latest Open Source LLM that Can Generate Code in Over 80 Programming Languages
The Architecture
Evaluation
Using StarCoder
The Tools

The new project is part of the BigCode initiative by Hugging Face and ServiceNow.

Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Coding is one of the most interesting applications of modern large language models (LLMs). Programming is a significantly more complex problem than other language tasks, given that it involves different types of reasoning. Nevertheless, progress in this area has been clearly visible in the last few years.

GitHub Copilot has become the gold standard for the application of AI to programming, but it is certainly not the only one. Amazon recently entered the race with CodeWhisperer. Salesforce has been very active in the space with solutions such as CodeGen. Most of these solutions have remained closed source. Recently, Hugging Face and ServiceNow announced StarCoder, a new open source LLM for coding that matches the performance of GPT-4. StarCoder is part of a larger collaboration known as the BigCode project.

The BigCode project was initiated as an open scientific initiative with the goal of responsibly developing LLMs for code. Hugging Face and ServiceNow jointly oversee BigCode, which has brought together over 600 members from a wide range of academic institutions and industry labs. The community comprises various working groups that focus on topics such as collecting datasets, implementing fast inference methods, creating an evaluation suite, and establishing ethical best practices for these models. The community previously released The Stack, a 6.4 TB dataset of permissively licensed source code in 384 programming languages, which included 54 GB of GitHub issues and repository-level metadata in the v1.2 version. To help developers identify whether their source code is included in the dataset, The Stack comes with "Am I in The Stack", a governance tool, and an opt-out process for those who wish to have their code removed. In December 2022, the community also released SantaCoder, a 1.1B parameter model trained on the Java, JavaScript, and Python code from The Stack.

StarCoder

The technical report outlines the efforts made to develop StarCoder and StarCoderBase, two 15.5B parameter models trained on permissively licensed data from The Stack. StarCoderBase was trained on over 1 trillion tokens drawn from more than 80 programming languages, GitHub issues, Git commits, and Jupyter notebooks. StarCoderBase was then fine-tuned on an additional 35B Python tokens, resulting in the StarCoder model. Both StarCoder models employ innovative architectural features, such as an 8K context length, infilling capabilities through Fill-in-the-Middle (FIM), and fast large-batch inference using Multi-Query Attention (MQA). The technical report provides a comprehensive evaluation of the StarCoder models and includes a demo with an integrated attribution tool that helps users identify model generations that may have been copied from the training set.

Architecturally, StarCoder is a 15.5B parameter model trained using the same architecture as SantaCoder. It is a decoder-only Transformer that incorporates Fill-in-the-Middle, Multi-Query Attention, and learned absolute positional embeddings. For each training document, FIM was applied as a random transformation of the input sequence, dividing the document uniformly at random into three sections: prefix, middle, and suffix. Each section was prepended with a sentinel token, and the document was rearranged to place the middle section at the end of the sequence. The autoregressive training objective remained unaltered. Context-level FIM was used, and the transformations were applied at the character level.
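As a rough illustration, here is a minimal sketch of that character-level FIM transformation. The sentinel token names match those used by BigCode tokenizers, but the preprocessing logic itself is an assumption, not the actual training code:

import random

# Sentinel tokens from StarCoder's vocabulary; the pipeline below is
# an illustrative assumption, not the actual BigCode preprocessing.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def fim_transform(document: str, fim_rate: float = 0.5) -> str:
    # Only a fraction of documents are transformed; the rest stay left-to-right.
    if len(document) < 2 or random.random() > fim_rate:
        return document
    # Split the document uniformly at random into prefix, middle, and suffix
    # at the character level.
    lo, hi = sorted(random.sample(range(len(document) + 1), 2))
    prefix, middle, suffix = document[:lo], document[lo:hi], document[hi:]
    # Prepend each section with its sentinel token and move the middle to
    # the end of the sequence; the autoregressive objective is unchanged.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"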

An architectural change to the Transformer known as Multi-Query Attention (MQA) was implemented, whereby key and value embeddings are shared across attention heads. This change reduces memory bandwidth demands at generation time and results in faster inference compared to Multi-Head Attention (MHA).

FlashAttention was used to speed up the attention computation and reduce its memory footprint, enabling scaling to a context length of 8K. During training, the key and value tensors were simply expanded across heads before invoking the attention kernel to make FlashAttention work with MQA.
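The PyTorch sketch below illustrates both ideas: all query heads share a single key/value head, and the shared tensors are broadcast across heads before calling a fused kernel (here PyTorch 2.x's scaled_dot_product_attention stands in for FlashAttention). This is an illustrative assumption, not the actual StarCoder implementation:

import torch
import torch.nn.functional as F
from torch import nn

class MultiQueryAttention(nn.Module):
    # Minimal multi-query attention: n_heads query heads, one shared K/V head.
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # A single key/value head instead of one per query head, which
        # shrinks the K/V cache and the memory traffic at generation time.
        self.k_proj = nn.Linear(d_model, self.d_head)
        self.v_proj = nn.Linear(d_model, self.d_head)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        v = self.v_proj(x).view(b, t, 1, self.d_head).transpose(1, 2)
        # Expand (broadcast, no copy) the shared K/V head across all query
        # heads before invoking the fused attention kernel.
        k = k.expand(b, self.n_heads, t, self.d_head)
        v = v.expand(b, self.n_heads, t, self.d_head)
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(y.transpose(1, 2).reshape(b, t, self.n_heads * self.d_head))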

A comprehensive evaluation of StarCoder and various similar models was conducted using a range of benchmarks. One commonly used Python benchmark is HumanEval, which assesses whether a model can complete functions based on their signature and docstring. Both StarCoder and StarCoderBase were found to outperform the largest models, such as PaLM, LaMDA, and LLaMA, despite their significantly smaller size.
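For context, a HumanEval-style task hands the model a signature and docstring like the made-up example below (not an actual benchmark problem), and the generated body is then checked against hidden unit tests:

def is_palindrome(s: str) -> bool:
    """Return True if s reads the same forwards and backwards, ignoring case."""
    # The model must generate the body; a correct completion would be:
    return s.lower() == s.lower()[::-1]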

StarCoder is integrated into Hugging Face's Transformers library. Using the model boils down to a few lines of code:

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
device = "cuda" # for GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# to save memory, consider using fp16 or bf16, e.g. by specifying torch_dtype=torch.float16
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

inputs = tokenizer.encode("def print_hello_world():", return_tensors="pt").to(device)
outputs = model.generate(inputs)
print(tokenizer.decode(outputs[0]))
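Because of the FIM training, the same checkpoint can also infill code instead of only continuing it. The sketch below reuses the model and tokenizer loaded above; the sentinel tokens are part of StarCoder's vocabulary, though the exact prompt format is worth double-checking against the model card:

# Ask the model to fill in the body between the prefix and the suffix.
input_text = "<fim_prefix>def print_hello_world():\n    <fim_suffix>\nprint_hello_world()<fim_middle>"
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))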

Similarly, the model can be easily fine-tuned.
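For example, a minimal causal-language-modeling fine-tune with the Transformers Trainer might look like the sketch below. The dataset file and hyperparameters are placeholders, and in practice a 15.5B parameter model usually calls for parameter-efficient methods (e.g., LoRA) and a multi-GPU setup:

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token  # ensure a pad token exists for batching
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Placeholder corpus: any dataset with a "text" column works here.
dataset = load_dataset("text", data_files={"train": "my_code_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="starcoder-finetuned",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False gives the standard causal (next-token) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()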

Along with the model, Hugging Face and ServiceNow open sourced a series of tools that streamline its adoption.

The Tech Assistant Prompt is a prompt optimized for assisting developers with programming-related tasks. Similarly, the StarCoder Playground allows developers to generate code snippets from natural language inputs. The model also features a VSCode extension that enables its integration into traditional development pipelines. Finally, StarCoder Chat provides a conversational experience around programming-related topics.

StarCoder is one of the most complete coding foundation models ever created and one that can definitely challenge GPT-4.
