Meet Koala: Berkeley's LLaMA-Based Model Fine-Tuned with ChatGPT Dialogues


Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

The accidental leak of the weights of Meta AI's LLM LLaMA has sparked an amazing level of innovation in the open-source LLM space. Since the leak, we've seen models like Alpaca, Vicuna, ChatLLaMA, and several others build on the foundations of LLaMA to create conversational agents that approach the capabilities of ChatGPT. One of the newest additions to the list is Koala (yes, I know, another animal-named model), a chatbot created by Berkeley AI Research (BAIR) that fine-tunes LLaMA on conversations gathered from the web.

The core goal of Koala is to highlight the trade-off between mega-large closed-source LLMs and smaller, open-source LLMs. BAIR's thesis is that smaller models can achieve performance matching mega models like ChatGPT at a fraction of the cost, while also improving in areas such as fine-tuning, transparency, and many others.

Koala is a version of LLaMA fine-tuned on dialogue data scraped from the web and public datasets, including high-quality responses to user queries from other large language models, as well as question-answering and human-feedback datasets. Koala was specifically trained on interaction data scraped from the web, with a focus on data that includes interactions with highly capable closed-source models such as ChatGPT. The resulting model, Koala-13B, demonstrates competitive performance against existing models based on human evaluation of real-world user prompts.

Image Credit: BAIR

The results suggest that using high-quality datasets can overcome some of the limitations of smaller models and may eventually even match the capabilities of large closed-source models. The research team recommends that the community prioritize curating high-quality datasets, as this may enable the creation of safer, more factual, and more capable models than simply increasing the size of existing systems.

One of the most interesting aspects of Koala is the data sources used for training. The fine-tuning datasets include data curated from ChatGPT dialogues. The fine-tuning strategy drew on the following datasets:

Around 60K dialogues shared by users on ShareGPT were collected through public APIs. To ensure data quality, the team deduplicated at the user-query level and removed non-English conversations, leaving a dataset of roughly 30K examples.

The team used the human and ChatGPT responses from the HC3 English dataset, which includes roughly 60K human answers and 27K ChatGPT answers to roughly 24K questions, for a total of around 87K question-answer examples.

A subset of components from the Open Instruction Generalist (OIG) dataset curated by LAION was used, including the grade-school-math-instructions, poetry-to-songs, and plot-screenplay-books-dialogue datasets, totaling around 30K examples.

The team included the dataset used to train the Stanford Alpaca model, which contains roughly 52K examples generated by OpenAI's text-davinci-003 through the self-instruct process. It's worth noting that the HC3, OIG, and Alpaca datasets are single-turn question answering, while the ShareGPT dataset consists of dialogue conversations.

The team utilized the Anthropic HH dataset, which contains around 160K human-rated examples. Each example consists of a pair of responses from a chatbot, one of which is preferred by humans. The dataset provides both capabilities and additional safety protections for the model.

The OpenAI WebGPT dataset includes roughly 20K comparisons, where each example contains a question, a pair of model answers, and metadata. The answers are rated by humans with a preference score.

The OpenAI summarization dataset contains roughly 93K examples, each consisting of human feedback on summaries generated by a model. Human evaluators chose the superior summary from two options.
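The ShareGPT cleaning step mentioned above, deduplicating at the user-query level and dropping non-English conversations, can be sketched roughly as follows. Note that the conversation format and the `is_english` heuristic here are illustrative assumptions, not BAIR's actual preprocessing code:

```python
# Illustrative sketch of the ShareGPT cleaning step: deduplicate
# conversations by their first user query and drop non-English ones.
# The data format and language check are assumptions for illustration.

def is_english(text: str) -> bool:
    # Crude heuristic: mostly-ASCII text is treated as English.
    # A real pipeline would use a language-detection library.
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return ascii_chars / max(len(text), 1) > 0.9

def clean_sharegpt(conversations):
    """Each conversation is a list of {'role': ..., 'text': ...} turns."""
    seen_queries = set()
    cleaned = []
    for conv in conversations:
        user_turns = [t["text"] for t in conv if t["role"] == "user"]
        if not user_turns:
            continue
        key = user_turns[0].strip().lower()  # dedup at the user-query level
        if key in seen_queries:
            continue
        if not all(is_english(t) for t in user_turns):
            continue
        seen_queries.add(key)
        cleaned.append(conv)
    return cleaned

raw = [
    [{"role": "user", "text": "What is JAX?"},
     {"role": "assistant", "text": "A numerical computing library."}],
    [{"role": "user", "text": "what is jax? "},          # duplicate query
     {"role": "assistant", "text": "A library."}],
    [{"role": "user", "text": "Привет, как дела?"},       # non-English
     {"role": "assistant", "text": "..."}],
]
print(len(clean_sharegpt(raw)))  # → 1
```

Roughly this kind of filtering is what halves the 60K raw dialogues to the ~30K examples reported above.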

A comparison between Koala, ChatGPT, and open-source models like Alpaca can be seen in the following matrix:

Image Credit: BAIR

One of the key contributions of the Koala research is the open-source release of EasyLM, the framework used to fine-tune the model. Conceptually, EasyLM is a solution designed to pre-train, fine-tune, evaluate, and serve LLMs in JAX/Flax. Leveraging JAX's pjit functionality, EasyLM can scale LLM training to hundreds of TPU/GPU accelerators.

EasyLM is built on top of Hugging Face's transformers and datasets libraries, providing a user-friendly and customizable codebase for training large language models without the complexity of many other frameworks. By utilizing JAX's pjit utility, EasyLM can train large models that don't fit on a single accelerator by sharding the model weights and training data across multiple accelerators. Currently, EasyLM supports multi-TPU/GPU training on a single host as well as multi-host training on Google Cloud TPU Pods.
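The core idea behind this kind of sharding, partitioning a weight matrix across devices so that no single accelerator has to hold it whole, can be illustrated with a toy numpy sketch. This is a conceptual illustration only; in EasyLM the partitioning is declared to pjit and executed by JAX's XLA runtime rather than written out by hand:

```python
import numpy as np

# Toy illustration of weight sharding: split a matrix column-wise across
# "devices", compute each shard's partial output locally, then concatenate.
# Real pjit sharding is declarative and handled by the XLA runtime.

def shard_columns(weights: np.ndarray, n_devices: int):
    """Partition a (d_in, d_out) weight matrix column-wise across devices."""
    return np.split(weights, n_devices, axis=1)

def sharded_matmul(x: np.ndarray, shards):
    """Each 'device' multiplies by its local shard; outputs are concatenated."""
    partials = [x @ w for w in shards]  # one smaller matmul per device
    return np.concatenate(partials, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
w = rng.normal(size=(16, 8))

shards = shard_columns(w, n_devices=4)   # each device holds a (16, 2) shard
out = sharded_matmul(x, shards)
assert np.allclose(out, x @ w)           # same result as the unsharded matmul
```

Each "device" only ever stores a quarter of the weight matrix, which is what lets models larger than a single accelerator's memory be trained at all.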

Koala, by contrast, was trained on a single Nvidia DGX server equipped with 8 A100 GPUs. The training run took roughly 6 hours for two epochs. A training run of this type typically costs less than $100 on public cloud platforms using preemptible instances.

The open-source release of Koala is accompanied by an online demo and the code for preprocessing the training data.

Koala represents an interesting iteration of the LLaMA models, one that sheds some light on the viability of smaller open-source alternatives to ChatGPT-like models.


What are your thoughts on this topic?
Let us know in the comments below.

