Meet Vicuna: The Latest LLaMA-Based Model That Approaches ChatGPT Performance


Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Since its release, Meta AI’s LLaMA has become the foundation for all sorts of conversational AI models. Stanford’s Alpaca and Databricks’ Dolly are some of the latest foundation models built on top of LLaMA. All of them seem to have names related to… well… llamas. The newest addition to the list is Vicuna, a collaboration between researchers from UC Berkeley, CMU, Stanford, and UC San Diego.

Vicuna-13B is a new open-source chatbot developed to address the lack of training and architecture details in existing large language models (LLMs) such as OpenAI’s ChatGPT. Vicuna-13B is trained by fine-tuning a LLaMA base model on roughly 70,000 user-shared conversations gathered from ShareGPT, resulting in an enhanced dataset. A preliminary evaluation using GPT-4 as a judge shows that Vicuna-13B achieves over 90% of the quality of OpenAI’s ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases.

Vicuna is an open-source chatbot fine-tuned from a LLaMA base model on roughly 70,000 user-shared conversations collected from ShareGPT via its public APIs. To ensure data quality, the research team converted the HTML back to markdown and filtered out inappropriate or low-quality samples. They also divided lengthy conversations into smaller segments that fit the model’s maximum context length.
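The segmentation step can be sketched in a few lines of Python. This is a simplified illustration, not the team’s actual preprocessing code: `count_tokens` stands in for any tokenizer’s length function, and conversations are assumed to be lists of turn strings.

```python
def split_conversation(turns, count_tokens, max_len=2048):
    """Split a multi-turn conversation into segments that each fit
    within max_len tokens. `count_tokens` is any function returning
    the token count of a turn (a tokenizer in practice)."""
    segments, current, current_len = [], [], 0
    for turn in turns:
        n = count_tokens(turn)
        # Start a new segment if adding this turn would exceed the limit.
        if current and current_len + n > max_len:
            segments.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += n
    if current:
        segments.append(current)
    return segments
```

For a quick check, a whitespace splitter can stand in for a real tokenizer: `split_conversation(turns, lambda t: len(t.split()))`.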


The research team built upon Stanford’s Alpaca training recipe to optimize Vicuna’s performance with several key improvements, including:

The team expanded the maximum context length from 512 in Alpaca to 2048 to enable a better understanding of long conversations. However, this substantially increased GPU memory requirements, so the team used gradient checkpointing and flash attention to tackle the memory pressure.
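Gradient checkpointing trades compute for memory: activations are recomputed during the backward pass instead of being stored. A minimal PyTorch sketch of the idea (the wrapper class here is illustrative, not Vicuna’s actual training code):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """Wraps a layer so its activations are recomputed during the
    backward pass instead of being stored, lowering peak GPU memory
    at the cost of extra forward compute."""
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, x):
        # use_reentrant=False selects the non-reentrant checkpoint variant.
        return checkpoint(self.block, x, use_reentrant=False)
```

In a real transformer, each decoder layer would be wrapped this way so that only layer boundaries keep activations in memory.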

The team adjusted the training loss to account for multi-round conversations and computed the fine-tuning loss solely on the chatbot’s output.
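Computing the loss only on the chatbot’s output is typically done by masking the labels of user tokens. A hedged sketch, using the conventional ignore index of -100 that cross-entropy implementations skip (the function and its inputs are illustrative):

```python
IGNORE = -100  # label value skipped by typical cross-entropy losses

def mask_labels(token_ids, roles):
    """Build fine-tuning labels that only supervise the assistant's
    own tokens: positions belonging to user turns are set to IGNORE
    so they contribute nothing to the loss."""
    return [tok if role == "assistant" else IGNORE
            for tok, role in zip(token_ids, roles)]
```

With these labels, a standard language-modeling loss trains the model to produce the chatbot’s replies without also learning to imitate the user.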

With a 40x larger dataset and 4x the sequence length, training expenses posed a substantial challenge. To reduce the cost, the team employed SkyPilot managed spot jobs to leverage cheaper spot instances, with auto-recovery for preemptions and automatic zone switching.

These optimizations contribute to Vicuna’s ability to understand and respond to complex conversations, while the cost reduction strategies make it an affordable option for researchers and developers looking to build chatbot systems.

To train Vicuna, the research team collected around 70,000 conversations from ShareGPT, a website where users can share their ChatGPT conversations. They then enhanced the training scripts provided by Alpaca to better handle multi-round conversations and long sequences. The team used PyTorch FSDP on 8 A100 GPUs to train Vicuna in just one day.

To serve the demo, the team implemented a lightweight distributed serving system capable of serving multiple models with distributed workers. The system supports flexible plug-in of GPU workers from both on-premise clusters and the cloud. The team used SkyPilot’s fault-tolerant controller and managed spot features to reduce serving costs by leveraging cheaper spot instances across multiple clouds.

Evaluating AI chatbots can be a difficult task because it requires assessing language understanding, reasoning, and context awareness. As AI chatbots become more advanced, current open benchmarks may no longer be sufficient. For instance, the evaluation dataset used in Stanford’s Alpaca, self-instruct, can be effectively answered by state-of-the-art chatbots, making it difficult for humans to discern differences in performance. In addition, creating new benchmarks can be costly, and there may be issues with training/test data contamination.

To address these issues, the research team proposes an evaluation framework based on GPT-4 to automate chatbot performance assessment. The framework consists of eight question categories, including Fermi problems, roleplay scenarios, and coding/math tasks, designed to test various aspects of a chatbot’s performance. Through careful prompt engineering, GPT-4 generates diverse and challenging questions that baseline models struggle with. The team selects ten questions per category and collects answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna.

The team then asks GPT-4 to rate the quality of the chatbots’ answers based on helpfulness, relevance, accuracy, and level of detail. GPT-4 produces relatively consistent scores and detailed explanations of why those scores were given. However, the team notes that GPT-4 is not very good at judging coding/math tasks.
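A judging prompt in this spirit might look like the following. The wording is illustrative, not the team’s exact prompt; the function only builds the text that would be sent to GPT-4, leaving the API call out:

```python
def build_judge_prompt(question, answer_a, answer_b):
    """Construct a pairwise-judging prompt asking a strong model to
    score two answers on helpfulness, relevance, accuracy, and detail.
    (Illustrative wording, not the research team's actual prompt.)"""
    return (
        "You are a helpful and precise assistant for checking the "
        "quality of two AI assistants' answers.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant 1]\n{answer_a}\n\n"
        f"[Assistant 2]\n{answer_b}\n\n"
        "Rate the helpfulness, relevance, accuracy, and level of detail "
        "of each answer on a scale of 1 to 10. Output the two scores on "
        "the first line, then explain your reasoning."
    )
```

Asking for scores first and reasoning after makes the numeric ratings easy to parse automatically across many question/answer pairs.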


Along with the source code, the research team published a demo of Vicuna-13B.


Overall, this evaluation framework offers a promising approach to assessing chatbot performance in a consistent and automated manner. The team’s use of diverse question categories and careful prompt engineering highlights the potential for this framework to uncover differences in chatbot performance that may not be easily discernible through human evaluation.


What are your thoughts on this topic?
Let us know in the comments below.
