Meet Vicuna: The Latest Meta LLaMA-Based Model that Matches ChatGPT Performance
The Architecture
Evaluation
Testing Vicuna

Created Using Midjourney

I recently started an AI-focused educational newsletter that already has over 150,000 subscribers. TheSequence is a no-BS (meaning no hype, no news, etc.) ML-oriented newsletter that takes 5 minutes to read. The goal is to keep you up to date with machine learning projects, research papers, and concepts. Please give it a try by subscribing below:

Since its release, Meta AI's LLaMA has become the foundation for all sorts of conversational AI models. Stanford's Alpaca and Databricks' Dolly are among the latest foundational models built on top of LLaMA. All of them seem to have names related to… well… llamas. The most recent addition to the list is Vicuna, a collaboration between researchers from UC Berkeley, CMU, Stanford, and UC San Diego.

Vicuna-13B is a new open-source chatbot developed to address the lack of training and architecture details in existing large language models (LLMs) such as OpenAI's ChatGPT. Vicuna-13B is trained by fine-tuning a LLaMA base model on roughly 70,000 user-shared conversations gathered from ShareGPT.com, resulting in an enhanced dataset. A preliminary evaluation of Vicuna-13B using GPT-4 as a judge shows that it achieves over 90% of the quality of OpenAI's ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases.

Vicuna is an open-source chatbot fine-tuned from a LLaMA base model on roughly 70,000 user-shared conversations collected from ShareGPT.com with public APIs. To ensure data quality, the research team converted the HTML back to markdown and filtered out inappropriate or low-quality samples. They also divided lengthy conversations into smaller segments that fit the model's maximum context length.

Image Credit: https://vicuna.lmsys.org/
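The full cleaning pipeline isn't spelled out in the post, but a minimal sketch of that kind of preprocessing could look like the following. The library choice (html2text), the quality filter, and the greedy splitting heuristic are assumptions for illustration, not the team's actual code:

```python
# Hypothetical preprocessing sketch: convert ShareGPT HTML transcripts back
# to markdown, drop low-quality turns, and split long conversations into
# segments that fit the model's maximum context length. The library choice
# (html2text), the filter, and the splitting heuristic are all assumptions.
import html2text

MAX_CONTEXT = 2048  # target maximum context length in tokens

converter = html2text.HTML2Text()
converter.ignore_links = False

def html_to_markdown(html: str) -> str:
    """Convert an HTML transcript to markdown text."""
    return converter.handle(html).strip()

def is_low_quality(turn: str) -> bool:
    """Toy filter: treat empty or near-empty turns as low quality."""
    return len(turn.split()) < 3

def split_conversation(turns, tokenizer, max_len=MAX_CONTEXT):
    """Greedily pack consecutive turns into segments of at most max_len tokens."""
    segments, current, current_len = [], [], 0
    for turn in turns:
        n_tokens = len(tokenizer.encode(turn))
        if current and current_len + n_tokens > max_len:
            segments.append(current)
            current, current_len = [], 0
        current.append(turn)
        current_len += n_tokens
    if current:
        segments.append(current)
    return segments
```

A real pipeline would also deduplicate conversations and handle individual turns that exceed the context window on their own.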

The research team built on Stanford's Alpaca training recipe and optimized Vicuna's performance with several key improvements, including:

The team expanded the maximum context length from 512 in Alpaca to 2,048 to enable a better understanding of long conversations. However, this substantially increased GPU memory requirements, so the team used gradient checkpointing and flash attention to relieve the memory pressure.

The team adjusted the training loss to account for multi-round conversations and computed the fine-tuning loss only on the chatbot's output (a rough sketch of this follows below).

With a dataset 40x larger and sequences 4x longer than Alpaca's, training expenses posed a substantial challenge. To reduce costs, the team used SkyPilot managed spot jobs to leverage cheaper spot instances, with auto-recovery from preemptions and automatic zone switching.

These optimizations contribute to Vicuna's ability to understand and respond to complex conversations, while the cost-reduction strategies make it an affordable option for researchers and developers looking to build chatbot systems.
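The training code isn't reproduced in the post, but the loss-masking adjustment described above can be illustrated with a short, hypothetical sketch that assumes a Hugging Face-style tokenizer, an unofficial LLaMA checkpoint identifier, and the usual convention of marking ignored label positions with -100:

```python
# Hypothetical sketch of multi-round loss masking: the fine-tuning loss is
# computed only on the assistant's tokens, while the user's tokens are
# masked with -100 (the ignore index used by PyTorch's cross-entropy loss).
# The model identifier and conversation format are illustrative assumptions.
from transformers import AutoTokenizer

IGNORE_INDEX = -100
MAX_LEN = 2048  # Vicuna's expanded context length

def build_example(conversation, tokenizer, max_len=MAX_LEN):
    """conversation: list of (role, text) pairs, e.g. ("user", "Hi there")."""
    input_ids, labels = [], []
    for role, text in conversation:
        ids = tokenizer.encode(text + tokenizer.eos_token,
                               add_special_tokens=False)
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # train on these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # no loss on user turns
    return {"input_ids": input_ids[:max_len], "labels": labels[:max_len]}

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-13b")  # assumed mirror
example = build_example(
    [("user", "Summarize the plot of Hamlet in two sentences."),
     ("assistant", "Prince Hamlet seeks revenge for his father's murder...")],
    tokenizer,
)

# To fit the longer context into GPU memory, Hugging Face models can also
# enable activation recomputation via model.gradient_checkpointing_enable().
```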

To train Vicuna, the research team collected around 70,000 conversations from ShareGPT.com, a website where users can share their ChatGPT conversations. They then enhanced the training scripts provided by Alpaca to better handle multi-round conversations and long sequences. The team used PyTorch FSDP on 8 A100 GPUs to train Vicuna in just one day.
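A bare-bones illustration of that setup, with the model identifier, optimizer, and launch command as assumptions rather than the team's actual script, might look like this:

```python
# Minimal FSDP training sketch, intended to be launched with something like
# torchrun --nproc_per_node=8 train.py on a single node with 8 A100 GPUs.
# The model identifier, learning rate, and training loop are assumptions.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

def main():
    dist.init_process_group("nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = AutoModelForCausalLM.from_pretrained("huggyllama/llama-13b")
    model = FSDP(model.cuda())  # shards parameters, gradients, and optimizer state

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    # ... iterate over the ShareGPT segments, build masked labels as in the
    # earlier sketch, then run loss.backward() and optimizer.step() ...

if __name__ == "__main__":
    main()
```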

To serve the demo, the team implemented a lightweight distributed serving system capable of serving multiple models with distributed workers. The system supports flexible plug-in of GPU workers from both on-premise clusters and the cloud. The team used a fault-tolerant controller and the managed spot features in SkyPilot to reduce serving costs by leveraging cheaper spot instances from multiple clouds.
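The actual serving code lives in the team's repository; the controller/worker pattern described above can be sketched, purely as an illustration, with a hypothetical Flask controller that registers GPU workers and forwards requests to them:

```python
# Purely illustrative controller: GPU workers register their URLs, and the
# controller forwards chat requests to workers in round-robin order. This
# mirrors the controller/worker idea, not the team's actual serving system.
import itertools
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
workers = []          # URLs of registered model workers
dispatcher = None     # round-robin iterator over workers

@app.route("/register", methods=["POST"])
def register_worker():
    global dispatcher
    workers.append(request.json["worker_url"])
    dispatcher = itertools.cycle(workers)
    return jsonify({"status": "ok", "num_workers": len(workers)})

@app.route("/chat", methods=["POST"])
def chat():
    if not workers:
        return jsonify({"error": "no workers available"}), 503
    worker_url = next(dispatcher)
    # The worker holds the actual model and generates the reply.
    reply = requests.post(f"{worker_url}/generate", json=request.json, timeout=120)
    return jsonify(reply.json())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```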

Evaluating AI chatbots is a difficult task because it requires assessing language understanding, reasoning, and context awareness. As AI chatbots become more advanced, current open benchmarks may no longer be sufficient. For instance, the evaluation dataset used in Stanford's Alpaca, self-instruct, can be answered effectively by state-of-the-art chatbots, making it difficult for humans to discern differences in performance. In addition, creating new benchmarks can be costly, and there may be issues with training/test data contamination.

To address these issues, the research team proposes an evaluation framework based on GPT-4 to automate chatbot performance assessment. The framework consists of eight question categories, including Fermi problems, roleplay scenarios, and coding/math tasks, designed to test various facets of a chatbot's performance. Through careful prompt engineering, GPT-4 generates diverse and challenging questions that baseline models struggle with. The team selects ten questions per category and collects answers from five chatbots: LLaMA, Alpaca, ChatGPT, Bard, and Vicuna.

The team then asks GPT-4 to rate the quality of the chatbots' answers based on helpfulness, relevance, accuracy, and level of detail. GPT-4 produces relatively consistent scores and provides detailed explanations of why such scores were given. However, the team notes that GPT-4 is not very good at judging coding/math tasks.

Image Credit: https://vicuna.lmsys.org/
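As a rough illustration of the GPT-4-as-judge setup, the following hypothetical sketch asks GPT-4 to compare two answers along the criteria above using the OpenAI Python client; the prompt wording and output format are assumptions, not the team's published evaluation prompt:

```python
# Hypothetical GPT-4-as-judge sketch: ask GPT-4 to score two assistants'
# answers to the same question on helpfulness, relevance, accuracy, and
# level of detail. Prompt wording and output format are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge of AI assistants.

Question: {question}

Assistant A's answer:
{answer_a}

Assistant B's answer:
{answer_b}

Rate each answer from 1 to 10 for helpfulness, relevance, accuracy, and
level of detail. Reply with two overall scores in the form "A: x, B: y",
followed by a short explanation of your reasoning."""

def judge(question: str, answer_a: str, answer_b: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # keep the scoring as repeatable as possible
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(
                question=question, answer_a=answer_a, answer_b=answer_b),
        }],
    )
    return response.choices[0].message.content
```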

In addition to the source code, the research team published a demo of Vicuna-13B at https://chat.lmsys.org/

Image Credit: https://vicuna.lmsys.org/

Overall, this evaluation framework offers a promising approach to assessing chatbot performance in a consistent and automated manner. The team's use of diverse question categories and careful prompt engineering highlights the potential for this framework to uncover differences in chatbot performance that may not be easily discernible through human evaluation.
