Zephyr-7B: Hugging Face’s Hyper-Optimized LLM Built on Top of Mistral 7B

admin

November 25, 2023

Introduction

The evolution of open large language models (LLMs) has significantly impacted the AI research community, particularly in the development of chatbots and similar applications. Following the release of models like LLaMA, there has been a surge of research on efficient fine-tuning, long-prompt handling, retrieval-augmented generation (RAG), and quantization.

The LLaMA model, for instance, marked a new era in fine-tuning and prompt contextualization, paving the way for subsequent models like MosaicML’s MPT, Together AI’s RedPajama-INCITE, TII’s Falcon, and Meta’s Llama 2. Each of these models contributes unique capabilities, extending the overall functionality and scope of LLMs.

Mistral AI, a Paris-based startup founded by former Google DeepMind and Meta employees, has made a name for itself with its first offering: Mistral 7B.

Mistral 7B’s edge lies in its efficiency: it delivers similar or better capabilities than peers like Llama 2 with less computational demand.

Mistral 7B Instruct, the variant tuned for instruction-following tasks, shines on platforms like Hugging Face, where it surpasses other models of the same size and competes closely with models that have nearly double its parameters.

Building on this, Hugging Face introduced Zephyr 7B Alpha, showing that a fine-tuned Mistral 7B can indeed surpass the abilities of significantly larger chat models and, on some tasks, even rival GPT-4. The “Alpha” was just the start, as Zephyr 7B Beta followed shortly after.

This article explores how Zephyr 7B leverages the power of larger models to improve its ability to respond and align with human instructions, a process made possible by the technique of knowledge distillation. This method trains smaller models on the complex patterns learned by larger ones, reducing training demands without sacrificing language modeling capabilities. We’ll delve into the specifics of Hugging Face’s knowledge distillation approach.

Knowledge distillation

A key innovation in developing models like Zephyr-7B is distilled supervised fine-tuning (dSFT). This method uses the output of a larger, more capable “teacher” model to train a smaller “student” model, improving its accuracy. While distillation improves open models on various tasks, a performance gap relative to the teacher models still exists.

Knowledge distillation is a machine learning technique in which a compact model, known as the “student,” is taught to replicate the performance of a larger, more complex “teacher” model. The technique enables the student to perform tasks that were previously beyond its capability by transferring the intricate patterns learned by the teacher.

Knowledge Distillation | Teacher-Student Model

The student model trains on the output probabilities or features generated by the teacher model, focusing on matching these outputs rather than only the final predictions. This allows the student to learn the nuanced decision-making of the teacher, often leading to better performance than training on the ground-truth data alone.
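To make the idea of matching output probabilities concrete, here is a minimal sketch of a classic logit-based distillation loss in plain Python. It assumes the student and teacher each produce a list of logits over the same classes. The function names and the temperature value are illustrative choices, not something taken from the Zephyr work, which distills through teacher-generated text rather than logits.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the conventional scaling that keeps gradient
    magnitudes comparable across temperatures (an illustrative choice here).
    """
    p = softmax(teacher_logits, temperature)  # teacher "soft targets"
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

if __name__ == "__main__":
    teacher = [1.0, 2.0, 3.0]
    print(distillation_loss([1.0, 2.0, 3.0], teacher))  # student matches teacher
    print(distillation_loss([3.0, 2.0, 1.0], teacher))  # student disagrees
```

The loss is zero when the student reproduces the teacher’s distribution and grows as the two diverge, which is what pushes the student toward the teacher’s full output distribution rather than only its top prediction.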

Historically, knowledge distillation has been used in models like Hinton’s original distillation networks, and more recently in NLP with models such as DistilBERT, which distilled BERT into a smaller, faster version that retains most of the original’s language understanding capabilities. Another example is TinyBERT, which goes further in optimizing size and speed for mobile and edge devices.

In the case of Zephyr-7B, knowledge distillation is used to give a smaller 7B-parameter model the capabilities of its larger counterparts. As a result, Zephyr-7B strikes a balance between performance and efficiency, making it suitable for environments where computational resources are limited, without sacrificing the quality of interaction and understanding.

In developing Zephyr-7B, researchers tackled the challenge of aligning a small open LLM entirely through distillation. They introduced an approach called distilled direct preference optimization (dDPO), which uses AI feedback (AIF) from an ensemble of teacher models as preference data. Because it requires no human annotation, this method significantly reduces the time and resources needed for model training.

Building ZEPHYR-7B

To validate dDPO, researchers built ZEPHYR-7B, an aligned version of the Mistral-7B model. The process involved three steps:

dSFT using the UltraChat dataset: Distilled supervised fine-tuning (dSFT) trains an LLM on the output of a larger, more capable “teacher” model. It begins with a raw LLM, which is trained to respond to user prompts. Unlike traditional supervised fine-tuning (SFT), which uses a fixed dataset, dSFT takes a dynamic approach in which the teacher model generates both the instructions and the responses. This method, known as self-instruct, uses the teacher model both to answer instructions and to refine them based on its responses. The process starts with a set of seed prompts (x⁰₁, x⁰₂, …, x⁰_J) representing diverse topics. Each prompt is refined iteratively: for a given prompt x⁰, the teacher model generates a response y⁰, and a new instruction x¹ is then sampled based on x⁰ and y⁰. The final dataset C = {(x₁, y₁), …, (x_J, y_J)} is used to fine-tune the model.
Incorporating AI feedback data from UltraFeedback: This data was crucial for refining the model’s responses. In this step, several models generate responses to each prompt (such as describing how to make chocolate brownies), and the responses are then scored by a more advanced model such as GPT-4. For each prompt x, the highest-scoring response (y_w) and a randomly chosen lower-scoring response (y_l) form a triple (x, y_w, y_l) in the feedback dataset D.
Applying dDPO: The last phase, distilled direct preference optimization (dDPO), refines the dSFT model by maximizing the probability of ranking the preferred responses above the rejected ones. This is achieved by using a reward function r_θ(x, y) within the preference model, which is expressed in terms of the optimal LLM policy π* and the original policy π_dSFT. The optimization objective is π_θ = max_π E_{(x, y_w, y_l) ∼ D} log σ( β log [π(y_w|x) / π_dSFT(y_w|x)] − β log [π(y_l|x) / π_dSFT(y_l|x)] ), which simplifies training: it starts from the dSFT version of the model and iterates through each AIF triple (x, y_w, y_l).
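The second and third steps above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors’ implementation: `build_triple`, `log_sigmoid`, and `ddpo_loss` are hypothetical helper names, the log-probabilities are assumed to be sequence-level values log π(y|x) computed elsewhere, and `beta=0.1` is only an illustrative default.

```python
import math
import random

def build_triple(prompt, scored_responses, rng=random):
    """Turn scored responses into one (x, y_w, y_l) preference triple.

    scored_responses: list of (response_text, score) pairs with at least
    two entries. The top-scoring response becomes y_w; y_l is drawn at
    random from the remaining, lower-ranked responses.
    """
    ranked = sorted(scored_responses, key=lambda r: r[1], reverse=True)
    y_w = ranked[0][0]
    y_l = rng.choice(ranked[1:])[0]
    return prompt, y_w, y_l

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def ddpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Negative of the per-triple dDPO objective.

    logp_*     : log pi(y|x) under the policy being trained
    ref_logp_* : log pi_dSFT(y|x) under the frozen dSFT model
    """
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    return -log_sigmoid(margin)

if __name__ == "__main__":
    x, y_w, y_l = build_triple(
        "How do I make chocolate brownies?",
        [("response A", 6.5), ("response B", 9.0), ("response C", 4.0)],
    )
    print(y_w, "preferred over", y_l)
    # At initialization the policy equals the dSFT model, so the loss is log(2).
    print(ddpo_loss(-12.0, -15.0, -12.0, -15.0))
```

Minimizing this loss over all triples in D is equivalent to maximizing the objective above: the loss falls as the policy assigns relatively more probability to y_w, and relatively less to y_l, than the dSFT reference does.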

The method used in Zephyr-7B mirrors the process used for InstructGPT.

Remarkably, Zephyr-7B achieves performance comparable to much larger 70B-parameter models aligned with human feedback. It excels in both academic benchmarks and conversational capabilities, highlighting the effectiveness of preference learning in model development. For further exploration, models, code, and instructions can be found at Hugging Face’s GitHub repository.

Addressing the Challenge of Intent Alignment

A notable concern with LLMs has been their alignment with human intent. Earlier models often failed to produce responses that matched user preferences, resulting in inaccurate or irrelevant answers. However, recent benchmarks like MT-Bench and AlpacaEval provide tools to quantify and improve this aspect, and they highlight the superior performance of proprietary models trained with human feedback over models trained solely via distillation.

Evaluation Methods

The evaluation of Zephyr 7B involved rigorous testing on benchmarks that assess a model’s conversational abilities in both single-turn and multi-turn contexts:

MT-Bench: This multi-turn benchmark requires a model to answer 160 questions spanning eight domains. Each response is rated by GPT-4, and the model’s final score is the average over the two turns of questions.
AlpacaEval: In this single-turn benchmark, the model is presented with 805 questions across various subjects. The focus here is on the model’s helpfulness, with GPT-4 judging the responses to determine a comparative win rate.
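As a rough illustration of how these two headline numbers are aggregated, the sketch below assumes the judge’s outputs are already available: a list of ratings per turn for MT-Bench and a list of win/loss outcomes for AlpacaEval. The function names and input shapes are hypothetical and are not taken from either benchmark’s tooling.

```python
from statistics import mean

def mt_bench_score(turn1_ratings, turn2_ratings):
    """Final MT-Bench-style score: the average over the two turns.

    Each argument is a list of judge ratings, one per question, for
    that turn of the conversation.
    """
    return mean([mean(turn1_ratings), mean(turn2_ratings)])

def win_rate(outcomes):
    """AlpacaEval-style win rate in percent.

    outcomes: list of booleans, True where the judge preferred the
    evaluated model's response in a comparison.
    """
    return 100.0 * sum(outcomes) / len(outcomes)

if __name__ == "__main__":
    print(mt_bench_score([8, 6], [7, 5]))        # 6.5
    print(win_rate([True, True, False, True]))   # 75.0
```

Because both metrics depend on a single automated judge, any systematic preference of that judge carries straight through to the reported scores.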

Additionally, Zephyr 7B was tested on the Open LLM Leaderboard, which, while not a direct assessment of conversational skills, offers insight into the model’s reasoning and truthfulness after fine-tuning.

Zephyr 7B was compared with a variety of open and proprietary models of different sizes and alignment methods. It set new state-of-the-art results for 7B models on MT-Bench and AlpacaEval and showed competitive performance against larger models, validating the effectiveness of distilled direct preference optimization (dDPO) in training.

The SFT and DPO training phases were carefully configured, spanning multiple epochs with learning rates and batch sizes tuned for optimal performance. The final Zephyr model proved not only resistant to overfitting but also better at handling practical tasks and academic benchmarks.

Datasets and Results

Datasets Utilized

Performance and Outcomes

The chart below illustrates the performance of Zephyr 7B across various task categories against other models such as GPT-3.5-turbo, Claude 1, GPT-4, and Llama-2-70b-chat. The categories include Writing, Humanities, Roleplay, Reasoning, STEM, Extraction, Coding, and Math.

From the chart, we can infer which domains Zephyr 7B excels in and which might need further improvement. For instance, if Zephyr’s line stretches further out on the Writing axis than the others, it suggests that Zephyr is especially strong at generating written content. Conversely, if the line is closer to the center on the Math axis, it may indicate a relative weakness in solving math problems.

The radar chart helps identify the strengths and weaknesses of Zephyr 7B, providing a visual representation of where it stands against larger models like GPT-4 and chat-tuned models like Llama-2-70b-chat.

Model Performance Radar Chart

A comparison of various language models on two benchmarks, MT-Bench and AlpacaEval. The models are listed with their size, alignment method (such as dSFT for distilled supervised fine-tuning or dDPO for distilled direct preference optimization), and performance scores. Zephyr stands out with high scores on both benchmarks, indicating its effectiveness in generating aligned responses.

MT-Bench and AlpacaEval

Conclusion

In conclusion, the development of Zephyr-7B demonstrates that alignment and distillation of conversational capabilities from a large language model (LLM) onto a smaller model can be achieved without relying on sampling-based methods. By employing direct preference optimization (DPO) with AI feedback, Zephyr-7B builds on the strong foundation of Mistral-7B to set a new benchmark for 7B-parameter chat models, showcasing the ability of smaller, open-source models to understand and respond to user intent effectively.

However, this study is not without its limitations. The reliance on GPT-4 as an evaluator for benchmarks introduces a bias towards models that are distilled from it, potentially favoring verbose responses over accurate ones. Moreover, the scalability of this method to larger models, such as LLAMA2-70B, and its impact on performance gains remain areas for further research. These limitations highlight the need for continued innovation and the development of unbiased evaluation methods within the AI community.

Looking beyond the study, it’s evident that the potential for smaller models to perform at the level of larger counterparts can democratize AI, allowing for more accessible and efficient use in various applications. The success of Zephyr-7B encourages further exploration of open-source models, which can accelerate advances in AI by fostering collaborative research and development.