Welcome Mixtral – a SOTA Mixture of Experts on Hugging Face




Mixtral 8x7b is an exciting large language model released by Mistral today, which sets a new state-of-the-art for open-access models and outperforms GPT-3.5 across many benchmarks. We're excited to support the launch with a comprehensive integration of Mixtral in the Hugging Face ecosystem 🔥!

Among the features and integrations being released today, we have:

  • Models on the Hub, with their model cards and Apache 2.0 license
  • 🤗 Transformers integration
  • Integration with Inference Endpoints
  • Integration with Text Generation Inference for fast and efficient production-ready inference
  • An example of fine-tuning Mixtral on a single GPU with 🤗 TRL



What’s Mixtral 8x7b?

Mixtral has a similar architecture to Mistral 7B, but comes with a twist: it's actually 8 "expert" models in one, thanks to a technique called Mixture of Experts (MoE). For transformers models, the way this works is by replacing some Feed-Forward layers with a sparse MoE layer. A MoE layer contains a router network to select which experts process which tokens most efficiently. In the case of Mixtral, two experts are selected for each timestep, which allows the model to decode at the speed of a 12B parameter-dense model, despite containing 4x the number of effective parameters!

For more details on MoEs, see our accompanying blog post: hf.co/blog/moe
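To make the routing idea concrete, below is a minimal, simplified sketch of a sparse MoE feed-forward block in PyTorch. This is not Mixtral's actual implementation; the class name, layer sizes, and routing details are illustrative only.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoELayer(nn.Module):
    """Illustrative sparse MoE block: a router picks the top-k experts for each token."""
    def __init__(self, hidden_size=1024, ffn_size=4096, num_experts=8, num_experts_per_tok=2):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, ffn_size), nn.SiLU(), nn.Linear(ffn_size, hidden_size))
            for _ in range(num_experts)
        )
        self.num_experts_per_tok = num_experts_per_tok

    def forward(self, x):  # x: (num_tokens, hidden_size)
        scores = self.router(x)  # (num_tokens, num_experts)
        weights, selected = torch.topk(scores, self.num_experts_per_tok, dim=-1)
        weights = F.softmax(weights, dim=-1)  # normalize over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(selected == i)  # tokens routed to expert i
            if token_idx.numel() > 0:
                out[token_idx] += weights[token_idx, slot, None] * expert(x[token_idx])
        return out

Only the tokens routed to an expert are processed by it, which is why compute scales with the two active experts rather than with all eight.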

Mixtral release TL;DR

  • Release of base and Instruct versions
  • Supports a context length of 32k tokens.
  • Outperforms Llama 2 70B and matches or beats GPT-3.5 on most benchmarks
  • Speaks English, French, German, Spanish, and Italian.
  • Good at coding, with 40.2% on HumanEval
  • Commercially permissive with an Apache 2.0 license

So how good are the Mixtral models? Here's an overview of the base model and its performance compared with other open models on the LLM Leaderboard (higher scores are better):

For instruct and chat models, evaluating on benchmarks like MT-Bench or AlpacaEval is better. Below, we show how Mixtral Instruct performs against the top closed and open-access models (higher scores are better):

Impressively, Mixtral Instruct outperforms all other open-access models on MT-Bench and is the first one to achieve comparable performance with GPT-3.5!



About the name

The Mixtral MoE is called Mixtral-8x7B, but it doesn't have 56B parameters. Shortly after the release, we found that some people were misled into thinking that the model behaves similarly to an ensemble of 8 models with 7B parameters each, but that's not how MoE models work. Only some layers of the model (the feed-forward blocks) are replicated; the rest of the parameters are the same as in a 7B model. The total number of parameters is not 56B, but about 45B. A better name could have been Mixtral-45-8e to better convey the architecture. For more details about how MoE works, please refer to our "Mixture of Experts Explained" post.



Prompt format

The base model has no prompt format. Like other base models, it can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference. It's also a great foundation for fine-tuning for your own use case. The Instruct model has a very simple conversation structure.

<s> [INST] User Instruction 1 [/INST] Model answer 1</s> [INST] User instruction 2 [/INST]

This format has to be exactly reproduced for effective use. We'll show later how easy it is to reproduce the instruct prompt with the chat template available in transformers.
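As a small preview, here is a sketch of how the chat template can build that string for you; it assumes the Instruct tokenizer from the Hub, which ships a chat template.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

chat = [
    {"role": "user", "content": "User Instruction 1"},
    {"role": "assistant", "content": "Model answer 1"},
    {"role": "user", "content": "User instruction 2"},
]

# The template inserts the [INST] ... [/INST] markers and special tokens for you.
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)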



What we do not know

Like the previous Mistral 7B release, there are several open questions about this new series of models. In particular, we have no information about the size of the dataset used for pretraining, its composition, or how it was preprocessed.

Similarly, for the Mixtral instruct model, no details have been shared about the fine-tuning datasets or the hyperparameters associated with SFT and DPO.



Demo

You can chat with the Mixtral Instruct model on Hugging Face Chat! Check it out here: https://huggingface.co/chat/?model=mistralai/Mixtral-8x7B-Instruct-v0.1.



Inference

We provide two main ways to run inference with Mixtral models:

  • Via the pipeline() function of 🤗 Transformers.
  • With Text Generation Inference, which supports advanced features like continuous batching, tensor parallelism, and more, for blazing fast results.

For each method, it is possible to run the model in half-precision (float16) or with quantized weights. Since the Mixtral model is roughly equivalent in size to a 45B parameter dense model, we can estimate the minimum amount of VRAM needed as follows:

Precision | Required VRAM
float16   | >90 GB
8-bit     | >45 GB
4-bit     | >23 GB
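These figures follow from a simple back-of-the-envelope calculation: roughly 45B parameters multiplied by the bytes each parameter occupies at a given precision. The sketch below reproduces the estimate; it ignores activations, the KV cache, and framework overhead, so real usage is somewhat higher.

# Rough VRAM estimate: parameter count x bytes per parameter.
params = 45e9  # ~45B effective parameters

for precision, bytes_per_param in [("float16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{precision}: ~{params * bytes_per_param / 1e9:.1f} GB")
# float16: ~90.0 GB, 8-bit: ~45.0 GB, 4-bit: ~22.5 GB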



Using 🤗 Transformers

With transformers release 4.36, you can use Mixtral and leverage all the tools within the Hugging Face ecosystem, such as:

  • training and inference scripts and examples
  • safe file format (safetensors)
  • integrations with tools such as bitsandbytes (4-bit quantization), PEFT (parameter efficient fine-tuning), and Flash Attention 2
  • utilities and helpers to run generation with the model
  • mechanisms to export the models to deploy

Make sure to use a recent version of transformers:

pip install --upgrade transformers

In the following code snippet, we show how to run inference with 🤗 Transformers and 4-bit quantization. Due to the large size of the model, you'll need a card with at least 30 GB of GPU RAM to run it. This includes cards such as the A100 (80 or 40GB versions) or the A6000 (48 GB).

from transformers import pipeline
import torch

model = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Load the Instruct model in 4-bit so it fits on a single ~30 GB card.
pipe = pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.float16, "load_in_4bit": True},
)

# The pipeline applies the tokenizer's chat template to the messages automatically.
messages = [{"role": "user", "content": "Explain what a Mixture of Experts is in less than 100 words."}]
outputs = pipe(messages, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"][-1]["content"])

[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST] A
Mixture of Experts is an ensemble learning method that combines multiple models,
or "experts," to make more accurate predictions. Each expert focuses on a
different subset of the data, and a gating network determines the appropriate
expert to use for a given input. This approach allows the model to adapt to
complex, non-linear relationships in the data and improve overall performance.



Using Text Generation Inference

Text Generation Inference is a production-ready inference container developed by Hugging Face to enable easy deployment of large language models. It has features such as continuous batching, token streaming, tensor parallelism for fast inference on multiple GPUs, and production-ready logging and tracing.

You can deploy Mixtral on Hugging Face's Inference Endpoints, which uses Text Generation Inference as the backend. To deploy a Mixtral model, go to the model page and click on the Deploy -> Inference Endpoints widget.

Note: You might have to request a quota upgrade via email to api-enterprise@huggingface.co to access A100s.

You can learn more in Deploy LLMs with Hugging Face Inference Endpoints on our blog. The post includes details about supported hyperparameters and how to stream your response using Python and JavaScript.
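As a quick illustration, the following sketch streams a response from a deployed endpoint with the huggingface_hub client; the endpoint URL is a placeholder for your own deployment.

from huggingface_hub import InferenceClient

# Point the client at your own Inference Endpoint (placeholder URL).
client = InferenceClient("https://YOUR-ENDPOINT.endpoints.huggingface.cloud")

prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"

# Stream tokens as they are generated instead of waiting for the full response.
for token in client.text_generation(prompt, max_new_tokens=256, stream=True):
    print(token, end="")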

You can also run Text Generation Inference locally on 2x A100s (80GB) with Docker as follows:

docker run --gpus all --shm-size 1g -p 3000:80 -v /data:/data ghcr.io/huggingface/text-generation-inference:1.3.0 \
    --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --num-shard 2 \
    --max-batch-total-tokens 1024000 \
    --max-total-tokens 32000
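Once the container is up, you can query it over HTTP. Below is a small sketch using requests against TGI's generate route, matching the -p 3000:80 port mapping above.

import requests

# TGI exposes a /generate endpoint on the host port mapped above.
response = requests.post(
    "http://localhost:3000/generate",
    json={
        "inputs": "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]",
        "parameters": {"max_new_tokens": 256, "temperature": 0.7},
    },
)
print(response.json()["generated_text"])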



Fine-tuning with 🤗 TRL

Training LLMs can be technically and computationally challenging. In this section, we look at the tools available in the Hugging Face ecosystem to efficiently train Mixtral on a single A100 GPU.

An example command to fine-tune Mixtral on OpenAssistant's chat dataset can be found below. To conserve memory, we make use of 4-bit quantization and QLoRA to target all the linear layers in the attention blocks. Note that unlike dense transformers, one should not target the MLP layers as they are sparse and don't interact well with PEFT.

First, install the nightly version of 🤗 TRL and clone the repo to access the training script:

pip install -U transformers
pip install git+https://github.com/huggingface/trl
git clone https://github.com/huggingface/trl
cd trl

Then you can run the script:

accelerate launch --config_file examples/accelerate_configs/multi_gpu.yaml --num_processes=1 \
    examples/scripts/sft.py \
    --model_name mistralai/Mixtral-8x7B-v0.1 \
    --dataset_name trl-lib/ultrachat_200k_chatml \
    --batch_size 2 \
    --gradient_accumulation_steps 1 \
    --learning_rate 2e-4 \
    --save_steps 200_000 \
    --use_peft \
    --peft_lora_r 16 --peft_lora_alpha 32 \
    --target_modules q_proj k_proj v_proj o_proj \
    --load_in_4bit

This takes about 48 hours to train on a single A100, but can be easily parallelized by tweaking --num_processes to the number of GPUs you have available.
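If you prefer to drive the same recipe from Python instead of the CLI script, a rough sketch with TRL's SFTTrainer and PEFT could look like the following. The hyperparameters mirror the command above; the dataset text field is an assumption you may need to adjust for your data.

import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

model_id = "mistralai/Mixtral-8x7B-v0.1"

# Load the base model in 4-bit, mirroring --load_in_4bit above.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# QLoRA adapters on the attention projections only (not the sparse MLP experts).
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

dataset = load_dataset("trl-lib/ultrachat_200k_chatml", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",  # assumed column name; adjust to your dataset
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="mixtral-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=1,
        learning_rate=2e-4,
    ),
)
trainer.train()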



Quantizing Mixtral

As seen above, the challenge for this model is to make it run on consumer-type hardware for anyone to use it, as the model requires ~90GB just to be loaded in half-precision (torch.float16).

With the 🤗 transformers library, we support out-of-the-box inference with state-of-the-art quantization methods such as QLoRA and GPTQ. You can read more about the quantization methods we support in the appropriate documentation section.



Load Mixtral with 4-bit quantization

As demonstrated in the inference section, you can load Mixtral with 4-bit quantization by installing the bitsandbytes library (pip install -U bitsandbytes) and passing the flag load_in_4bit=True to the from_pretrained method. For better performance, we advise users to load the model with bnb_4bit_compute_dtype=torch.float16. Note you need a GPU device with at least 30GB VRAM to properly run the snippet below.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

prompt = "[INST] Explain what a Mixture of Experts is in lower than 100 words. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(0)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

This 4-bit quantization technique was introduced in the QLoRA paper; you can read more about it in the corresponding section of the documentation or in this post.



Load Mixtral with GPTQ

The GPTQ algorithm is a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. In contrast with 4-bit QLoRA, GPTQ needs the model to be calibrated with a dataset in order to be quantized. Ready-to-use GPTQ models are shared on the 🤗 Hub by TheBloke, so anyone can use them without having to calibrate them first.

For Mixtral, we had to tweak the calibration approach by making sure we don't quantize the expert gating layers for better performance. The final perplexity (lower is better) of the quantized model is 4.40 vs 4.25 for the half-precision model. The quantized model can be found here, and to run it with 🤗 transformers you first need to update the auto-gptq and optimum libraries:

pip install -U optimum auto-gptq

You also need to install transformers from source:

pip install -U git+https://github.com/huggingface/transformers.git

Once installed, simply load the GPTQ model with the from_pretrained method:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "[INST] Explain what a Mixture of Experts is in lower than 100 words. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(0)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use device_map="auto", like in the example above, so some layers are offloaded to CPU.
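To check which layers ended up offloaded, you can inspect the device map that Accelerate inferred. This is a small sketch, assuming model was loaded with device_map="auto" as in the example above.

from collections import Counter

# Accelerate records where each module lives; entries mapped to "cpu" were offloaded.
print(Counter(model.hf_device_map.values()))
offloaded = [name for name, device in model.hf_device_map.items() if device == "cpu"]
print(f"{len(offloaded)} modules offloaded to CPU")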



Disclaimers and ongoing work

  • Quantization: Quantization of MoEs is an active area of research. Some initial experiments we've done with TheBloke are shown above, but we expect more progress as this architecture is understood better! It will be exciting to see the developments in this area in the coming days and weeks. Additionally, recent work such as QMoE, which achieves sub-1-bit quantization for MoEs, could be applied here.
  • High VRAM usage: MoEs run inference very quickly but still need a large amount of VRAM (and hence an expensive GPU). This makes it challenging to use them in local setups. MoEs are great for setups with many devices and large VRAM. Mixtral requires 90GB of VRAM in half-precision 🤯



Additional Resources



Conclusion

We're very excited about Mixtral being released! In the coming days, be ready to learn more about ways to fine-tune and deploy Mixtral.




