Large language models have demonstrated remarkable capabilities in understanding and generating human-like text, revolutionizing applications across various domains. However, the demands they place on consumer hardware for training and deployment have become increasingly difficult to meet.
🤗 Hugging Face’s core mission is to democratize good machine learning, and this includes making large models as accessible as possible for everybody. In the same spirit as our bitsandbytes collaboration, we have just integrated the AutoGPTQ library in Transformers, making it possible for users to quantize and run models in 8, 4, 3, or even 2-bit precision using the GPTQ algorithm (Frantar et al. 2023). There is negligible accuracy degradation with 4-bit quantization, and inference speed is comparable to the fp16 baseline for small batch sizes. Note that the GPTQ method slightly differs from the post-training quantization methods proposed by bitsandbytes, as it requires passing a calibration dataset.
This integration is available for both Nvidia GPUs and RoCm-powered AMD GPUs.
Resources
This blogpost and release include several resources to get started with GPTQ quantization:
A gentle summary of the GPTQ paper
Quantization methods usually belong to one of two categories:
- Post-Training Quantization (PTQ): We quantize a pre-trained model using moderate resources, such as a calibration dataset and a few hours of computation.
- Quantization-Aware Training (QAT): Quantization is performed before training or further fine-tuning.
GPTQ falls into the PTQ category, and this is particularly interesting for large models, for which full model training or even fine-tuning can be very expensive.
Specifically, GPTQ adopts a mixed int4/fp16 quantization scheme where weights are quantized as int4 while activations remain in float16. During inference, weights are dequantized on the fly and the actual compute is performed in float16.
The advantages of this scheme are twofold:
- Memory savings close to 4x for int4 quantization, as the dequantization happens close to the compute unit in a fused kernel, and not in the GPU global memory.
- Potential speedups thanks to the time saved on data communication due to the lower bitwidth used for weights.
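To make the W4A16 scheme concrete, here is a minimal PyTorch sketch of the idea, not the actual fused AutoGPTQ kernel: the weight layout, names and group size below are illustrative assumptions; real kernels pack eight 4-bit values per int32 and fuse the dequantization with the matmul.

```python
import torch

# W4A16 in a nutshell: weights are stored as 4-bit integers plus per-group
# scales/zero-points; they are dequantized to the compute dtype right before
# the matmul, while activations never leave that dtype.
device = "cuda" if torch.cuda.is_available() else "cpu"
compute_dtype = torch.float16 if device == "cuda" else torch.float32  # fp16 is the real target

def dequantize_w4(qweight, scales, zeros, group_size=128):
    # qweight: (out_features, in_features), integer values in [0, 15]
    # (kept as int8 here for readability instead of packed int32).
    # scales / zeros: (out_features, in_features // group_size)
    scales = scales.repeat_interleave(group_size, dim=1)
    zeros = zeros.repeat_interleave(group_size, dim=1)
    return (qweight.to(compute_dtype) - zeros) * scales

def w4a16_linear(x, qweight, scales, zeros, group_size=128):
    w = dequantize_w4(qweight, scales, zeros, group_size)  # dequantize on the fly
    return x @ w.t()                                       # compute stays in half precision

# Toy shapes: 8 tokens, hidden size 256, output size 512, groups of 128 columns.
qweight = torch.randint(0, 16, (512, 256), dtype=torch.int8, device=device)
scales = torch.rand(512, 2, dtype=compute_dtype, device=device)
zeros = torch.full((512, 2), 8.0, dtype=compute_dtype, device=device)
x = torch.randn(8, 256, dtype=compute_dtype, device=device)
print(w4a16_linear(x, qweight, scales, zeros).shape)  # torch.Size([8, 512])
```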
The GPTQ paper tackles the layer-wise compression problem:
Given a layer $\ell$ with weight matrix $W_{\ell}$ and layer input $X_{\ell}$, we want to find a quantized version of the weight $\hat{W}_{\ell}$ that minimizes the mean squared error (MSE):

$$\hat{W}_{\ell}^{*} = \operatorname{argmin}_{\hat{W}_{\ell}} \lVert W_{\ell} X_{\ell} - \hat{W}_{\ell} X_{\ell} \rVert_2^2$$
Once this is solved per layer, a solution to the global problem can be obtained by combining the layer-wise solutions.
In order to solve this layer-wise compression problem, the authors use the Optimal Brain Quantization framework (Frantar et al. 2022). The OBQ method starts from the observation that the above equation can be written as the sum of the squared errors over each row of $W_{\ell}$.
This means that we can quantize each row independently. This is called per-channel quantization. For each row $w$, OBQ quantizes one weight at a time while always updating all not-yet-quantized weights, in order to compensate for the error incurred by quantizing a single weight. The update on selected weights has a closed-form formula, utilizing Hessian matrices.
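For reference, the greedy weight selection and the closed-form update used by OBQ, as stated in the OBQ and GPTQ papers, take the following form, where $F$ denotes the set of remaining (not-yet-quantized) weights and $\mathbf{H}_F$ is the Hessian of the layer-wise objective restricted to $F$:

$$w_q = \operatorname{argmin}_{w_q} \frac{\bigl(\operatorname{quant}(w_q) - w_q\bigr)^2}{[\mathbf{H}_F^{-1}]_{qq}}, \qquad \boldsymbol{\delta}_F = -\frac{w_q - \operatorname{quant}(w_q)}{[\mathbf{H}_F^{-1}]_{qq}} \cdot (\mathbf{H}_F^{-1})_{:,q}$$

The first expression selects the weight that is cheapest to quantize given the inverse Hessian; the second spreads the resulting error over the remaining weights.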
The GPTQ paper improves this framework by introducing a set of optimizations that reduces the complexity of the quantization algorithm while retaining the accuracy of the model.
Compared to OBQ, the quantization step itself is also faster with GPTQ: it takes 2 GPU-hours to quantize a BERT model (336M) with OBQ, whereas with GPTQ, a Bloom model (176B) can be quantized in less than 4 GPU-hours.
To learn more about the exact algorithm and the different benchmarks on perplexity and speedups, check out the original paper.
AutoGPTQ library – the one-stop library for efficiently leveraging GPTQ for LLMs
The AutoGPTQ library enables users to quantize 🤗 Transformers models using the GPTQ method. While parallel community efforts such as GPTQ-for-LLaMa, Exllama and llama.cpp implement quantization methods strictly for the Llama architecture, AutoGPTQ gained popularity through its smooth coverage of a wide range of transformer architectures.
Since the AutoGPTQ library has a larger coverage of transformers models, we decided to provide an integrated 🤗 Transformers API to make LLM quantization more accessible to everyone. At this time we have integrated the most common optimization options, such as CUDA kernels. For more advanced options like Triton kernels or fused-attention compatibility, check out the AutoGPTQ library.
Native support of GPTQ models in 🤗 Transformers
After installing the AutoGPTQ library and optimum (pip install optimum), running GPTQ models in Transformers is now as simple as:
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ", torch_dtype=torch.float16, device_map="auto")
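Generating text with the loaded model then works like with any other Transformers model. Continuing from the snippet above, a minimal example (the prompt is just a placeholder):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-7b-Chat-GPTQ")
inputs = tokenizer("Quantization makes large language models", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```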
Check out the Transformers documentation to learn more about all the features.
Our AutoGPTQ integration has many benefits:
- Quantized models are serializable and can be shared on the Hub.
- GPTQ drastically reduces the memory requirements to run LLMs, while the inference latency is on par with FP16 inference.
- AutoGPTQ supports Exllama kernels for a wide range of architectures.
- The integration comes with native RoCm support for AMD GPUs.
- Finetuning with PEFT is available.
You can check on the Hub if your favorite model has already been quantized. TheBloke, one of Hugging Face's top contributors, has quantized a lot of models with AutoGPTQ and shared them on the Hugging Face Hub. We worked together to make sure that these repositories work out of the box with our integration.
This is a benchmark sample for the batch size = 1 case. The benchmark was run on a single NVIDIA A100-SXM4-80GB GPU. We used a prompt length of 512 and generated exactly 512 new tokens. The first row is the unquantized fp16 baseline, while the other rows show memory consumption and performance using different AutoGPTQ kernels.
| gptq | act_order | bits | group_size | kernel | Load time (s) | Per-token latency (ms) | Throughput (tokens/s) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| False | None | None | None | None | 26.0 | 36.958 | 27.058 | 29152.98 |
| True | False | 4 | 128 | exllama | 36.2 | 33.711 | 29.663 | 10484.34 |
| True | False | 4 | 128 | autogptq-cuda-old | 36.2 | 46.44 | 21.53 | 10344.62 |
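As a quick sanity check on the units, the throughput column is simply the inverse of the per-token latency:

```python
# Throughput (tokens/s) = 1000 / per-token latency (ms), e.g. for the fp16 baseline:
print(1000 / 36.958)  # ≈ 27.06 tokens/s, matching the first row of the table
```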
A more comprehensive reproducible benchmark is available here.
Quantizing models with the Optimum library
To seamlessly integrate AutoGPTQ into Transformers, we used a minimalist version of the AutoGPTQ API that is available in Optimum, Hugging Face's toolkit for training and inference optimization. By following this approach, we achieved easy integration with Transformers, while allowing people to use the Optimum API if they want to quantize their own models! Check out the Optimum documentation if you want to quantize your own LLMs.
Quantizing 🤗 Transformers models with the GPTQ method can be done in a few lines:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)
Quantizing a model may take a long time. Note that for a 175B model, at least 4 GPU-hours are required if one uses a large dataset (e.g. `"c4"`). As mentioned above, many GPTQ models are already available on the Hugging Face Hub, which bypasses the need to quantize a model yourself in most use cases. Nevertheless, you can also quantize a model using your own dataset appropriate for the particular domain you are working on, as sketched below.
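As a minimal sketch (the calibration texts and repository name below are placeholders), `GPTQConfig` also accepts a list of strings as the calibration dataset, and the resulting model can be saved or pushed to the Hub like any other Transformers model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A few in-domain samples used as calibration data (placeholders; in practice
# you would use a few hundred representative texts).
calibration_samples = [
    "GPTQ is a post-training quantization method for large language models.",
    "Calibration data should resemble the text the model will see at inference time.",
]

quantization_config = GPTQConfig(bits=4, dataset=calibration_samples, tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=quantization_config
)

# The quantized model is serializable like any other Transformers model.
quantized_model.save_pretrained("opt-125m-gptq")
tokenizer.save_pretrained("opt-125m-gptq")
# quantized_model.push_to_hub("my-username/opt-125m-gptq")  # placeholder repo id
```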
Running GPTQ models through Text-Generation-Inference
In parallel to the integration of GPTQ in Transformers, GPTQ support was added to the Text-Generation-Inference library (TGI), aimed at serving large language models in production. GPTQ can now be used alongside features such as dynamic batching, paged attention and flash attention for a wide range of architectures.
For example, this integration allows serving a 70B model on a single A100-80GB GPU! This is not possible using an fp16 checkpoint as it exceeds the available GPU memory.
You can find out more about the usage of GPTQ in TGI in the documentation.
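Once a GPTQ model is deployed with TGI (see the documentation above for the exact launcher options), it can be queried like any other TGI endpoint. A minimal sketch, assuming a server is already running locally on port 8080:

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server serving a GPTQ model is already running at this address.
client = InferenceClient("http://localhost:8080")
print(client.text_generation("What is GPTQ quantization?", max_new_tokens=64))
```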
Note that the kernel integrated in TGI does not scale well with larger batch sizes. Although this approach saves memory, slowdowns are expected at larger batch sizes.
Fine-tune quantized models with PEFT
You cannot further train a quantized model using the regular methods. However, by leveraging the PEFT library, you can train adapters on top! To do so, we freeze all the layers of the quantized model and add trainable adapters. Here are some examples on how to use PEFT with a GPTQ model: colab notebook and finetuning script.
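A minimal sketch of this workflow (the LoRA hyper-parameters and target modules below are illustrative assumptions; see the linked notebook and script for complete examples):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM

# Load a GPTQ-quantized checkpoint, then attach trainable LoRA adapters on top.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GPTQ", torch_dtype=torch.float16, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "v_proj"],  # illustrative choice for Llama-style models
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # the quantized base weights stay frozen
```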
Room for improvement
Our AutoGPTQ integration already brings impressive benefits at a small cost in prediction quality. There is still room for improvement, both in the quantization techniques and in the kernel implementations.
First, while AutoGPTQ integrates (to the best of our knowledge) with the most performant W4A16 kernel (weights as int4, activations as fp16) from the exllama implementation, there is a good chance that the kernel can still be improved. There have been other promising implementations from Kim et al. and from MIT Han Lab. Moreover, from internal benchmarks, there appears to still be no open-source performant W4A16 kernel written in Triton, which could be a direction to explore.
On the quantization side, let's emphasize again that this method only quantizes the weights. Other approaches have been proposed for LLM quantization that can quantize both weights and activations at a small cost in prediction quality, such as LLM-QAT, where a mixed int4/int8 scheme can be used, as well as quantization of the key-value cache. One of the strong advantages of this technique is the ability to use actual integer arithmetic for the compute, with e.g. Nvidia Tensor Cores supporting int8 compute. However, to the best of our knowledge, there are no open-source W4A8 quantization kernels available, but this may well be an interesting direction to explore.
On the kernel side as well, designing performant W4A16 kernels for larger batch sizes remains an open challenge.
Supported models
In this initial implementation, only large language models with a decoder-only or encoder-only architecture are supported. This may sound a bit restrictive, but it encompasses most state-of-the-art LLMs such as Llama, OPT, GPT-Neo, and GPT-NeoX.
Very large vision, audio, and multi-modal models are currently not supported.
Conclusion and final words
In this blogpost we have presented the integration of the AutoGPTQ library in Transformers, making it possible to quantize LLMs with the GPTQ method so that they become more accessible to anyone in the community, empowering people to build exciting tools and applications with LLMs.
This integration is available for both Nvidia GPUs and RoCm-powered AMD GPUs, which is a big step towards democratizing quantized models for broader GPU architectures.
The collaboration with the AutoGPTQ team has been very fruitful, and we are very grateful for their support and their work on this library.
We hope that this integration will make it easier for everyone to use LLMs in their applications, and we are looking forward to seeing what you will build with it!
Don't miss the useful resources shared above for a better understanding of the integration and how to quickly get started with GPTQ quantization.
Acknowledgements
We would like to thank William for his support and his work on the amazing AutoGPTQ library, and for his help with the integration.
We would also like to thank TheBloke for quantizing many models with AutoGPTQ and sharing them on the Hub, and for his help with the integration.
We would also like to acknowledge qwopqwop200 for his continuous contributions to the AutoGPTQ library and his work on extending it for CPU, which is going to be released in the next versions of AutoGPTQ.
Finally, we would like to thank Pedro Cuenca for his help with the writing of this blogpost.
