Overview of natively supported quantization schemes in 🤗 Transformers




We aim to give a clear overview of the pros and cons of each quantization scheme supported in transformers to help you decide which one you should go for.

Currently, quantizing models is used for two main purposes:

  • Running inference of a large model on a smaller device
  • Fine-tuning adapters on top of quantized models

So far, two integration efforts have been made and are natively supported in transformers: bitsandbytes and auto-gptq.
Note that some additional quantization schemes are also supported within the 🤗 optimum library, but that is out of scope for this blogpost.

To learn more about each of the supported schemes, please have a look at one of the resources shared below. Please also have a look at the appropriate sections of the documentation.

Note also that the details shared below are only valid for PyTorch models; this is currently out of scope for Tensorflow and Flax/JAX models.



Table of contents



Resources



Comparing bitsandbytes and auto-gptq

In this section, we will go over the pros and cons of bitsandbytes and gptq quantization. Note that these are based on feedback from the community, and they can evolve over time as some of these features are on the roadmap of the respective libraries.



What are the advantages of bitsandbytes?

easy: bitsandbytes remains the easiest way to quantize any model, as it does not require calibrating the quantized model with input data (also called zero-shot quantization). It is possible to quantize any model out of the box as long as it contains torch.nn.Linear modules. Whenever a new architecture is added in transformers, as long as it can be loaded with accelerate's device_map="auto", users can benefit from bitsandbytes quantization straight out of the box with minimal performance degradation. Quantization is performed on model load; there is no need to run any post-processing or preparation step.
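As a rough illustration, here is a minimal sketch of what such an out-of-the-box 4-bit load looks like (the model id is just an example; any architecture containing torch.nn.Linear modules should work the same way):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization happens at load time; no calibration data or preparation step is needed
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",            # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
```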

cross-modality interoperability: As the only condition for quantizing a model is that it contains a torch.nn.Linear layer, quantization works for any modality, making it possible to load models such as Whisper, ViT, Blip2, etc. in 8-bit or 4-bit out of the box.

0 performance degradation when merging adapters: (Read more about adapters and PEFT in this blogpost if you are not familiar with them.) If you train adapters on top of the quantized base model, the adapters can be merged on top of the base model for deployment, with no inference performance degradation. You can also merge the adapters on top of the dequantized model! This is not supported for GPTQ.



What are the advantages of autoGPTQ?

fast for text generation: GPTQ quantized models are fast compared to bitsandbytes quantized models for text generation. We will address the speed comparison in a dedicated section.

n-bit support: The GPTQ algorithm makes it possible to quantize models down to 2 bits! However, this might come with severe quality degradation. The recommended number of bits is 4, which seems to be a great tradeoff for GPTQ at the moment.

easily serializable: GPTQ models support serialization for any number of bits. Loading models from the TheBloke namespace (https://huggingface.co/TheBloke, look for those that end with the -GPTQ suffix) is supported out of the box, as long as you have the required packages installed. Bitsandbytes supports 8-bit serialization but does not support 4-bit serialization as of today.
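For example, loading such a pre-quantized checkpoint is essentially a one-liner; a sketch, assuming optimum and auto-gptq are installed (the repository id is one example among many):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The serialized 4-bit GPTQ weights are loaded directly; no re-quantization step is needed
model_id = "TheBloke/Llama-2-7B-GPTQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```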

AMD support: The integration should work out of the box for AMD GPUs!



What are the potential rooms of improvement for bitsandbytes?

slower than GPTQ for text generation: bitsandbytes 4-bit models are slow compared to GPTQ when using generate.

4-bit weights are not serializable: Currently, 4-bit models cannot be serialized. This is a frequent community request, and we believe it should be addressed very soon by the bitsandbytes maintainers as it is on their roadmap!



What are the potential rooms of improvement for autoGPTQ?

calibration dataset: The need for a calibration dataset might discourage some users from going for GPTQ. Furthermore, it can take several hours to quantize the model (e.g. 4 GPU hours for a 175B-scale model according to the paper, section 2).
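As a sketch of what that calibration step looks like with the transformers API (the small model id and the "c4" dataset are illustrative choices):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small example model to keep quantization time short
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ needs a calibration dataset; quantization runs while the model is being loaded
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=gptq_config
)
```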

works only for language models (for now): As of today, the API for quantizing a model with auto-GPTQ has been designed to support only language models. It should be possible to quantize non-text (or multimodal) models using the GPTQ algorithm, but the process has not been elaborated in the original paper or in the auto-gptq repository. If the community is excited about this topic, it might be considered in the future.



Diving into speed benchmarks

We decided to provide an extensive benchmark for both inference and fine-tuning adapters using bitsandbytes and auto-gptq on different hardware. The inference benchmark should give users an idea of the speed difference they might get between the different approaches we propose for inference, and the adapter fine-tuning benchmark should give users a clear idea when it comes to deciding which approach to use when fine-tuning adapters on top of bitsandbytes and GPTQ base models.

We will use the following setup:

  • bitsandbytes: 4-bit quantization with bnb_4bit_compute_dtype=torch.float16. Make sure to use bitsandbytes>=0.41.1 for fast 4-bit kernels.
  • auto-gptq: 4-bit quantization with exllama kernels. You will need auto-gptq>=0.4.0 to use the exllama kernels. A sketch of both configurations is shown right after this list.
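In transformers terms, this setup roughly corresponds to the following quantization configurations (a sketch under the stated version assumptions, not the exact benchmark script):

```python
import torch
from transformers import BitsAndBytesConfig, GPTQConfig

# bitsandbytes: 4-bit weights with fp16 compute (requires bitsandbytes>=0.41.1 for fast kernels)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

# auto-gptq: 4-bit quantization; with auto-gptq>=0.4.0, the exllama kernels are used for 4-bit models
gptq_config = GPTQConfig(bits=4)
```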



Inference speed (forward pass only)

This benchmark measures only the prefill step, which corresponds to the forward pass during training. It was run on a single NVIDIA A100-SXM4-80GB GPU with a prompt length of 512. The model we used was meta-llama/Llama-2-13b-hf.
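A minimal way to time just that prefill step could look like the sketch below (dummy token ids stand in for the real 512-token prompts used in the benchmark; the bitsandbytes variant is shown, but the same loop applies to the fp16 and GPTQ models):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-13b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16),
)

# Prefill only: a single forward pass over a 512-token prompt
batch_size, prompt_length = 1, 512
input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, prompt_length), device=model.device)

with torch.no_grad():
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = model(input_ids)
    torch.cuda.synchronize()

print(f"prefill latency: {time.perf_counter() - start:.3f} s")
```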

with batch size = 1:

| quantization | act_order | bits | group_size | kernel | Load time (s) | Per-token latency (ms) | Throughput (tok/s) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| fp16 | None | None | None | None | 26.0 | 36.958 | 27.058 | 29152.98 |
| gptq | False | 4 | 128 | exllama | 36.2 | 33.711 | 29.663 | 10484.34 |
| bitsandbytes | None | 4 | None | None | 37.64 | 52.00 | 19.23 | 11018.36 |

with batch size = 16:

| quantization | act_order | bits | group_size | kernel | Load time (s) | Per-token latency (ms) | Throughput (tok/s) | Peak memory (MB) |
|---|---|---|---|---|---|---|---|---|
| fp16 | None | None | None | None | 26.0 | 69.94 | 228.76 | 53986.51 |
| gptq | False | 4 | 128 | exllama | 36.2 | 95.41 | 167.68 | 34777.04 |
| bitsandbytes | None | 4 | None | None | 37.64 | 113.98 | 140.38 | 35532.37 |

From the benchmark, we can see that bitsandbytes and GPTQ are equivalent, with GPTQ being slightly faster for large batch sizes. Check this link for more details on these benchmarks.



Generate speed

The following benchmarks measure the generation speed of the model during inference. The benchmarking script can be found here for reproducibility.



use_cache

Let's test use_cache to better understand the impact of caching the hidden states during generation.

The benchmark was run on an A100 with a prompt length of 30 and we generated exactly 30 tokens. The model we used was meta-llama/Llama-2-7b-hf.
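A minimal sketch of that measurement (the prompt text here is illustrative; the benchmark used a 30-token prompt):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

prompt = "Quantization lets you run large language models on smaller devices by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate exactly 30 new tokens, with and without the attention (KV) cache
out_cached = model.generate(**inputs, max_new_tokens=30, use_cache=True)
out_uncached = model.generate(**inputs, max_new_tokens=30, use_cache=False)
```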

with use_cache=True

[Figure: generation benchmark with use_cache=True on an A100]

with use_cache=False

[Figure: generation benchmark with use_cache=False on an A100]

From the two benchmarks, we conclude that generation is faster when we use attention caching, as expected. Moreover, GPTQ is, in general, faster than bitsandbytes. For example, with batch_size=4 and use_cache=True, it is twice as fast! Therefore, let's use use_cache for the next benchmarks. Note that use_cache will consume more memory.



Hardware

In the following benchmark, we will try different hardware to see its impact on the quantized model. We used a prompt length of 30 and generated exactly 30 tokens. The model we used was meta-llama/Llama-2-7b-hf.

with an NVIDIA A100:

[Figure: generation benchmark on an NVIDIA A100]

with an NVIDIA T4:

[Figure: generation benchmark on an NVIDIA T4]

with a Titan RTX:

[Figure: generation benchmark on a Titan RTX]

From the benchmarks above, we can conclude that GPTQ is faster than bitsandbytes on those three GPUs.



Generation length

In the following benchmark, we will try different generation lengths to see their impact on the quantized model. It was run on an A100, we used a prompt length of 30, and we varied the number of generated tokens. The model we used was meta-llama/Llama-2-7b-hf.

with 30 tokens generated:

[Figure: generation benchmark on an A100, 30 generated tokens]

with 512 tokens generated:

[Figure: generation benchmark on an A100, 512 generated tokens]

From the benchmarks above, we can conclude that GPTQ is faster than bitsandbytes independently of the generation length.



Adapter fine-tuning (forward + backward)

It is not possible to perform pure training on a quantized model. However, you can fine-tune quantized models by leveraging parameter-efficient fine-tuning (PEFT) methods and train adapters on top of them. The fine-tuning method relies on a recent approach called "Low Rank Adapters" (LoRA): instead of fine-tuning the entire model, you only have to fine-tune these adapters and load them properly inside the model. Let's compare the fine-tuning speed!
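Here is a minimal sketch of what that looks like with PEFT on a bitsandbytes 4-bit base model (the LoRA hyperparameters and target modules are illustrative, not the exact benchmark settings):

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", quantization_config=bnb_config
)

# Freeze the quantized weights and attach trainable LoRA adapters on top of them
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of parameters is trainable
```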

The benchmark was run on an NVIDIA A100 GPU, and we used the meta-llama/Llama-2-7b-hf model from the Hub. Note that for the GPTQ model, we had to disable the exllama kernels, as exllama is not supported for fine-tuning.

[Figure: adapter fine-tuning benchmark on an A100]

From the result, we conclude that bitsandbytes is faster than GPTQ for fine-tuning.



Performance degradation

Quantization is great for reducing memory consumption. However, it does come with performance degradation. Let's compare the performance using the Open-LLM leaderboard!

with 7b model:

| model_id | Average | ARC | Hellaswag | MMLU | TruthfulQA |
|---|---|---|---|---|---|
| meta-llama/llama-2-7b-hf | 54.32 | 53.07 | 78.59 | 46.87 | 38.76 |
| meta-llama/llama-2-7b-hf-bnb-4bit | 53.4 | 53.07 | 77.74 | 43.8 | 38.98 |
| TheBloke/Llama-2-7B-GPTQ | 53.23 | 52.05 | 77.59 | 43.99 | 39.32 |

with 13b model:

| model_id | Average | ARC | Hellaswag | MMLU | TruthfulQA |
|---|---|---|---|---|---|
| meta-llama/llama-2-13b-hf | 58.66 | 59.39 | 82.13 | 55.74 | 37.38 |
| TheBloke/Llama-2-13B-GPTQ (revision = 'gptq-4bit-128g-actorder_True') | 58.03 | 59.13 | 81.48 | 54.45 | 37.07 |
| TheBloke/Llama-2-13B-GPTQ | 57.56 | 57.25 | 81.66 | 54.81 | 36.56 |
| meta-llama/llama-2-13b-hf-bnb-4bit | 56.9 | 58.11 | 80.97 | 54.34 | 34.17 |

From the results above, we conclude that there is less degradation in larger models. More interestingly, the degradation is minimal!



Conclusion and final words

In this blogpost, we compared bitsandbytes and GPTQ quantization across multiple setups. We saw that bitsandbytes is better suited for fine-tuning while GPTQ is better for generation. From this observation, one way to get better merged models would be to:

  • (1) quantize the base model using bitsandbytes (zero-shot quantization)
  • (2) add and fine-tune the adapters
  • (3) merge the trained adapters on top of the base model or the dequantized model!
  • (4) quantize the merged model using GPTQ and use it for deployment (see the sketch after this list)
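A rough sketch of steps (3) and (4), assuming the adapters from step (2) were saved to a local directory (all local paths here are hypothetical):

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "meta-llama/Llama-2-7b-hf"
adapter_dir = "llama-2-7b-adapters"   # hypothetical path to the adapters trained in step (2)
merged_dir = "llama-2-7b-merged"      # hypothetical path for the merged fp16 model

# (3) load the adapters on top of the non-quantized base model and merge them
model = AutoPeftModelForCausalLM.from_pretrained(adapter_dir, torch_dtype=torch.float16)
model = model.merge_and_unload()
model.save_pretrained(merged_dir)

# (4) re-quantize the merged model with GPTQ and save it for deployment
tokenizer = AutoTokenizer.from_pretrained(base_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized = AutoModelForCausalLM.from_pretrained(
    merged_dir, device_map="auto", quantization_config=gptq_config
)
quantized.save_pretrained("llama-2-7b-merged-gptq")
tokenizer.save_pretrained("llama-2-7b-merged-gptq")
```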

We hope that this overview will make it easier for everyone to use LLMs in their applications and use cases, and we are looking forward to seeing what you will build with it!



Acknowledgements

We would like to thank Ilyas, Clémentine and Felix for their help with the benchmarking.

Finally, we would like to thank Pedro Cuenca for his help with writing this blogpost.


