Quanto: a PyTorch quantization backend for Optimum




Quantization is a technique to reduce the computational and memory costs of evaluating Deep Learning models by representing their weights and activations with low-precision data types, such as 8-bit integer (int8), instead of the usual 32-bit floating point (float32).

Reducing the number of bits means the resulting model requires less memory storage, which is crucial for deploying Large Language Models on consumer devices.
It also enables specific optimizations for lower-bitwidth data types, such as int8 or float8 matrix multiplications on CUDA devices.

Many open-source libraries are available to quantize PyTorch Deep Learning models, each providing very powerful features, yet often restricted to specific model configurations and devices.

Also, although they are based on the same design principles, they are unfortunately often incompatible with one another.

Today, we’re excited to introduce quanto, a PyTorch quantization backend for Optimum.

It has been designed with versatility and simplicity in mind:

  • all features are available in eager mode (works with non-traceable models),
  • quantized models can be placed on any device (including CUDA and MPS),
  • automatically inserts quantization and dequantization stubs,
  • automatically inserts quantized functional operations,
  • automatically inserts quantized modules (see below the list of supported modules),
  • provides a seamless workflow from a float model to a dynamic and then a static quantized model,
  • serialization compatible with PyTorch weight_only and 🤗 Safetensors,
  • accelerated matrix multiplications on CUDA devices (int8-int8, fp16-int4, bf16-int8, bf16-int4),
  • supports int2, int4, int8 and float8 weights,
  • supports int8 and float8 activations.

Recent quantization methods appear to be focused on quantizing Large Language Models (LLMs), whereas quanto intends to provide very simple quantization primitives for simple quantization schemes (linear quantization, per-group quantization) that are adaptable across any modality.
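
To illustrate that the same primitives apply outside the LLM world, here is a minimal sketch quantizing a torchvision ResNet; the choice of resnet50 is arbitrary and purely illustrative, not an official example:

import torch
from torchvision.models import resnet50

from optimum.quanto import quantize, freeze, qint8

# Any nn.Module can be quantized, not just LLMs; resnet50 is an arbitrary example.
model = resnet50(weights=None)

# Linear int8 quantization of the weights (activations left in float).
quantize(model, weights=qint8)
freeze(model)

# The quantized model is a drop-in replacement for the float one.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))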



Quantization workflow

Quanto is available as a pip package.

pip install optimum-quanto

A typical quantization workflow consists of the following steps:

1. Quantize

The first step converts a standard float model into a dynamically quantized model.

from optimum.quanto import quantize, qint8

quantize(model, weights=qint8, activations=qint8)

At this stage, only the inference of the model is modified to dynamically quantize the weights.
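
To see what the call actually changed, a quick sketch (assuming model is the module quantized above) is to list the module classes; eligible layers such as nn.Linear are swapped for their quantized counterparts:

# Print module classes to see which layers were replaced by quantized versions.
for name, module in model.named_modules():
    print(name, type(module).__name__)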

2. Calibrate (optional if activations are not quantized)

Quanto supports a calibration mode that allows recording the activation ranges while passing representative samples through the quantized model.

from optimum.quanto import Calibration

with Calibration(momentum=0.9):
    model(samples)

This automatically activates the quantization of the activations in the quantized modules.

3. Tune, aka Quantization-Aware-Training (optional)

If the performance of the model degrades too much, one can tune it for a few epochs to recover the float model performance.

import torch

model.train()
for batch_idx, (data, target) in enumerate(train_loader):
    data, target = data.to(device), target.to(device)
    optimizer.zero_grad()
    output = model(data).dequantize()
    loss = torch.nn.functional.nll_loss(output, target)
    loss.backward()
    optimizer.step()

4. Freeze integer weights

When freezing a model, its float weights are replaced by quantized weights.

from optimum.quanto import freeze

freeze(model)

5. Serialize quantized model

Quantized model weights can be serialized to a state_dict and saved to a file.
Both pickle and safetensors (recommended) are supported.

from safetensors.torch import save_file

save_file(model.state_dict(), 'model.safetensors')
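
If you prefer the pickle-based route mentioned above, a minimal sketch is a plain torch.save of the same state_dict (safetensors remains the recommended format):

import torch

# Pickle-based alternative for the same state_dict.
torch.save(model.state_dict(), 'model.pt')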

In order to reload these weights, you also need to store the quantized model's quantization map.

import json

from optimum.quanto import quantization_map

with open('quantization_map.json', 'w') as f:
  json.dump(quantization_map(model), f)

6. Reload a quantized model

A serialized quantized model can be reloaded from a state_dict and a quantization_map using the requantize helper.
Note that you must first instantiate an empty model.

import json

import torch
from safetensors.torch import load_file

from optimum.quanto import requantize

state_dict = load_file('model.safetensors')
with open('quantization_map.json', 'r') as f:
  quantization_map = json.load(f)


with torch.device('meta'):
  new_model = ...
requantize(new_model, state_dict, quantization_map, device=torch.device('cuda'))

Please refer to the examples for instantiations of the quantization workflow.
You can also check this notebook, where we show how to quantize a BLOOM model with quanto!



Performance

Below are two graphs comparing the accuracy of different quantized configurations for meta-llama/Meta-Llama-3.1-8B.

Note: the first bar in each group always corresponds to the non-quantized model.

These results are obtained without applying any Post-Training-Optimization algorithm like hqq or AWQ.

The graph below gives the per-token latency measured on an NVIDIA A10 GPU.

Figure: meta-llama/Meta-Llama-3.1-8B mean latency per token (NVIDIA A10)

Stay tuned for updated results, as we are constantly improving quanto with optimizers and optimized kernels.

Please refer to the quanto benchmarks for detailed results for different model architectures and configurations.



Integration in transformers

Quanto is seamlessly integrated in the Hugging Face transformers library. You can quantize any model by passing a QuantoConfig to from_pretrained!

Currently, you need to use the latest version of accelerate to make sure the integration is fully compatible.

from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig

model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantization_config = QuantoConfig(weights="int8")

quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config
)

You can quantize the weights and/or activations to int8, float8, int4, or int2 by simply passing the corresponding argument to QuantoConfig. The activations can be either int8 or float8. For float8, you need hardware that is compatible with float8 precision; otherwise, quanto will silently upcast the weights and activations to torch.float32 or torch.float16 (depending on the original data type of the model) when we perform the matmul (only when the weight is quantized). If you try to use float8 on MPS devices, torch will currently raise an error.
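
As a minimal sketch, reusing the facebook/opt-125m checkpoint from the snippet above with int4 weights (an arbitrary choice among the supported options):

from transformers import AutoModelForCausalLM, QuantoConfig

# int4 weights; activations left unquantized (see the torch.compile note below).
quantization_config = QuantoConfig(weights="int4")

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",
    quantization_config=quantization_config
)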

Quanto is device agnostic, meaning you can quantize and run your model regardless of whether you are on CPU, GPU, or MPS (Apple Silicon).

Quanto is also torch.compile friendly. You can quantize a model with quanto and call torch.compile on the model to compile it for faster generation. This feature might not work out of the box if dynamic quantization is involved (i.e., Quantization-Aware Training or quantized activations enabled). Make sure to keep activations=None when creating your QuantoConfig in case you use the transformers integration.
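
For example, reusing the weights-only quantized_model from the earlier transformers snippet (activations left unset), compilation is a single extra call; treat this as a sketch rather than a guaranteed speedup:

import torch

# Compile the weights-only quantized model for faster generation.
compiled_model = torch.compile(quantized_model)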

It is also possible to quantize any model, regardless of modality, using quanto! We demonstrate how to quantize the openai/whisper-large-v3 model in int8 using quanto.

import torch

from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig

model_id = "openai/whisper-large-v3"
quanto_config = QuantoConfig(weights="int8")

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="cuda",
    quantization_config=quanto_config
)

Check out this notebook for a complete tutorial on how to properly use quanto with the transformers integration!



Contributing to quanto

Contributions to quanto are very much welcome, especially in the following areas:

  • optimized kernels for quanto operations targeting specific devices,
  • Post-Training-Quantization optimizers to recover the accuracy lost during quantization,
  • helper classes for transformers or diffusers models.


