Boost 2-Bit LLM Accuracy with EoRA


Quantization is one of the key techniques for reducing the memory footprint of large language models (LLMs). It works by converting the data type of model parameters from higher-precision formats such as 32-bit floating point (FP32) or 16-bit floating point (FP16/BF16) to lower-precision integer formats, typically INT8 or INT4. For example, quantizing a model to 4-bit means each parameter uses only 0.5 bytes, compared to 4 bytes in FP32.

Post-training quantization methods like GPTQ and AWQ can dramatically reduce the size of large models. A model like Llama 3 with 70 billion parameters can occupy around 140 GB in FP16, but this can be reduced to roughly 40 GB using 4-bit quantization, while still maintaining strong performance on downstream tasks.
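The arithmetic behind these estimates is simple: multiply the parameter count by the bytes per parameter, plus a small allowance for quantization metadata such as scales and zero points. The sketch below is a rough back-of-the-envelope helper; the 5% metadata overhead for the quantized formats is an assumed allowance for illustration, not a measured value.

```python
def model_size_gb(n_params: float, bits_per_param: float,
                  metadata_overhead: float = 0.0) -> float:
    """Estimate the weight memory of a model, in GB, for a given precision."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total * (1 + metadata_overhead) / 1e9

n = 70e9  # a 70B-parameter model, e.g. Llama 3 70B
print(f"FP16 : {model_size_gb(n, 16):.0f} GB")                          # ~140 GB
print(f"INT4 : {model_size_gb(n, 4, metadata_overhead=0.05):.0f} GB")   # ~37 GB
print(f"INT2 : {model_size_gb(n, 2, metadata_overhead=0.05):.0f} GB")   # ~18 GB
```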

However, despite this substantial reduction, such models still exceed the memory capacity of most consumer-grade GPUs, which typically offer 24 GB to 32 GB of VRAM. To make these models truly accessible, quantization to even lower bitwidths, such as 2-bit, is required. While recent advances in low-bit quantization are promising, achieving stable and accurate 2-bit quantization remains a major challenge.

In this article, we review EoRA, a technique that helps compensate for quantization-induced errors. EoRA is a training-free method, meaning it can be applied quickly and efficiently to any model, even the largest ones. We will examine how EoRA works and demonstrate how it can significantly improve the performance of 2-bit quantized models, bringing them close to the accuracy of their full-precision counterparts while being up to 5.5x smaller.

We will analyze experimental results obtained with large models such as Qwen3-32B and Qwen2.5-72B, both quantized to 2-bit using state-of-the-art quantization techniques, to illustrate the effectiveness of EoRA.

Diving into the Eigenspace in Search of an Adapter

Post-training quantization or, more generally, compression aims to reduce model size or inference cost by minimizing the output difference between the original weights and the compressed weights, using only a small calibration dataset.

Most quantization methods are framed layer-wise, but the choice of compression format is rigid and limits flexibility across diverse deployment needs.

To bypass format constraints and improve accuracy, previous work, such as QLoRA [1] and HQQ+ [2], directly fine-tuned a LoRA adapter on top of the frozen quantized model.

It is also possible to reframe compression as a compensation problem: given a compressed model, introduce low-rank residual paths that specifically correct compression errors.

A simple method uses SVD to decompose the compression error of each layer,

\[ \Delta W_l = W_l - \hat{W}_l \]

into

\[ \Delta W_l = U_l \Sigma_l V_l^T \]

forming a low-rank approximation via two matrices:

\[ B_l = U_l \Sigma_l \]

\[ A_l = V_l^T \]

where \( B_l \) and \( A_l \) are the standard tensors of a LoRA adapter.
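To make this concrete, here is a minimal sketch of plain SVD compensation for a single linear layer in PyTorch. The function name and the rank value are illustrative; this is not taken from any particular library.

```python
import torch

def svd_compensation(W: torch.Tensor, W_hat: torch.Tensor, rank: int):
    """Build LoRA-style matrices B, A approximating the compression error W - W_hat."""
    delta_w = W - W_hat                      # per-layer compression error
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    B = U[:, :rank] * S[:rank]               # B_l = U_l Sigma_l (truncated)
    A = Vh[:rank, :]                         # A_l = V_l^T (truncated)
    return B, A

# Usage: the corrected layer output is x @ (W_hat + B @ A).T,
# i.e. the low-rank path B @ A is added on top of the quantized weights.
```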

However, plain SVD has two limitations: it doesn't minimize the original layerwise compression loss directly, and it allocates capacity uniformly across all error components, ignoring the varying importance of different parts of the model.

To address this, NVIDIA proposes EoRA [3].

EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

EoRA first projects the compression error into the eigenspace defined by the input activation covariance:

\[ \tilde{X} \tilde{X}^T \]

where \( \tilde{X} \) is the average activation over the calibration set. Then, by performing eigendecomposition, we get:

\[ \tilde{X} \tilde{X}^T = Q \Lambda Q^T \]

The compression error is projected as:

\[ \Delta W' = \Delta W Q' \]

where \( Q' = Q \Lambda^{1/2} \). SVD is then applied to \( \Delta W' \) to produce a low-rank approximation, and the result is projected back to the original space, with the low-rank factors adjusted accordingly.

This eigenspace projection changes the optimization objective: it weights the importance of different error components according to their contribution to the layerwise output (via the eigenvalues), making the approximation more efficient. It can be computed quickly without any training, requires only calibration activations, and doesn't introduce extra inference latency. Moreover, the derivation shows that this approach leads to a direct minimization of the layerwise compression loss, not just the raw weight error.

Analytically, truncating a singular value in the projected space corresponds to minimizing the true compression error, under reasonable assumptions about the calibration activations.
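Here is a minimal sketch of that procedure for one layer, assuming we already have the original weights, the quantized weights, and a batch of calibration activations. It follows the steps above (covariance eigendecomposition, projection, truncated SVD, back-projection) and is only an illustration of the idea, not the GPTQModel implementation used later in this article.

```python
import torch

def eora_compensation(W: torch.Tensor, W_hat: torch.Tensor,
                      X: torch.Tensor, rank: int, eps: float = 1e-6):
    """Eigenspace-projected low-rank compensation of the error W - W_hat.

    W, W_hat: (out_features, in_features); X: (in_features, n_samples)
    calibration activations. Returns LoRA-style matrices B, A.
    """
    delta_w = W - W_hat
    # Eigendecomposition of the activation covariance: X X^T = Q diag(L) Q^T
    cov = X @ X.T
    L, Q = torch.linalg.eigh(cov)
    L = L.clamp_min(eps)                     # guard against tiny negative eigenvalues
    Q_prime = Q * L.sqrt()                   # Q' = Q Lambda^{1/2}
    # Project the error, truncate with SVD, then project back
    delta_w_proj = delta_w @ Q_prime
    U, S, Vh = torch.linalg.svd(delta_w_proj, full_matrices=False)
    B = U[:, :rank] * S[:rank]
    A = Vh[:rank, :] @ torch.linalg.inv(Q_prime)  # undo the projection on the A side
    return B, A
```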

In their paper, NVIDIA presents a wide range of strong results showing that EoRA can significantly boost the accuracy of quantized models. However, their experiments focus mostly on older quantization methods like GPTQ and are limited to mid-sized LLMs, up to 13B parameters, at 3-bit and 4-bit precisions.

This leaves an open question: does EoRA remain effective for larger models quantized to 2-bit with more recent quantization techniques?

Let's find out.

Calibrating an EoRA Adapter

Suppose we have quantized models that show significantly degraded performance compared to their full-precision counterparts on certain tasks. Our goal is to reduce this performance gap using EoRA.

For the experiments, I used Qwen2.5-72B Instruct and Qwen3-32B, both quantized to 2-bit using AutoRound (Apache 2.0 license), a state-of-the-art quantization algorithm developed by Intel. AutoRound leverages SignSGD optimization to fine-tune quantization and is particularly effective for low-bit settings.

All of the models I made can be found here (Apache 2.0 license):

The 2-bit models were quantized with a group size of 32, except for one variant, which used a group size of 128. A larger group size reduces model size by storing less quantization metadata, but it introduces greater quantization error.

I evaluated the models on IFEval, a benchmark that measures instruction-following capabilities. Results showed a noticeable drop in performance for the quantized versions.

Image by the author

To compensate for this degradation, I applied an EoRA adapter using the implementation provided in the GPTQModel library (licensed under Apache 2.0). The integration is straightforward. If you're curious about how it's implemented in PyTorch, the codebase is compact, clean, and easy to follow:

  • GPTQModel’s EoRA implementation: eora.py

EoRA requires a calibration dataset. Ideally, this dataset should reflect the model's intended use case. However, since we don't have a specific target task in this context and aim to preserve the model's general capabilities, I used 1,024 randomly sampled examples from the C4 dataset (licensed under ODC-BY).

Another key parameter is the LoRA rank, which greatly influences the effectiveness of the EoRA adapter. Its optimal value depends on the model architecture, the target task, and the calibration data. A higher rank may yield better performance but risks overfitting to the calibration set. It also increases the size of the adapter, which is counterproductive when the overall goal of quantization is to reduce memory usage. Conversely, a lower rank keeps the adapter lightweight but may not capture enough information to effectively compensate for quantization errors.

In my experiments, I tested LoRA ranks of 32, 64, and 256.

Below is the code used to create the EoRA adapter with GPTQModel:

from gptqmodel import GPTQModel
from gptqmodel.adapter.adapter import Lora
from datasets import load_dataset

# 1,024 calibration examples sampled from the English C4 split
calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train", download_mode="force_redownload"
).select(range(1024))["text"]

# Where the EoRA adapter will be saved, and the 2-bit model it compensates
eora_adapter_path = "Qwen3-32B-autoround-2bit-gptq-r256"
model_path = "kaitchup/Qwen3-32B-autoround-2bit-gptq"

eora = Lora(
    path=eora_adapter_path,
    rank=256,
)

# Generate the adapter from the full-precision model, the quantized model,
# and the calibration set
GPTQModel.adapter.generate(
    adapter=eora,
    model_id_or_path="Qwen/Qwen3-32B",
    quantized_model_id_or_path=model_path,
    calibration_dataset=calibration_dataset,
    calibration_dataset_concat_size=0,
    auto_gc=False,
)

Using an NVIDIA A100 GPU on RunPod, it took roughly 4 hours to generate the EoRA adapter for the model Qwen3-32B-autoround-2bit-gptq.

All EoRA adapters created for these models are publicly available (Apache 2.0 license):

Evaluating EoRA Adapters for 2-bit LLMs

Let’s evaluate the effect of the EoRA adapters. Do they improve the accuracy of the 2-bit models?

Image by the author

It works!

The improvements are particularly notable for Qwen3-14B and Qwen3-32B. For instance, applying EoRA to Qwen3-32B, quantized to 2-bit with a group size of 128, resulted in an accuracy gain of nearly 7.5 points. Increasing the LoRA rank from 32 to 64 also led to improvements, highlighting the impact of rank on performance.

EoRA is also effective on larger models like Qwen2.5-72B, though the gains are more modest. Lower-rank adapters showed little to no benefit on this model; it wasn't until I increased the rank to 256 that significant improvements started to appear.

Memory Consumption of EoRA

Using the EoRA adapter during inference results in the following increase in memory consumption:

Image by the author

The overhead is generally negligible. For instance, for 2-bit Qwen3-14B, the adapters only add 257 MB and 514 MB to the total model size, with ranks of 32 and 64, respectively. With larger ranks, using an EoRA adapter becomes questionable, as the total memory consumption may surpass that of the same model quantized at a higher precision. For instance, 2-bit Qwen2.5-72B with an EoRA adapter of rank 256 is larger than 3-bit Qwen2.5-72B.
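If you want a rough estimate of this overhead before generating an adapter, you can compute it from the layer shapes: each adapted linear layer of shape (out, in) adds about rank * (out + in) parameters, stored in 16-bit. The sketch below is only an approximation under stated assumptions: it covers the attention and MLP projections, treats all attention projections as square (ignoring grouped-query attention), and uses configuration numbers loosely based on a Qwen3-14B-like model.

```python
def adapter_size_mb(hidden: int, intermediate: int, n_layers: int,
                    rank: int, bytes_per_param: int = 2) -> float:
    """Rough LoRA/EoRA adapter size: rank * (in + out) params per adapted matrix."""
    per_layer = 0
    # attention projections q, k, v, o (assumed square, hidden -> hidden)
    per_layer += 4 * rank * (hidden + hidden)
    # MLP projections: gate, up (hidden -> intermediate) and down (intermediate -> hidden)
    per_layer += 3 * rank * (hidden + intermediate)
    return n_layers * per_layer * bytes_per_param / 1e6

# Roughly 280 MB and 560 MB with these assumptions, in the same
# ballpark as the measured 257 MB and 514 MB reported above.
print(adapter_size_mb(hidden=5120, intermediate=17408, n_layers=40, rank=32))
print(adapter_size_mb(hidden=5120, intermediate=17408, n_layers=40, rank=64))
```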

Conclusion

EoRA works. We've confirmed that it's a simple yet effective method for compensating quantization errors, even at 2-bit precision. It's intuitive, training-free, and delivers meaningful performance gains. That said, there are a few trade-offs to consider:

  • Rank search: Finding the optimal LoRA rank requires experimentation. It's difficult to predict upfront whether a rank of 32 will be sufficient or whether a higher rank, like 256, will cause overfitting. The optimal value depends on the model, calibration data, and target task.
  • Increased memory consumption: The goal of quantization is to reduce memory usage, often for highly constrained environments. While EoRA adapters are relatively lightweight at lower ranks, they do slightly increase memory consumption, particularly at higher ranks, reducing the overall efficiency of 2-bit quantization.

Looking ahead, NVIDIA's paper also demonstrates that EoRA adapters make excellent starting points for QLoRA fine-tuning. In other words, if you plan to fine-tune a 2-bit model using QLoRA, initializing from an EoRA-adapted model can lead to better results with less training effort. I wrote about fine-tuning adapters for GPTQ models last year, in my newsletter:

QLoRA with AutoRound: Cheaper and Better LLM Fine-tuning on Your GPU

The main difference is that instead of initializing the adapter from scratch, we would load the EoRA adapter. This adapter will then be fine-tuned.
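A minimal sketch of what that could look like with Hugging Face transformers and peft, assuming the EoRA adapter has been saved in a PEFT-compatible LoRA format (depending on how GPTQModel stored it, a conversion step may be needed), and that a GPTQ-compatible backend is installed for loading the quantized checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_path = "kaitchup/Qwen3-32B-autoround-2bit-gptq"      # 2-bit GPTQ checkpoint
eora_adapter_path = "Qwen3-32B-autoround-2bit-gptq-r256"   # EoRA adapter from the step above

# Load the frozen quantized model and its tokenizer
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the EoRA adapter as a trainable LoRA instead of initializing from scratch
model = PeftModel.from_pretrained(model, eora_adapter_path, is_trainable=True)

# From here, fine-tune as usual with your preferred trainer (e.g. TRL's SFTTrainer).
```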

References

[1] Dettmers et al., QLoRA: Efficient Finetuning of Quantized LLMs (2023), arXiv

[2] Badri and Shaji, Towards 1-bit Machine Learning Models (2024), Mobius Labs’ Blog

[3] Liu et al., EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (2024), arXiv
