Introduction
Language models keep getting larger. At the time of this writing, PaLM has 540B parameters, OPT, GPT-3, and BLOOM have around 176B parameters, and we are trending towards even larger models. Below is a diagram showing the sizes of some recent language models.
Therefore, these models are hard to run on easily accessible devices. For example, just to do inference on BLOOM-176B, you would need 8x 80GB A100 GPUs (~$15k each). To fine-tune BLOOM-176B, you would need 72 of those GPUs! Much larger models, like PaLM, would require even more resources.
Because these huge models require so many GPUs to run, we need to find ways to reduce these requirements while preserving the model's performance. Various technologies have been developed that try to shrink the model size; you may have heard of quantization and distillation, and there are many others.
After completing the training of BLOOM-176B, we at HuggingFace and BigScience were looking for ways to make this big model easier to run on fewer GPUs. Through our BigScience community we were made aware of research on Int8 inference that does not degrade the predictive performance of large models and reduces their memory footprint by a factor of 2x. Soon we started collaborating on this research, which ended with a full integration into Hugging Face transformers. With this blog post, we offer LLM.int8() integration for all Hugging Face models, which we explain in more detail below. If you want to read more about our research, you can read our paper, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.
This article focuses on giving a high-level overview of this quantization technology, outlining the difficulties in incorporating it into the transformers library, and drawing up the long-term goals of this partnership.
Here you will learn what exactly makes a large model use so much memory. What makes BLOOM 350GB? Let's begin by progressively going over a few basic premises.
Common data types used in Machine Learning
We start with a basic understanding of different floating point data types, which are also referred to as "precision" in the context of Machine Learning.
The size of a model is determined by the number of its parameters, and their precision, typically one of float32, float16 or bfloat16 (image below from: https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/).
Float32 (FP32) stands for the standardized IEEE 32-bit floating point representation. With this data type it is possible to represent a wide range of floating point numbers. In FP32, 8 bits are reserved for the "exponent", 23 bits for the "mantissa" and 1 bit for the sign of the number. In addition, most hardware supports FP32 operations and instructions.
In the float16 (FP16) data type, 5 bits are reserved for the exponent and 10 bits are reserved for the mantissa. This makes the representable range of FP16 numbers much lower than FP32, which exposes FP16 numbers to the risk of overflowing (trying to represent a number that is very large) and underflowing (representing a number that is very small).
For example, if you do 10k * 10k you end up with 100M, which is not possible to represent in FP16, as the largest possible number is 64k. You would thus end up with a NaN (Not a Number) result, and if you have sequential computation like in neural networks, all the prior work is destroyed.
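For illustration, here is a minimal PyTorch sketch (our own toy example, assuming a recent torch install) of that overflow behavior:

import torch

a = torch.tensor(10_000.0, dtype=torch.float16)
print(a * a)                                    # tensor(inf, dtype=torch.float16): 100M overflows FP16
print(torch.tensor(1e-8, dtype=torch.float16))  # tensor(0., dtype=torch.float16): underflow to zero
print(a * a - a * a)                            # inf - inf gives nan, which then poisons later computations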
Usually, loss scaling is used to overcome this issue, but it doesn't always work well.
A new format, bfloat16 (BF16), was created to avoid these constraints. In BF16, 8 bits are reserved for the exponent (which is the same as in FP32) and 7 bits are reserved for the fraction.
This means that in BF16 we can retain the same dynamic range as FP32, but we lose 3 bits of precision with respect to FP16. Now there is absolutely no problem with huge numbers, but the precision is worse than FP16 here.
In the Ampere architecture, NVIDIA also introduced the TensorFloat-32 (TF32) precision format, combining the dynamic range of BF16 and the precision of FP16 to use only 19 bits. It's currently only used internally during certain operations.
In machine learning jargon FP32 is called full precision (4 bytes), while BF16 and FP16 are referred to as half-precision (2 bytes).
On top of that, the int8 (INT8) data type consists of an 8-bit representation that can store 2^8 different values (between [0, 255], or [-128, 127] for signed integers).
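If you want to inspect these ranges yourself, a small PyTorch snippet (our own sketch, assuming torch is installed) prints the bit width, largest value and machine epsilon of each floating point type:

import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "bits:", info.bits, "max:", info.max, "eps:", info.eps)

print(torch.iinfo(torch.int8))  # min=-128, max=127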
While, ideally, training and inference should be done in FP32, it is two times slower than FP16/BF16, so a mixed-precision approach is used where the weights are held in FP32 as a precise "main weights" reference, while computation in the forward and backward pass is done in FP16/BF16 to enhance training speed. The FP16/BF16 gradients are then used to update the FP32 main weights.
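Here is a minimal sketch of that recipe using PyTorch's automatic mixed precision utilities (assuming a CUDA device; this is our own illustrative setup, not the exact one used to train BLOOM), with FP32 main weights, FP16 compute and loss scaling:

import torch
import torch.nn as nn

model = nn.Linear(64, 64).cuda()                 # main weights stay in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()             # loss scaling to avoid FP16 under/overflow

x = torch.randn(8, 64, device="cuda")
with torch.cuda.amp.autocast():                  # forward pass runs in FP16
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()                    # scaled half-precision gradients
scaler.step(optimizer)                           # FP32 main weights are updated
scaler.update()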
During training, the main weights are always stored in FP32, but in practice the half-precision weights often provide similar quality during inference as their FP32 counterparts; a precise reference of the model is only needed when it receives multiple gradient updates. This means we can use the half-precision weights and use half the GPUs to accomplish the same outcome.
To calculate the model size in bytes, one multiplies the number of parameters by the size of the chosen precision in bytes. For example, if we use the bfloat16 version of the BLOOM-176B model, we have 176*10**9 x 2 bytes = 352GB! As discussed earlier, this is quite a challenge to fit into a few GPUs.
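As a sanity check, the arithmetic is easy to reproduce (model_size_gb below is just a throwaway helper of ours; the byte sizes follow the precisions discussed above):

def model_size_gb(num_params: float, bytes_per_param: int) -> float:
    return num_params * bytes_per_param / 1e9

print(model_size_gb(176e9, 4))  # ~704 GB in float32
print(model_size_gb(176e9, 2))  # ~352 GB in bfloat16/float16
print(model_size_gb(176e9, 1))  # ~176 GB in int8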
But what if we can store those weights with less memory using a different data type? A technique called quantization has been used widely in Deep Learning.
Introduction to model quantization
Experimentally, we have discovered that instead of using the 4-byte FP32 precision, we can get an almost identical inference outcome with 2-byte BF16/FP16 half-precision, which halves the model size. It'd be amazing to cut it further, but the inference quality starts to drop dramatically at lower precision.
To remediate that, we introduce 8-bit quantization. This method uses a quarter of the precision, thus needing only 1/4th of the model size! But it's not done by just dropping another half of the bits.
Quantization is done by essentially "rounding" from one data type to another. For example, if one data type has the range 0..9 and another 0..4, then the value "4" in the first data type would be rounded to "2" in the second data type. However, if we have the value "3" in the first data type, it lies between 1 and 2 of the second data type, and we would usually round to "2". This shows that both values "4" and "3" of the first data type have the same value "2" in the second data type. This highlights that quantization is a noisy process that can lead to information loss, a sort of lossy compression.
The two most common 8-bit quantization techniques are zero-point quantization and absolute maximum (absmax) quantization. Zero-point quantization and absmax quantization map the floating point values into more compact int8 (1 byte) values. First, these methods normalize the input by scaling it by a quantization constant.
For example, in zero-point quantization, if my range is -1.0…1.0 and I want to quantize into the range -127…127, I scale by the factor of 127 and then round into 8-bit precision. To retrieve the original value, you would need to divide the int8 value by that same quantization factor of 127. For example, the value 0.3 would be scaled to 0.3*127 = 38.1. Through rounding, we get the value of 38. If we reverse this, we get 38/127 = 0.2992, a quantization error of about 0.0008 in this example. These seemingly tiny errors tend to accumulate and grow as they get propagated through the model's layers and result in performance degradation.
(Image taken from: this blogpost)
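The round trip above is easy to reproduce in a few lines of plain Python (a toy sketch of the scaling step only, ignoring the zero-point offset):

scale = 127 / 1.0             # quantization constant for the range -1.0...1.0
x = 0.3
x_int8 = round(x * scale)     # 38
x_dequant = x_int8 / scale    # 0.2992...
print(abs(x - x_dequant))     # quantization error of roughly 0.0008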
Now let's look at the details of absmax quantization. To calculate the mapping between the fp16 number and its corresponding int8 number in absmax quantization, you have to first divide by the absolute maximum value of the tensor and then multiply by the total range of the data type.
For example, let's assume you want to apply absmax quantization to a vector that contains [1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4]. You extract the absolute maximum of it, which is 5.4 in this case. Int8 has a range of [-127, 127], so we divide 127 by 5.4 and obtain 23.5 for the scaling factor. Therefore multiplying the original vector by it gives the quantized vector [28, -12, -101, 28, -73, 19, 56, 127].
To retrieve the original values, one can just divide the int8 number by the quantization factor in full precision, but since the result above is "rounded", some precision will be lost.
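A small NumPy sketch (our own illustration, assuming NumPy is installed) reproduces the numbers above:

import numpy as np

x = np.array([1.2, -0.5, -4.3, 1.2, -3.1, 0.8, 2.4, 5.4], dtype=np.float32)
scale = 127 / np.abs(x).max()                  # 127 / 5.4 ~= 23.5
x_int8 = np.round(x * scale).astype(np.int8)   # [ 28 -12 -101 28 -73 19 56 127]
x_dequant = x_int8 / scale                     # close to the original values, minus rounding error
print(x_int8)
print(x_dequant)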
For an unsigned int8, we would subtract the minimum and scale by the absolute maximum. This is close to what zero-point quantization does. It's similar to min-max scaling, but the latter maintains the value scales in such a way that the value "0" is always represented by an integer without any quantization error.
These tricks can be combined in several ways, for example row-wise or vector-wise quantization, when it comes to matrix multiplication for more accurate results. Looking at the matrix multiplication, A*B=C, instead of regular quantization that normalizes by an absolute maximum value per tensor, vector-wise quantization finds the absolute maximum of each row of A and each column of B. Then we normalize A and B by dividing by these vectors. We then multiply A*B to get C. Finally, to get back the FP16 values, we denormalize by computing the outer product of the absolute maximum vectors of A and B. More details on this technique can be found in the LLM.int8() paper or in the blog post about quantization and emergent features on Tim's blog.
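To make the recipe concrete, here is a toy NumPy sketch of vector-wise quantization for A*B=C (vectorwise_quant_matmul is just our own helper name; the int8 matmul is merely emulated here, while a real kernel is of course far more optimized):

import numpy as np

def vectorwise_quant_matmul(A, B):
    sA = 127 / np.abs(A).max(axis=1, keepdims=True)   # one scale per row of A
    sB = 127 / np.abs(B).max(axis=0, keepdims=True)   # one scale per column of B
    A8 = np.round(A * sA).astype(np.int8)
    B8 = np.round(B * sB).astype(np.int8)
    C32 = A8.astype(np.int32) @ B8.astype(np.int32)   # int8 matmul accumulated in int32
    return C32 / (sA * sB)                            # denormalize with the outer product of the scales

A = np.random.randn(4, 8).astype(np.float32)
B = np.random.randn(8, 3).astype(np.float32)
print(np.abs(vectorwise_quant_matmul(A, B) - A @ B).max())  # small quantization error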
While these basic techniques enable us to quantize Deep Learning models, they usually lead to a drop in accuracy for larger models. The LLM.int8() implementation that we integrated into the Hugging Face Transformers and Accelerate libraries is the first technique that does not degrade performance even for large models with 176B parameters, such as BLOOM.
A gentle summary of LLM.int8(): zero degradation matrix multiplication for Large Language Models
In LLM.int8(), we have demonstrated that it is crucial to comprehend the scale-dependent emergent properties of transformers in order to understand why traditional quantization fails for large models. We demonstrate that performance deterioration is caused by outlier features, which we explain in the next section. The LLM.int8() algorithm itself can be explained as follows.
In essence, LLM.int8() seeks to complete the matrix multiplication computation in three steps:
- From the input hidden states, extract the outliers (i.e. values which are larger than a certain threshold) by column.
- Perform the matrix multiplication of the outliers in FP16 and the non-outliers in int8.
- Dequantize the non-outlier results and add the outlier and non-outlier results together to receive the full result in FP16.
These steps can be summarized in the following animation:
The importance of outlier features
A value that is outside the range of some numbers' global distribution is generally referred to as an outlier. Outlier detection has been widely used and covered in the current literature, and having prior knowledge of the distribution of your features helps with the task of outlier detection. More specifically, we have observed that classic quantization at scale fails for transformer-based models >6B parameters. While large outlier features are also present in smaller models, we observe that beyond a certain threshold these outliers follow highly systematic patterns across transformers and are present in every layer of the transformer. For more details on these phenomena see the LLM.int8() paper and the emergent features blog post.
As mentioned earlier, 8-bit precision is extremely constrained, therefore quantizing a vector with several big values can produce wildly erroneous results. In addition, because of a built-in characteristic of the transformer-based architecture that links all the elements together, these errors tend to compound as they get propagated across multiple layers. Therefore, mixed-precision decomposition has been developed to facilitate efficient quantization with such extreme outliers. It is discussed next.
Inside the MatMul
Once the hidden states are computed, we extract the outliers using a custom threshold and decompose the matrix into two parts as explained above. We found that extracting all outliers with magnitude 6 or greater in this way recovers full inference performance. The outlier part is done in fp16, so it is a classic matrix multiplication, whereas the 8-bit matrix multiplication is done by quantizing the weights and hidden states into 8-bit precision using vector-wise quantization, that is, row-wise quantization for the hidden state and column-wise quantization for the weight matrix.
After this step, the results are dequantized and returned in half-precision in order to add them to the first matrix multiplication.
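The decomposition can be sketched in a few lines of PyTorch. This is only our own toy illustration of the idea, not the bitsandbytes kernel: everything runs in float32 and the int8 matmul is merely emulated.

import torch

def llm_int8_matmul_sketch(x, W, threshold=6.0):
    # 1. Extract outlier feature dimensions: columns of the hidden states whose
    #    magnitude exceeds the threshold anywhere in the batch.
    outlier_cols = (x.abs() > threshold).any(dim=0)

    # 2a. Outlier part: a regular matmul on the few outlier columns (FP16 in the real kernel).
    out_outlier = x[:, outlier_cols] @ W[outlier_cols, :]

    # 2b. Non-outlier part: vector-wise quantization, row-wise for the hidden
    #     states and column-wise for the weight matrix.
    x_sub, W_sub = x[:, ~outlier_cols], W[~outlier_cols, :]
    sx = 127 / x_sub.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    sw = 127 / W_sub.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)
    x8 = torch.round(x_sub * sx).to(torch.int8)
    w8 = torch.round(W_sub * sw).to(torch.int8)
    # Emulated int8 matmul (a real kernel accumulates in int32), then
    # 3. dequantize and add both partial results together.
    out_int8 = (x8.float() @ w8.float()) / (sx * sw)
    return out_outlier + out_int8

x = torch.randn(4, 64)
x[:, 3] = 8.0                       # plant an artificial outlier feature
W = torch.randn(64, 16)
print((llm_int8_matmul_sketch(x, W) - x @ W).abs().max())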
What does 0 degradation mean?
How can we properly evaluate the performance degradation of this method? How much quality do we lose in terms of generation when using 8-bit models?
We ran several common benchmarks with the 8-bit and native models using lm-eval-harness and reported the results.
For OPT-175B:
| name | metric | value - int8 | value - fp16 | std err - fp16 | difference - value |
|---|---|---|---|---|---|
| hellaswag | acc_norm | 0.7849 | 0.7849 | 0.0041 | 0 |
| hellaswag | acc | 0.5921 | 0.5931 | 0.0049 | 0.001 |
| piqa | acc | 0.7965 | 0.7959 | 0.0094 | 0.0006 |
| piqa | acc_norm | 0.8101 | 0.8107 | 0.0091 | 0.0006 |
| lambada | ppl | 3.0142 | 3.0152 | 0.0552 | 0.001 |
| lambada | acc | 0.7464 | 0.7466 | 0.0061 | 0.0002 |
| winogrande | acc | 0.7174 | 0.7245 | 0.0125 | 0.0071 |
For BLOOM-176B:
| name | metric | value - int8 | value - bf16 | std err - bf16 | difference - value |
|---|---|---|---|---|---|
| hellaswag | acc_norm | 0.7274 | 0.7303 | 0.0044 | 0.0029 |
| hellaswag | acc | 0.5563 | 0.5584 | 0.005 | 0.0021 |
| piqa | acc | 0.7835 | 0.7884 | 0.0095 | 0.0049 |
| piqa | acc_norm | 0.7922 | 0.7911 | 0.0095 | 0.0011 |
| lambada | ppl | 3.9191 | 3.931 | 0.0846 | 0.0119 |
| lambada | acc | 0.6808 | 0.6718 | 0.0065 | 0.009 |
| winogrande | acc | 0.7048 | 0.7048 | 0.0128 | 0 |
We indeed observe 0 performance degradation for those models since the absolute differences of the metrics are all below the standard error (except for BLOOM-int8, which is slightly better than the native model on lambada). For a more detailed performance evaluation against state-of-the-art approaches, take a look at the paper!
Is it faster than native models?
The main purpose of the LLM.int8() method is to make large models more accessible without performance degradation. But the method would be less useful if it were very slow. So we benchmarked the generation speed of multiple models.
We find that BLOOM-176B with LLM.int8() is about 15% to 23% slower than the fp16 version, which is still quite acceptable. We found larger slowdowns for smaller models, like T5-3B and T5-11B. We worked hard to speed up these small models. Within a day, we could improve inference per token from 312 ms to 173 ms for T5-3B and from 45 ms to 25 ms for T5-11B. Additionally, issues were already identified, and LLM.int8() will likely be faster still for small models in upcoming releases. For now, the current numbers are in the table below.
| Precision | Number of parameters | Hardware | Time per token in milliseconds for Batch Size 1 | Time per token in milliseconds for Batch Size 8 | Time per token in milliseconds for Batch Size 32 |
|---|---|---|---|---|---|
| bf16 | 176B | 8xA100 80GB | 239 | 32 | 9.9 |
| int8 | 176B | 4xA100 80GB | 282 | 37.5 | 10.2 |
| bf16 | 176B | 14xA100 40GB | 285 | 36.5 | 10.4 |
| int8 | 176B | 5xA100 40GB | 367 | 46.4 | oom |
| fp16 | 11B | 2xT4 15GB | 11.7 | 1.7 | 0.5 |
| int8 | 11B | 1xT4 15GB | 43.5 | 5.3 | 1.3 |
| fp32 | 3B | 2xT4 15GB | 45 | 7.2 | 3.1 |
| int8 | 3B | 1xT4 15GB | 312 | 39.1 | 10.2 |
The three models are BLOOM-176B, T5-11B and T5-3B.
Hugging Face transformers integration nuances
Next let's discuss the specifics of the Hugging Face transformers integration. Let's look at the usage and the common culprits you may encounter while trying to set things up.
Usage
The module responsible for the whole magic described in this blog post is called Linear8bitLt and you can easily import it from the bitsandbytes library. It is derived from a classic torch.nn Module and can be easily used and deployed in your architecture with the code described below.
Here is a step-by-step example of the following use case: let's say you want to convert a small model into int8 using bitsandbytes.
- First we need the correct imports below!
import torch
import torch.nn as nn
import bitsandbytes as bnb
from bitsandbytes.nn import Linear8bitLt
- Then you can define your own model. Note that you can convert a checkpoint or model of any precision to 8-bit (FP16, BF16 or FP32) but, currently, the input of the model has to be FP16 for our Int8 module to work. So we treat our model here as an fp16 model.
fp16_model = nn.Sequential(
    nn.Linear(64, 64),
    nn.Linear(64, 64)
)
- Let's say you have trained your model on your favorite dataset and task! Now it's time to save the model:
[... train the model ...]
torch.save(fp16_model.state_dict(), "model.pt")
- Now that your state_dict is saved, let us define an int8 model:
int8_model = nn.Sequential(
    Linear8bitLt(64, 64, has_fp16_weights=False),
    Linear8bitLt(64, 64, has_fp16_weights=False)
)
Here it is very important to add the flag has_fp16_weights. By default, this is set to True, which is used to train in mixed Int8/FP16 precision. However, we are interested in memory-efficient inference, for which we need to use has_fp16_weights=False.
- Now time to load your model in 8-bit!
int8_model.load_state_dict(torch.load("model.pt"))
int8_model = int8_model.to(0)
Note that the quantization step is done in the second line once the model is set on the GPU. If you print int8_model[0].weight before calling the .to function you get:
int8_model[0].weight
Parameter containing:
tensor([[ 0.0031, -0.0438, 0.0494, ..., -0.0046, -0.0410, 0.0436],
[-0.1013, 0.0394, 0.0787, ..., 0.0986, 0.0595, 0.0162],
[-0.0859, -0.1227, -0.1209, ..., 0.1158, 0.0186, -0.0530],
...,
[ 0.0804, 0.0725, 0.0638, ..., -0.0487, -0.0524, -0.1076],
[-0.0200, -0.0406, 0.0663, ..., 0.0123, 0.0551, -0.0121],
[-0.0041, 0.0865, -0.0013, ..., -0.0427, -0.0764, 0.1189]],
dtype=torch.float16)
Whereas if you print it after the second line's call you get:
int8_model[0].weight
Parameter containing:
tensor([[ 3, -47, 54, ..., -5, -44, 47],
[-104, 40, 81, ..., 101, 61, 17],
[ -89, -127, -125, ..., 120, 19, -55],
...,
[ 82, 74, 65, ..., -49, -53, -109],
[ -21, -42, 68, ..., 13, 57, -12],
[ -4, 88, -1, ..., -43, -78, 121]],
device="cuda:0", dtype=torch.int8, requires_grad=True)
The weight values are "truncated" as we have seen when explaining quantization in the previous sections. Also, the values seem to be distributed between [-127, 127].
You might also wonder how to retrieve the FP16 weights in order to perform the outlier MatMul in fp16? You can simply do:
(int8_model[0].weight.CB * int8_model[0].weight.SCB) / 127
And you will get:
tensor([[ 0.0028, -0.0459, 0.0522, ..., -0.0049, -0.0428, 0.0462],
[-0.0960, 0.0391, 0.0782, ..., 0.0994, 0.0593, 0.0167],
[-0.0822, -0.1240, -0.1207, ..., 0.1181, 0.0185, -0.0541],
...,
[ 0.0757, 0.0723, 0.0628, ..., -0.0482, -0.0516, -0.1072],
[-0.0194, -0.0410, 0.0657, ..., 0.0128, 0.0554, -0.0118],
[-0.0037, 0.0859, -0.0010, ..., -0.0423, -0.0759, 0.1190]],
device="cuda:0")
Which is close enough to the original FP16 values (2 print outs above)!
- Now you can safely infer using your model by making sure your input is on the correct GPU and is in FP16:
input_ = torch.randn((1, 64), dtype=torch.float16)
hidden_states = int8_model(input_.to(torch.device('cuda', 0)))
Check out the example script for the full minimal code!
As a side note, you should be aware that these modules differ slightly from nn.Linear modules in that their parameters come from the bnb.nn.Int8Params class rather than the nn.Parameter class. You will see later that this presented an additional obstacle on our journey!
Now the time has come to understand how to integrate that into the transformers library!
accelerate is all you need
When working with huge models, the accelerate library includes a number of helpful utilities. The init_empty_weights method is especially helpful because any model, regardless of size, may be initialized with this method as a context manager without allocating any memory for the model weights.
import torch.nn as nn
from accelerate import init_empty_weights
with init_empty_weights():
    model = nn.Sequential(*[nn.Linear(100000, 100000) for _ in range(1000)])
The initialized model will be placed on PyTorch's meta device, an underlying mechanism to represent shape and dtype without allocating memory for storage. How cool is that?
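You can check this yourself with a small sketch continuing the toy model above: the parameters report the meta device while occupying no storage at all.

print(model[0].weight.device)                                  # device(type='meta'), no storage allocated
print(sum(p.numel() for p in model.parameters()) / 1e9, "B parameters, yet ~0 bytes of RAM used")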
Initially, this function is called inside the .from_pretrained function and overrides all parameters to torch.nn.Parameter. This would not fit our requirement since we want to keep the Int8Params class in our case for Linear8bitLt modules, as explained above. We managed to fix that in the following PR that modifies:
module._parameters[name] = nn.Parameter(module._parameters[name].to(torch.device("meta")))
to
param_cls = type(module._parameters[name])
kwargs = module._parameters[name].__dict__
module._parameters[name] = param_cls(module._parameters[name].to(torch.device("meta")), **kwargs)
Now that this is fixed, we can easily leverage this context manager and play with it to replace all nn.Linear modules with bnb.nn.Linear8bitLt at no memory cost using a custom function!
def replace_8bit_linear(model, threshold=6.0, module_to_not_convert="lm_head"):
    for name, module in model.named_children():
        if len(list(module.children())) > 0:
            replace_8bit_linear(module, threshold, module_to_not_convert)

        if isinstance(module, nn.Linear) and name != module_to_not_convert:
            with init_empty_weights():
                model._modules[name] = bnb.nn.Linear8bitLt(
                    module.in_features,
                    module.out_features,
                    module.bias is not None,
                    has_fp16_weights=False,
                    threshold=threshold,
                )
    return model
This function recursively replaces all nn.Linear layers of a given model initialized on the meta device with a Linear8bitLt module. The attribute has_fp16_weights has to be set to False in order to directly load the weights in int8 together with the quantization statistics.
We also discard the replacement for some modules (here the lm_head) since we want to keep them in their native precision for more precise and stable results.
But it is not over yet! The function above is executed under the init_empty_weights context manager, which means that the new model will still be on the meta device.
For models that are initialized under this context manager, accelerate will manually load the parameters of each module and move them to the correct devices.
In bitsandbytes, setting a Linear8bitLt module's device is a crucial step (if you are curious, you can check the code snippet here), as we have seen in our toy script.
Here the quantization step fails when calling it twice. We had to come up with an implementation of accelerate's set_module_tensor_to_device function (termed set_module_8bit_tensor_to_device) to make sure we don't call it twice. Let's discuss this in detail in the section below!
Be very careful on how to set devices with accelerate
Here we played a very delicate balancing act with the accelerate library!
Once you load your model and set it on the correct devices, sometimes you still need to call set_module_tensor_to_device to dispatch the model with hooks on all devices. This is done inside the dispatch_model function from accelerate, which involves potentially calling .to several times and is something we want to avoid.
Two Pull Requests were needed to achieve what we wanted! The initial PR proposed here broke some tests but this PR successfully fixed everything!
Wrapping it all up
Therefore the ultimate recipe is:
- Initialize a model on the meta device with the correct modules
- Set the parameters one by one on the correct GPU device and make sure you never do this procedure twice!
- Put new keyword arguments in the correct place everywhere, and add some nice documentation
- Add very extensive tests! Check our tests here for more details
This may sound quite easy, but we went through many hard debugging sessions together, often involving CUDA kernels!
All said and done, this integration adventure was very fun; from deep diving and doing some "surgery" on different libraries to aligning everything and making it work!
Now it's time to see how to benefit from this integration and how to successfully use it in transformers!
How to use it in transformers
Hardware requirements
8-bit tensor cores are not supported on the CPU. bitsandbytes can be run on 8-bit tensor core-supported hardware, which means Turing and Ampere GPUs (RTX 20s, RTX 30s, A40-A100, T4+). For example, Google Colab GPUs are usually NVIDIA T4 GPUs, and their latest generation does support 8-bit tensor cores. Our demos are based on Google Colab so check them out below!
Installation
Just install the latest version of the libraries using the commands below (make sure that you are using python>=3.8) and run the commands below to try it out.
pip install accelerate
pip install bitsandbytes
pip install git+https://github.com/huggingface/transformers.git
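Once those are installed, loading a supported model in 8-bit is just a matter of passing load_in_8bit=True together with device_map="auto" to from_pretrained. Here is a minimal sketch (bigscience/bloom-3b is only an example checkpoint; assuming a GPU from the list above is available):

from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-3b"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto", load_in_8bit=True)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))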
Example demos – running T5-11B on a Google Colab
Check out the Google Colab demos for running 8-bit models on a BLOOM-3B model!
Here is the demo for running T5-11B. The T5-11B model checkpoint is in FP32, which uses 42GB of memory and does not fit on Google Colab. With our 8-bit modules it only uses 11GB and fits easily:
Or this demo for BLOOM-3B:
Scope of improvements
This approach, in our opinion, greatly improves access to very large models. With no performance degradation, it enables users with less compute to access models that were previously inaccessible.
We have found several areas for improvement that can be worked on in the future to make this method even better for large models!
Faster inference speed for smaller models
As we have seen in the benchmarking section, we could improve the runtime speed for small models (<=6B parameters) by a factor of almost 2x. However, while the inference speed is robust for large models like BLOOM-176B, there are still improvements to be had for small models. We already identified the issues and will likely recover the same performance as fp16, or get small speedups. You will see these changes being integrated within the next couple of weeks.
Support for Kepler GPUs (GTX 1080 etc)
While we support all GPUs from the past four years, some old GPUs like the GTX 1080 still see heavy use. While these GPUs do not have Int8 tensor cores, they do have Int8 vector units (a kind of "weak" tensor core). As such, these GPUs can also experience Int8 acceleration. However, it requires an entirely different stack of software for fast inference. While we do plan to integrate support for Kepler GPUs to make the LLM.int8() feature more widely available, it will take some time to realize this due to its complexity.
Saving 8-bit state dicts on the Hub
8-bit state dicts cannot currently be loaded directly into the 8-bit model after being pushed to the Hub. This is due to the fact that the statistics (remember weight.CB and weight.SCB) computed by the model are not currently stored or taken into account inside the state dict, and the Linear8bitLt module does not support this feature yet.
We think that having the ability to save that and push it to the Hub might contribute to greater accessibility.
CPU support
CPU devices do not support 8-bit cores, as was stated at the beginning of this blogpost. Can we, however, get past that? Running this module on CPUs would also significantly improve usability and accessibility.
Scaling up on other modalities
Currently, language models dominate very large models. Leveraging this method on very large vision, audio, and multi-modal models might be an interesting thing to do for better accessibility in the coming years as these models become more accessible.
Credits
Huge thanks to the following people who contributed to improving the readability of the article as well as to the integration procedure in transformers (listed in alphabetical order):
JustHeuristic (Yozh),
Michael Benayoun,
Stas Bekman,
Steven Liu,
Sylvain Gugger,
Tim Dettmers