The integration of GaLore into the training of large language models (LLMs) marks a significant advance in deep learning, particularly in terms of memory efficiency and the democratization of AI research. By allowing billion-parameter models to be trained on consumer-grade hardware, reducing the memory footprint of optimizer states, and leveraging advanced projection matrix techniques, GaLore opens new horizons for researchers and practitioners with limited access to high-end computational resources.
Scaling LLMs with Consumer-Grade Hardware
GaLore's ability to facilitate the training of models with up to 7 billion parameters, such as those based on the Llama architecture, on consumer GPUs like the NVIDIA RTX 4090 is groundbreaking. This is achieved by significantly reducing the memory requirements traditionally associated with optimizer states and gradients during training. The approach leverages the inherent low-rank structure of gradients in deep neural networks, applying a projection that reduces the dimensionality of the data that needs to be stored and manipulated.
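To make the idea concrete, here is a minimal, self-contained sketch of low-rank gradient projection (not the galore-torch implementation): the gradient's top singular vectors define a projection, the optimizer works on the projected gradient, and the resulting update is projected back to the original shape. All dimensions and the rank below are illustrative.

import torch

# Illustrative sketch of GaLore's core idea: project a full gradient into a
# low-rank subspace, run the optimizer on the small tensor, then project the
# resulting update back to the original weight shape.

def get_projection(grad: torch.Tensor, rank: int) -> torch.Tensor:
    # Use the top-r left singular vectors of the gradient as the projection basis.
    U, S, Vh = torch.linalg.svd(grad, full_matrices=False)
    return U[:, :rank]                      # shape (m, r)

m, n, r = 4096, 4096, 128                   # hypothetical layer and rank sizes
grad = torch.randn(m, n)

P = get_projection(grad, r)
low_rank_grad = P.T @ grad                  # shape (r, n): what the optimizer sees
update = P @ low_rank_grad                  # projected back to (m, n) for the weight update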
Memory Efficiency in Optimizer States
The optimizer state, especially in adaptive optimization algorithms like Adam, accounts for a significant portion of the memory footprint during model training. GaLore addresses this by projecting the gradients into a lower-dimensional subspace before they are processed by the optimizer. This not only reduces the memory required to store these states but also maintains the effectiveness of the optimization process.
The memory savings are substantial, with the authors reporting “more than 82.5% reduction in memory for storing optimizer states during training”, making it feasible to train larger models or use larger batch sizes within the same memory constraints. When combined with 8-bit precision optimizers, these savings can be even more pronounced.
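As a rough back-of-envelope illustration (with made-up layer dimensions and rank, not figures from the paper), the savings come from storing Adam's two moments in the low-rank shape rather than the full weight shape:

# Rough comparison of Adam optimizer-state memory for a single weight matrix,
# with and without GaLore. Dimensions, rank, and precision are illustrative.
m, n, r = 4096, 11008, 1024   # e.g. an MLP projection in a 7B-class model
bytes_per_value = 4           # fp32 optimizer states

adam_states = 2 * m * n * bytes_per_value                               # first and second moments
galore_states = 2 * r * n * bytes_per_value + m * r * bytes_per_value  # low-rank moments + projection matrix

print(f"Adam:   {adam_states / 2**20:.1f} MiB")
print(f"GaLore: {galore_states / 2**20:.1f} MiB "
      f"({100 * (1 - galore_states / adam_states):.1f}% smaller)")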
Subspace Switching and Advanced Projection Techniques
A critical component of GaLore's effectiveness is its dynamic subspace switching mechanism, which allows the model to move through different low-rank subspaces over the course of training. This ensures that the model is not confined to a limited portion of the parameter space, thus preserving the capacity for full-parameter learning. The decision of when and how often to switch subspaces is pivotal, with the switching frequency being a balance between maintaining a consistent optimization trajectory and adapting to the evolving low-rank structure of the gradients.
The ability to dynamically adjust these projections in response to changes in the gradient structure is a potent tool in the GaLore arsenal, allowing for more nuanced control over the memory-optimization trade-offs inherent in training large models.
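A minimal sketch of what such switching can look like in practice, assuming a fixed switching interval and using random tensors as stand-ins for real gradients:

import torch

# Minimal sketch of subspace switching: the projection is recomputed from a fresh
# SVD of the current gradient every `update_proj_gap` steps, so the optimizer is
# not locked into a single low-rank subspace for the whole run. Hyperparameter
# values and the toy loop are purely illustrative.

m, n, rank, update_proj_gap, num_steps = 1024, 1024, 64, 200, 1000
P = None

for step in range(num_steps):
    grad = torch.randn(m, n)                      # stand-in for a real backprop gradient
    if step % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        P = U[:, :rank]                           # switch to a new low-rank subspace
    low_rank_grad = P.T @ grad                    # (rank, n): what the optimizer state sees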
Combining GaLore with 8-bit Optimizers
The combination of GaLore with 8-bit precision optimizers represents a synergy that maximizes memory efficiency while maintaining the integrity and performance of the training process. 8-bit optimizers reduce the memory footprint by quantizing the optimizer states. When used together with GaLore's projection mechanism, the result is a highly memory-efficient training regime that does not compromise on model accuracy or convergence speed.
This combination is particularly effective in scenarios where memory is a critical bottleneck, such as training large models on consumer-grade hardware or deploying models in memory-constrained environments. It enables the use of more complex models and larger datasets within the same hardware constraints, pushing the boundaries of what can be achieved with limited resources.
Implementation Details
Integrating 8-bit optimizers with GaLore for training large language models (LLMs) involves quantizing the gradients, weights, and optimizer states to 8-bit representations. This quantization process significantly reduces the memory footprint, enabling the training of larger models or the use of larger batch sizes within the same memory constraints. The algorithmic details of this integration involve several key steps, some of which would benefit significantly from native CUDA implementations for efficiency gains. GaLore opens new possibilities to integrate these techniques even more tightly with quantization and specialized parameterization of the matrices, which could lead to further reductions in memory usage. We are currently exploring this direction in the bitsandbytes library.
Algorithmic Overview of 8-bit Optimization with GaLore
Gradient Projection: GaLore projects the full-precision gradients into a low-rank subspace using projection matrices. This step reduces the dimensionality of the gradients, which are then quantized to 8-bit format.
Quantization: The projected gradients, along with the model weights and optimizer states (such as the moving averages in Adam), are quantized from 32-bit floating-point to 8-bit integer representations. This involves scaling the floating-point values to the 8-bit range and rounding them to the nearest integer.
Optimizer Update: The 8-bit quantized gradients are used to update the model weights. This step involves de-quantizing the gradients back to floating-point format, applying the optimizer's update rule (e.g., Adam's moment update and parameter adjustment), and then quantizing the updated optimizer states back to 8-bit for storage.
De-quantization and Weight Update: The 8-bit quantized weights are de-quantized to a floating-point representation for processing, while still retaining the limited range of values inherent to their 8-bit quantized form. This step is required because standard operations in frameworks like PyTorch do not support 8-bit integers, and such integer weights cannot accommodate gradients. While this approach does not inherently improve accuracy, it makes the practical application and gradient computation of quantized weights possible within the constraints of current deep learning libraries. Note that after de-quantization and before applying the weight update, GaLore employs one more projection that maps the de-quantized low-rank updates back to the original space. A sketch of this round trip is shown below.
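The following compressed sketch walks through the quantization, optimizer update, and de-quantization steps above using naive absmax int8 helpers; real 8-bit optimizers such as those in bitsandbytes use block-wise quantization and fused CUDA kernels, so treat this purely as an illustration of the data flow:

import torch

# Illustrative quantize / de-quantize round trip for an optimizer state, followed
# by GaLore's projection back to the original weight shape. All shapes, helper
# functions, and the stand-in update rule are illustrative, not library code.

def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).to(torch.int8), scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

# Low-rank gradient produced by the projection step.
low_rank_grad = torch.randn(128, 4096)

q_state, scale = quantize_int8(low_rank_grad)        # store optimizer state in 8 bits
state_fp32 = dequantize_int8(q_state, scale)         # de-quantize before the update rule
low_rank_update = -1e-4 * state_fp32                 # stand-in for the optimizer's update rule

P = torch.randn(4096, 128)                           # projection matrix from the GaLore step
full_update = P @ low_rank_update                    # project back to the original weight shape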
Use it with Hugging Face Transformers
To use GaLore optimizers with the Hugging Face transformers library, you first need to update it to a version that supports GaLore optimizers, either by installing the latest release, i.e. pip install transformers>=4.39.0, or by installing transformers from source.
Then install the galore-torch library with pip install galore-torch. Below is a full working example of GaLore with transformers, for pretraining Mistral-7B on the imdb dataset:
import torch
import datasets
from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM
import trl

# Load the dataset to pretrain on.
train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],
)

model_id = "mistralai/Mistral-7B-v0.1"
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Initialize the model from the config (random weights, since this is pretraining) and move it to GPU 0.
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()
TrainingArguments: Simply pass a valid optim_target_modules (it supports a single string, a regex, or a list of strings or regexes) as well as a valid GaLore optimizer for optim, such as galore_adamw, galore_adamw_8bit, or galore_adafactor, and you're good to go!
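For instance, a hypothetical configuration combining GaLore with the 8-bit optimizer and regex-based module matching might look like the following; the GaLore hyperparameters passed via optim_args are illustrative values, so check the transformers documentation for the options supported by your installed version:

from transformers import TrainingArguments

# Example: 8-bit GaLore AdamW with regex-matched target modules.
# The rank / update_proj_gap / scale values are illustrative, not tuned recommendations.
args = TrainingArguments(
    output_dir="./test-galore-8bit",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw_8bit",
    optim_target_modules=[r".*attn.*", r".*mlp.*"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
)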
Layer-wise Updates
Another important point to mention is layer-wise optimizers (i.e. updating weights one layer at a time). Typically, the optimizer performs a single weight update for all layers after backpropagation, which requires keeping the full set of weight gradients in memory. By adopting layer-wise weight updates, we can further reduce the memory footprint during training. Under the hood, this is implemented with PyTorch post-accumulation hooks on the layers the user wants to update, as sketched below.
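As a minimal sketch of that mechanism (illustrating the hook pattern itself, not the galore-torch internals), each parameter can be given its own optimizer that steps as soon as that parameter's gradient has been accumulated during the backward pass:

import torch

# Layer-wise updates via post-accumulation hooks (requires PyTorch >= 2.1).
# Each parameter gets its own optimizer that steps as soon as that parameter's
# gradient is ready, so gradients for all layers never need to be held at once.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 10))
per_param_optimizers = {p: torch.optim.AdamW([p], lr=1e-3) for p in model.parameters()}

def step_and_free(param: torch.Tensor) -> None:
    per_param_optimizers[param].step()
    per_param_optimizers[param].zero_grad()   # free this parameter's gradient right away

for p in model.parameters():
    p.register_post_accumulate_grad_hook(step_and_free)

loss = model(torch.randn(8, 512)).sum()
loss.backward()                               # per-parameter updates happen during backward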
To use this feature, simply append _layerwise to the optimizer name, for example galore_adamw_layerwise.
Conclusion
GaLore, with its innovative approach to leveraging the low-rank structure of gradients, represents a significant step forward in the memory-efficient training of LLMs. By enabling the training of billion-parameter models on consumer-grade hardware, reducing the memory footprint of optimizer states through projection techniques, and allowing for dynamic subspace switching, GaLore democratizes access to large-scale model training. Its compatibility with 8-bit precision optimizers further enhances its utility, offering a pathway to training larger and more complex models without the need for specialized computational resources. This opens up new possibilities for research and application in AI, making it an exciting time for practitioners and researchers alike.
Resources
Please refer to the original paper. Twitter references: 1 2 3. The paper also draws comparisons between GaLore and ReLoRA, which may be of interest to some readers. For readers with questions that remain unanswered, especially after reviewing the paper, or who would like to constructively discuss the results, please feel free to join the author's Slack community. For those interested in further releases along these lines, please follow Jiawei Zhao and Titus von Koeller (for information on the latest bitsandbytes releases) as well as Younes Belkada for the latest and greatest info on quantization-related topics within and around the Hugging Face ecosystem.
