Where did all of the memory go?

Better ML

LLM finetuning version.

The last time I thought about memory was when I started using Chrome. Recently, while attempting to fine-tune a small LLM (:P) with model weights of roughly ~13 GB on an A100 40 GB card, training failed with the dreaded CUDA OOM error. Dropping the batch size to single digits and cutting the sequence length significantly didn't help, which left me pondering: where did all my memory go?

Models have grown ~1000x in number of parameters over the last 5 years (BERT -> GPT-4), but GPU memory has grown only ~5x (V100 16 GB -> A100 80 GB).

Existing solutions like data parallelism don't help fit the model into device memory, and model parallelism is hard to implement and not very efficient throughput-wise. Before diving into solutions, let's figure out what's happening with the memory.
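As a starting point, PyTorch exposes counters that show where device memory is going. Below is a minimal sketch, assuming a CUDA device is available; the linear model and random batch are just placeholders for your own fine-tuning setup:

```python
import torch

# Placeholder model and batch -- substitute your own fine-tuning setup.
model = torch.nn.Linear(4096, 4096).cuda()
optimizer = torch.optim.AdamW(model.parameters())

x = torch.randn(8, 4096, device="cuda")
loss = model(x).sum()
loss.backward()
optimizer.step()

# Memory handed out to tensors vs. memory reserved from the CUDA driver.
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved : {torch.cuda.memory_reserved() / 1e9:.2f} GB")
print(f"peak     : {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Detailed breakdown from the caching allocator, including fragmentation.
print(torch.cuda.memory_summary())
```

The gap between "allocated" and "reserved" is a rough proxy for how much the caching allocator is holding onto (and fragmenting) beyond what live tensors need.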

  • During training, the three main memory consumers are the optimizer states, gradients, and parameters. Beyond these, activations and temporary buffers eat up the rest of the memory, and memory fragmentation adds to the woes.
  • Let's say we're training with the AdamW optimizer in fp32 (a quick sketch of this arithmetic follows below):
    Model weights (parameters) = 4 bytes * num_params = N
    Optimizer states (momentum + variance) = 8 bytes * num_params = 2N
    Gradients = 4 bytes * num_params = N
    The total memory for training is 4N!
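A quick back-of-the-envelope helper for the arithmetic in the bullet above (plain Python; activations and temporary buffers are deliberately not counted, and the function name is just for illustration):

```python
def training_memory_gb(model_weights_gb: float) -> dict:
    """The '4N' rule of thumb: weights (N) + optimizer states (2N) + gradients (N)."""
    weights = model_weights_gb               # N
    optimizer_states = 2 * model_weights_gb  # 2N (AdamW keeps momentum + variance)
    gradients = model_weights_gb             # N
    return {
        "weights_gb": weights,
        "optimizer_gb": optimizer_states,
        "gradients_gb": gradients,
        "total_gb": weights + optimizer_states + gradients,  # 4N
    }

# The ~13 GB model from the intro -> ~52 GB before activations even enter the picture.
print(training_memory_gb(13))
```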

  • The Llama-7B model weights come to around 12 GB, which means we need ~48 GB+ of GPU memory per card to fine-tune Llama-7B. The standard A100 GPU card available on AWS has only 40 GB of memory.
    Activations eat GPU memory too, and they depend on your batch size and sequence length (the easiest knobs to turn); see the measurement sketch below.
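One way to see how much those two knobs matter is to simply measure peak memory at a few settings. The sketch below uses a toy transformer encoder standing in for a real LLM (the layer sizes are arbitrary placeholders, and a CUDA device with some headroom is assumed):

```python
import torch
import torch.nn as nn

# Toy stand-in for an LLM: the parameter count stays fixed while batch/seq vary.
layer = nn.TransformerEncoderLayer(d_model=1024, nhead=16, dim_feedforward=4096,
                                   batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=8).cuda()

for batch_size, seq_len in [(1, 512), (4, 512), (4, 2048)]:
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(batch_size, seq_len, 1024, device="cuda")
    model(x).sum().backward()            # forward + backward keeps activations alive
    model.zero_grad(set_to_none=True)
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"batch={batch_size:2d} seq={seq_len:4d} -> peak {peak_gb:.2f} GB")
```

The parameter, gradient, and optimizer memory is identical across the three runs; the difference in peak memory is almost entirely activations.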

  • In Distributed Data Parallel (DDP) training, each worker owns a replica of the model and processes a batch of data, then uses all-reduce to sum gradients across workers. While DDP has become very popular, it takes more GPU memory than it needs to because the model weights and optimizer states are replicated across all DDP workers.
  • If you cannot fit the model parameters (p), optimizer states (o), and gradients (g) in device memory, you cannot train using DDP.
  • FSDP (Fully Sharded Data Parallel) shards the parameters, optimizer states, and gradients across workers, so each worker holds only a slice of each. FSDP's GPU memory footprint is smaller than DDP's across all workers on account of this.
  • The lower memory footprint makes training and fine-tuning of large models feasible on lower-spec GPUs. Both PyTorch [1] and DeepSpeed [2] offer FSDP libraries with memory optimizations to tackle this (a minimal PyTorch FSDP sketch follows after this list).
  • The same optimizations also help fit larger batch sizes into the training job. Increasing the batch size can improve your training throughput significantly.
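As a rough illustration of PyTorch's FSDP [1], the sketch below wraps a placeholder model so that parameters, gradients, and optimizer states are sharded across workers rather than replicated as in DDP. It assumes one process per GPU launched via `torchrun`; the model, data, and hyperparameters are stand-ins, not a recipe:

```python
# Launch with: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)  # assumes a single node, so rank == local GPU index

# Placeholder model -- swap in the LLM you are fine-tuning.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 4096)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# unlike DDP, which keeps a full replica on every rank.
model = FSDP(model)

# Build the optimizer *after* wrapping so it tracks the sharded parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(10):
    x = torch.randn(8, 4096, device="cuda")  # stand-in for a real training batch
    loss = model(x).sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

dist.destroy_process_group()
```

With an auto-wrap policy (e.g., wrapping each transformer block separately), the sharding becomes finer-grained and the per-GPU peak drops further; the default shown here shards the whole model as a single unit.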
