When fine-tuning large language models (LLMs) locally, using large batch sizes is often impractical because of their substantial GPU memory consumption. To work around this limitation, a technique called gradient accumulation is commonly used to simulate larger batch sizes. Instead of updating the model weights after processing each batch, gradient accumulation sums the gradients over several smaller mini-batches. The model weights are updated only after a predetermined number of these mini-batches have been processed. This effectively mimics training with a larger batch size without the memory overhead usually associated with it.
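To make the mechanism concrete, here is a minimal sketch of gradient accumulation in plain PyTorch. The toy model, optimizer, and synthetic data are purely illustrative assumptions, not the Transformers Trainer's implementation:

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(16, 4)                    # toy stand-in for an LLM
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

accumulation_steps = 32                     # mini-batches per weight update
mini_batch_size = 1                         # effective batch size = 32

optimizer.zero_grad()
for step in range(128):
    # Synthetic mini-batch, standing in for real training data.
    inputs = torch.randn(mini_batch_size, 16)
    labels = torch.randint(0, 4, (mini_batch_size,))

    loss = loss_fn(model(inputs), labels)
    # Scale the loss so the summed gradients approximate the gradient of the
    # mean loss over the effective batch rather than a sum of mini-batch means.
    (loss / accumulation_steps).backward()  # gradients accumulate in .grad

    # Update the weights only after the predetermined number of mini-batches.
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note the division of the loss by the number of accumulation steps: how (and where) this normalization is applied is exactly the kind of detail that determines whether accumulated training matches full-batch training.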
For example, setting a mini-batch size of 1 and accumulating gradients over 32 mini-batches should be equivalent to training with a full batch size of 32. However, I found that gradient accumulation often leads to significantly degraded performance compared to training with larger actual batch sizes in popular deep-learning frameworks like Transformers.
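For reference, the two configurations being compared might look like this with the Hugging Face Transformers `TrainingArguments` (the output directory is a placeholder):

```python
from transformers import TrainingArguments

# Real batch size of 32, no accumulation.
args_full_batch = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
)

# Mini-batch size of 1 with 32 accumulation steps: in principle the same
# effective batch size of 32, but with far lower peak GPU memory.
args_accumulated = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
)
```

In theory, training runs launched with these two argument sets should produce near-identical loss curves; the discrepancy described above is what suggested a bug.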
After I shared this issue on X and Reddit, Daniel Han from Unsloth AI replicated it. He found that it affected not only gradient accumulation but also multi-GPU setups. In such…