Fixing Gradient Accumulation

Our friends at Unsloth shared an issue regarding gradient accumulation yesterday that affects the transformers Trainer. The initial report comes from @bnjmn_marie (kudos to him!).

Gradient accumulation is supposed to be mathematically equivalent to full-batch training; however, losses did not match between training runs where the setting was toggled on and off.



Where does it stem from?

Within the modeling code of each model, transformers offers a “default” loss function that is the most commonly used one for the model’s task. It is determined by what the modeling class is meant to be used for: question answering, token classification, causal LM, masked LM.

This is the default loss function, and it was not meant to be customizable: it is only computed when labels and input_ids are passed as inputs to the model, so the user doesn’t have to compute the loss. The default loss is useful but is limited by design: if anything different is being done, we expect the labels not to be passed directly, and for users to get the logits back from the model and use them to compute the loss outside of the model.

However, the transformers Trainer, as well as many Trainers, heavily leverages these defaults because of the simplicity they offer: it’s a double-edged sword. Providing a simple API that behaves differently as the use-case changes is not a well-thought-out API, and we were caught by surprise ourselves.

To be precise, for gradient accumulation across token-level tasks like causal LM training, the correct loss should be computed as the total loss across all batches in a gradient accumulation step divided by the total number of non-padding tokens in those batches. This is not the same as the average of the per-batch loss values.
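For example (with hypothetical numbers): if one micro-batch has 10 non-padding tokens with a summed token loss of 12.0 and the next has 40 tokens with a summed loss of 2.0, averaging the per-batch means gives (12.0/10 + 2.0/40) / 2 = 0.625, whereas full-batch training gives (12.0 + 2.0) / 50 = 0.28; the two only agree when every micro-batch contains the same number of tokens.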
The fix is quite simple, see the following:

def ForCausalLMLoss(logits, labels, vocab_size, **kwargs):
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()
    # Shift so that tokens < n predict n
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    shift_logits = shift_logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(shift_logits.device)

    num_items = kwargs.pop("num_items", None)
+   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100, reduction="sum")
+   loss = loss / num_items
-   loss = nn.functional.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
    return loss
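To see why this normalization restores equivalence with full-batch training, here is a minimal, self-contained sketch (hypothetical tensors, not the Trainer’s actual code): summing the per-token losses of each micro-batch and dividing by a shared num_items, the total number of non-padding targets in the accumulation window, yields exactly the loss of one big batch.

import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 11

# Two hypothetical micro-batches of already-shifted logits/labels; -100 marks padding
logits_1, labels_1 = torch.randn(1, 3, vocab_size), torch.tensor([[4, 7, -100]])
logits_2, labels_2 = torch.randn(1, 5, vocab_size), torch.tensor([[1, 2, 3, 4, 5]])

def summed_loss(logits, labels):
    # Same reduction as the fixed ForCausalLMLoss: sum over non-padding tokens
    return nn.functional.cross_entropy(
        logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100, reduction="sum"
    )

# num_items: non-padding targets across the whole accumulation window (2 + 5 = 7)
num_items = sum((l != -100).sum().item() for l in (labels_1, labels_2))

# Accumulated loss with the fix: each micro-batch is divided by the shared num_items
accumulated = summed_loss(logits_1, labels_1) / num_items + summed_loss(logits_2, labels_2) / num_items

# Full-batch loss: mean cross-entropy over the same 7 non-padding tokens
full_batch = nn.functional.cross_entropy(
    torch.cat([logits_1, logits_2], dim=1).view(-1, vocab_size),
    torch.cat([labels_1, labels_2], dim=1).view(-1),
    ignore_index=-100,
)

print(torch.allclose(accumulated, full_batch))  # True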



How we’re fixing it

To address this issue, we’re changing the way our models and training work in two ways:

  • If users are using the “default” loss functions, we will automatically take into account the needed changes when using gradient accumulation, so that the correct loss is reported and used, fixing the core issue at hand.
  • To make sure that any future issues with calculating losses won’t block users, we will expose an API that lets users pass their own loss functions to the Trainer directly, so that they can apply their own fix easily until we have fixed any issues internally and made a new transformers release; a sketch of how that could look follows this list.
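As a rough sketch of what that could look like (the compute_loss_func argument and its (outputs, labels, num_items_in_batch) signature come from the Trainer API being added in the PR linked below and may still change; model and train_dataset are hypothetical placeholders):

import torch.nn as nn
from transformers import Trainer, TrainingArguments

def my_causal_lm_loss(outputs, labels, num_items_in_batch=None):
    # Shift so that tokens < n predict n, then sum the per-token losses
    logits = outputs.logits.float()
    shift_logits = logits[..., :-1, :].contiguous().view(-1, logits.size(-1))
    shift_labels = labels[..., 1:].contiguous().view(-1).to(shift_logits.device)
    loss = nn.functional.cross_entropy(
        shift_logits, shift_labels, ignore_index=-100, reduction="sum"
    )
    # Normalize by the token count the Trainer reports for the accumulation window
    return loss / num_items_in_batch

trainer = Trainer(
    model=model,                  # hypothetical: a causal LM loaded beforehand
    args=TrainingArguments(output_dir="out", gradient_accumulation_steps=4),
    train_dataset=train_dataset,  # hypothetical tokenized dataset
    compute_loss_func=my_causal_lm_loss,
)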

All models that inherit from PreTrainedModel now have a loss_function property, which is determined by either:

  • the config.loss_type: this is to make sure that anyone can use their own custom loss. You can do so by modifying the LOSS_MAPPING:
def my_super_loss(logits, labels):
    return nn.functional.cross_entropy(logits, labels, ignore_index=-100)

LOSS_MAPPING["my_loss_type"] = my_super_loss
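Once registered, a model whose config sets loss_type to “my_loss_type” should pick up my_super_loss through its loss_function property; the exact lookup mechanics are part of the changes in the PRs linked below.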

We’re working to ship the first change for the most popular models in this PR: https://github.com/huggingface/transformers/pull/34191#pullrequestreview-2372725010. Following this, a call for contributions to help propagate this to the rest of the models will go out, so that the vast majority of models are supported by the next release.

We’re also actively working to ship the second change in this PR: https://github.com/huggingface/transformers/pull/34198, which will allow users to use their own loss function and make use of the number of samples seen per batch to help with calculating their loss (and will perform the correct loss calculation during gradient accumulation as more models are supported through the prior change).

By tomorrow, you should expect the Trainer to behave correctly with gradient accumulation. Please install from main in order to benefit from the fix then:

pip install git+https://github.com/huggingface/transformers

In general, we’re very attentive to bug reports submitted to our issue tracker: https://github.com/huggingface/transformers/issues

This issue has been in transformers for a while because it’s mostly a default that should be updated by the end-user; however, when defaults become non-intuitive, they’re bound to be changed. In this instance, we updated the code and shipped a fix in less than 24 hours, which is what we aim for with issues like this one in transformers. Please come and submit your issues if you have some; that’s the only way we can get transformers to improve and fit well within your different use-cases.

The Transformers team 🤗


