
Data Collators in HuggingFace


What they are and what they do

Image from unsplash.com

When I began learning HuggingFace, data collators were one of the least intuitive components for me. I had a hard time understanding them, and I didn't find adequate resources that explain them intuitively.

In this post, we take a look at what data collators are, how they differ, and how to write a custom data collator.

Data collators are an essential part of data processing in HuggingFace. We have all used them after tokenizing the data, and before passing the data to the Trainer object to train the model.

In a nutshell, they put together a list of samples into a mini training batch. What they do depends on the task they are defined for, but at the very least they pad or truncate the input samples so that all samples in a mini-batch are of the same length. Typical mini-batch sizes range from 16 to 256 samples, depending on the model size, data, and hardware constraints.
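
To make this concrete, here is a minimal sketch in plain PyTorch (with made-up token ids and an assumed pad id of 0) of what collating a list of samples into a mini-batch boils down to: pad every sample to the longest length in the batch and stack them into one tensor.

import torch

# Three tokenized samples of different lengths (made-up token ids)
samples = [
    [101, 7592, 2088, 102],
    [101, 2129, 2024, 2017, 1029, 102],
    [101, 2748, 102],
]

pad_id = 0  # assumed padding token id, for illustration only
max_len = max(len(s) for s in samples)

# Pad every sample to the longest length in the batch, then stack into one tensor
input_ids = torch.tensor([s + [pad_id] * (max_len - len(s)) for s in samples])
attention_mask = (input_ids != pad_id).long()

print(input_ids.shape)  # torch.Size([3, 6])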

Data collators are task-specific. There is a data collator for each of the following tasks (the short import sketch after this list shows the usual class for each one):

  • Causal language modeling (CLM)
  • Masked language modeling (MLM)
  • Sequence classification
  • Seq2Seq
  • Token classification
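
In the transformers library, these tasks are typically covered by the following collator classes (causal and masked language modeling share one class, switched by its mlm flag):

from transformers import (
    DataCollatorForLanguageModeling,     # causal LM (mlm=False) and masked LM (mlm=True)
    DataCollatorWithPadding,             # sequence classification
    DataCollatorForSeq2Seq,              # seq2seq tasks such as translation or summarization
    DataCollatorForTokenClassification,  # token classification, e.g. NER
)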

Some data collators are simple. For example, for the sequence classification task, the data collator just needs to pad all sequences in a mini-batch to make sure they are of the same length. It can then concatenate them into one tensor.
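
A rough sketch of that simple case with DataCollatorWithPadding is shown below; the checkpoint name is just an example, and any tokenizer would do:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint
collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Tokenize without padding; the collator pads each mini-batch dynamically
features = [
    {**tokenizer("Hello world"), "label": 1},
    {**tokenizer("How are you doing today?"), "label": 0},
]

batch = collator(features)
print(batch["input_ids"].shape)  # padded to the longest sample in the batch
print(batch["labels"])           # tensor([1, 0])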

Some data collators are quite complex, as they need to handle the data processing for that task. For example, the collator for masked language modeling has to randomly mask tokens in the input and build the corresponding labels.
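
Here is a sketch of that more involved behaviour; the checkpoint name and masking probability below are example values:

from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example checkpoint

# mlm=True: randomly mask tokens and create labels only for the masked positions
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

features = [tokenizer("Data collators batch and mask the inputs.")]
batch = collator(features)

print(batch["input_ids"])  # some token ids replaced by the [MASK] id
print(batch["labels"])     # original ids at masked positions, -100 everywhere else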

Two of the most basic data collators are the following:

1) DefaultDataCollator: This doesn't do any padding or truncation. It assumes all input samples are of the same length. If your input samples are not of the same length, it will throw an error.

import torch
from transformers import AutoTokenizer, DefaultDataCollator

texts = ["Hello world", "How are you?"]

# Tokenize (the checkpoint name is just an example)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# DefaultDataCollator does no padding, so pad to a fixed length while tokenizing
features = [tokenizer(t, padding="max_length", max_length=8) for t in texts]

# Collate the list of samples into one batch of tensors
collator = DefaultDataCollator(return_tensors="pt")
batch = collator(features)
print(batch["input_ids"].shape)  # torch.Size([2, 8])
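
Note that the padding in this example happens inside the tokenizer call. If the two texts were tokenized without padding, DefaultDataCollator would fail when stacking the unequal-length samples into a single tensor; when dynamic padding is needed, DataCollatorWithPadding (shown earlier) is the usual choice.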
