Comprehensive Guide to Datasets and Dataloaders in PyTorch


The complete guide to creating custom datasets and dataloaders for various models in PyTorch

Before you can build a machine learning model, you need to load your data into a dataset. Luckily, PyTorch has many utilities to help with this process (if you are not familiar with PyTorch, I recommend refreshing on the fundamentals here).

PyTorch has good documentation to help with this process, but I have not found any comprehensive documentation or tutorials on custom datasets. I’m going to start with basic premade datasets and then work my way up to creating datasets from scratch for different models!

Before we dive into code for different use cases, let’s understand the difference between the two terms. Generally, you first create your dataset and then create a dataloader. A dataset contains the features and labels of each data point that will be fed into the model. A dataloader is a PyTorch iterable that wraps a dataset and makes it easy to load data with added features such as batching and shuffling.

DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)

The most common arguments in the dataloader are batch_size, shuffle (usually only for the training data), num_workers (to load the data with multiple processes), and pin_memory (to put the fetched data tensors in pinned memory and enable faster data transfer to CUDA-enabled GPUs).
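As a minimal sketch of these arguments in action, the snippet below wraps a small in-memory TensorDataset in two dataloaders. The toy tensors and the batch size of 4 are assumptions chosen only for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy data (hypothetical): 10 samples with 4 features each, plus integer labels.
features = torch.arange(40, dtype=torch.float32).reshape(10, 4)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Training loader: reshuffle every epoch and drop the final, smaller batch.
train_loader = DataLoader(dataset, batch_size=4, shuffle=True, drop_last=True)
# Evaluation loader: keep the original order and keep every sample.
eval_loader = DataLoader(dataset, batch_size=4, shuffle=False)

train_shapes = [tuple(x.shape) for x, _ in train_loader]
eval_shapes = [tuple(x.shape) for x, _ in eval_loader]
print(train_shapes)  # [(4, 4), (4, 4)]
print(eval_shapes)   # [(4, 4), (4, 4), (2, 4)]
```

With 10 samples and a batch size of 4, drop_last=True discards the leftover 2 samples in each training epoch, while the evaluation loader returns them as a final short batch.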

It is recommended to set pin_memory=True instead of returning CUDA tensors from worker processes (num_workers > 0), because of multiprocessing complications with CUDA.
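One way this might look in practice is sketched below: the dataset yields CPU tensors, pinning is enabled only when a GPU is available, and each batch is moved to the device inside the loop. The toy dataset and variable names are illustrative assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")

# The dataset holds CPU tensors; nothing is placed on the GPU yet.
dataset = TensorDataset(torch.randn(8, 3), torch.zeros(8, dtype=torch.long))

# Pinning only helps when a GPU is present, so enable it conditionally.
loader = DataLoader(dataset, batch_size=4, pin_memory=use_cuda)

for x, y in loader:
    # non_blocking=True lets the copy run asynchronously when the source is pinned.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
```

On a machine without CUDA this runs entirely on the CPU and the pinning flag stays off.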

If your dataset is downloaded from online or stored locally, creating the dataset is very simple. I think PyTorch has good documentation on this, so I will be brief.

If you know the dataset is either from PyTorch or PyTorch-compatible, simply call the necessary imports and the dataset of choice:

from torch.utils.data import Dataset
from torchvision import datasets
from torchvision.transforms import ToTensor

data = datasets.CIFAR10('path', train=True, transform=ToTensor())

Each dataset will have unique arguments to pass into it (found here). Generally, these are the path the dataset is stored at, a boolean indicating whether it should be downloaded (conveniently called download), whether it’s training or testing, and any transforms to be applied.

I mentioned at the end of the last section that transforms can be applied to a dataset, but what actually is a transform?

A transform is a way of manipulating data to preprocess an image. There are many different facets to transforms. The most common transform, ToTensor(), converts an image to a tensor (needed as input to any model). Other transforms built into PyTorch (torchvision.transforms) include flipping, rotating, cropping, normalizing, and shifting images. These are typically used so the model generalizes better and doesn’t overfit to the training data. Data augmentation can also be used to artificially increase the size of the dataset if needed.

Beware that most torchvision transforms only accept Pillow images or tensors (not numpy arrays). To convert from numpy, either create a torch tensor or use the following:

from PIL import Image
# assume arr is a numpy array
# you may need to normalize and cast arr to np.uint8 depending on format
img = Image.fromarray(arr)
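Both conversion routes can be sketched as follows, assuming the numpy array is a float image in [0, 1] laid out as (height, width, channels). The array here is random and purely illustrative.

```python
import numpy as np
import torch
from PIL import Image

# Hypothetical input: a float image in [0, 1] with shape (height, width, channels).
arr = np.random.default_rng(0).random((24, 32, 3))

# Pillow expects 8-bit data for RGB images, so rescale and cast first.
arr_uint8 = (arr * 255).round().astype(np.uint8)
img = Image.fromarray(arr_uint8)

# Alternative: build a tensor directly, moving channels first (C, H, W).
tensor = torch.from_numpy(arr).permute(2, 0, 1).float()

print(img.mode, img.size)    # RGB (32, 24)
print(tuple(tensor.shape))   # (3, 24, 32)
```

Note that Pillow reports size as (width, height), while the tensor keeps height before width.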

Transforms can be chained together using torchvision.transforms.Compose. You can combine as many transforms as needed for the dataset. An example is shown below:

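from torchvision import transforms

dataset_transform = transforms.Compose([
    # Normalize expects a tensor, so convert first (ToTensor added here as an assumption)
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
])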

Be sure to pass the saved transform as an argument into the dataset so that it is applied when the dataloader fetches samples.

In most cases of developing your own model, you will need a custom dataset. A common use case would be transfer learning, where you apply your own dataset to a pretrained model.

There are 3 required parts to a PyTorch dataset class: initialization, length, and retrieving an element.

__init__: To initialize the dataset, pass in the raw and labeled data. The best practice is to pass in the raw image data and labeled data separately.

__len__: Return the length of the dataset. Before creating the dataset, the raw and labeled data should be checked to be the same size.

__getitem__: This is where all the data handling occurs to return a given index (idx) of the raw and labeled data. If any transforms need to be applied, the data must be converted to a tensor and transformed. If the initialization contained a path to the dataset, the path must be opened and the data accessed/preprocessed before it can be returned.
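The three parts above can be sketched for the path-based case as follows. PathImageDataset is a hypothetical class written for this illustration, and the images are small throwaway files created in a temporary directory.

```python
import os
import tempfile

import numpy as np
import torch
from PIL import Image
from torch.utils.data import Dataset


class PathImageDataset(Dataset):
    """Hypothetical dataset that stores file paths and loads images lazily."""

    def __init__(self, image_paths, labels):
        # Check up front that every image has a label.
        if len(image_paths) != len(labels):
            raise ValueError("image_paths and labels must be the same length")
        self.image_paths = list(image_paths)
        self.labels = list(labels)

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        # Open the file only when the sample is requested.
        with Image.open(self.image_paths[idx]) as img:
            arr = np.array(img.convert("RGB"))
        # (H, W, C) uint8 -> (C, H, W) float in [0, 1]
        image = torch.from_numpy(arr).permute(2, 0, 1).float() / 255.0
        return image, self.labels[idx]


# Usage with three small throwaway images written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for i in range(3):
    path = os.path.join(tmp_dir, f"img_{i}.png")
    Image.fromarray(np.full((8, 8, 3), i * 100, dtype=np.uint8)).save(path)
    paths.append(path)

dataset = PathImageDataset(paths, labels=[0, 1, 0])
image, label = dataset[1]
print(len(dataset), tuple(image.shape), label)  # 3 (3, 8, 8) 1
```

Loading inside __getitem__ keeps memory use low, since only the requested image is read from disk.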

Example dataset for a semantic segmentation model:

import torch
from torch.utils.data import Dataset


class ExampleDataset(Dataset):
    """Example dataset"""

    def __init__(self, raw_img, data_mask, transform=None):
        self.raw_img = raw_img
        self.data_mask = data_mask
        self.transform = transform

    def __len__(self):
        return len(self.raw_img)

    def __getitem__(self, idx):
        # Allow indexing with a tensor as well as a plain int
        if torch.is_tensor(idx):
            idx = idx.tolist()

        image = self.raw_img[idx]
        mask = self.data_mask[idx]

        sample = {'image': image, 'mask': mask}

        # The transform receives the whole dict, so it must handle both entries
        if self.transform:
            sample = self.transform(sample)

        return sample
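Because the transform receives the whole sample dictionary, it has to handle the image and mask together. The sketch below shows one possible way to use the dataset with such a transform and a dataloader. RandomHorizontalFlipPair is a hypothetical transform written for this example, and the toy tensors stand in for real images and masks.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ExampleDataset(Dataset):
    """Compact copy of the dataset above so this sketch runs on its own."""

    def __init__(self, raw_img, data_mask, transform=None):
        self.raw_img, self.data_mask, self.transform = raw_img, data_mask, transform

    def __len__(self):
        return len(self.raw_img)

    def __getitem__(self, idx):
        sample = {'image': self.raw_img[idx], 'mask': self.data_mask[idx]}
        return self.transform(sample) if self.transform else sample


class RandomHorizontalFlipPair:
    """Hypothetical transform: flips image and mask together with probability p."""

    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, sample):
        image, mask = sample['image'], sample['mask']
        if torch.rand(1).item() < self.p:
            # Flip the last (width) dimension of both so pixels stay aligned.
            image, mask = image.flip(-1), mask.flip(-1)
        return {'image': image, 'mask': mask}


# Toy data: 6 RGB images with matching single-channel integer masks.
images = torch.rand(6, 3, 16, 16)
masks = torch.randint(0, 2, (6, 1, 16, 16))

dataset = ExampleDataset(images, masks, transform=RandomHorizontalFlipPair(p=1.0))
loader = DataLoader(dataset, batch_size=3)

batch = next(iter(loader))  # the default collate function stacks each dict entry
print(tuple(batch['image'].shape), tuple(batch['mask'].shape))
# (3, 3, 16, 16) (3, 1, 16, 16)
```

Applying the same random decision to both entries is what keeps each mask aligned with its image. Standard torchvision transforms applied separately to the image and the mask would not guarantee that.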

