Efficient MultiModal Data Pipeline




You’ve got everything ready – data, model, a beefy GPU setup. You hit “run” and… wait. And wait some more. Your GPUs are barely breaking a sweat while your wallet’s getting lighter by the hour.

Sound familiar? We have been there. After some detective work on our nanoVLM project, we discovered the real culprit wasn’t our model or hardware, it was our data pipeline being incredibly wasteful.

Here’s what we found:

  1. Idle GPUs: Our model was literally waiting around for data to show up
  2. Padding hell: Every batch was full of useless padding tokens that contributed nothing to training

In this post we build an efficient pipeline in five stages. In each stage we add to or remove from the previous step and comment on what went right and what didn’t.






[Stage 0] Preparation

To make it easier to follow the data preparation steps, we created a separate repo laser-focused on the data pipeline only. We hope this will be much easier to understand than reading the code once integrated with the nanoVLM repository. In addition, this could be useful to bootstrap other data pipelines!

Repository: https://github.com/ariG23498/mmdp

To follow along, all you need to do is clone the repository. It contains the final data preparation pipeline, but it’s designed to showcase each step of the way.

$ git clone https://github.com/ariG23498/mmdp.git



[Stage 1] Visualising the Dataset

Before optimizing anything, we need to understand what we’re working with. Our multimodal dataset has images, text prompts, and responses.

$ uv run 01_check_dataset.py

Dataset Sample

Getting acquainted with your training data is crucial for success. The previous script shows a random sample every time you run it; you may want to copy the snippet to a notebook and run it multiple times to get a feel for the data.



[Stage 2] Naive Padding

Our first training attempt used the obvious (and very common) approach:

  • Tokenize everything
  • Find the longest sequence in each batch
  • Pad everything else to match
$ uv run 02_naive_pad_dataloader.py

The results were painful. Look at this visualization:

Naive Padding Waste

See all that gray? That’s padding. That’s the GPU processing absolutely nothing while you pay for compute time. We were wasting roughly 60% of our batch on empty tokens.
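To put a number on that waste, here is a back-of-the-envelope sketch (toy sequence lengths of our own invention, not actual tokenizer output) that pads to the longest sequence in the batch and measures the padding fraction:

```python
def pad_to_longest(lengths):
    """Naive batching: every sequence is padded to the longest one."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)  # tokens the GPU actually processes
    real_tokens = sum(lengths)            # tokens that contribute to training
    padding_ratio = 1 - real_tokens / total_slots
    return max_len, padding_ratio

# One long outlier drags the whole batch up to its length.
max_len, waste = pad_to_longest([100, 20, 30, 10])
print(max_len, round(waste, 2))  # → 100 0.6
```

A single long sample is enough to push the waste past 50% – exactly the ~60% we were seeing.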



[Stage 3] Constrained Padding

Our next move was simple: set a global maximum length and stick to it. If a sample was too long, we would just drop it.

Constrained Padding

As you may have noticed, the batch now has one sample fewer. That is because of the filtering process. This helped, but we were still padding everything to the same fixed length regardless of actual content. Better than before, but still wasteful.
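A sketch of that idea (the cap of 100 and the integer lengths are our own toy values standing in for tokenized samples):

```python
MAX_LENGTH = 100  # global cap, chosen here for illustration

def constrained_pad(lengths, max_length=MAX_LENGTH):
    """Drop over-long samples, then pad every survivor up to the fixed cap."""
    kept = [n for n in lengths if n <= max_length]
    padding = [max_length - n for n in kept]
    return kept, padding

kept, padding = constrained_pad([100, 120, 30, 10])
print(kept, padding)  # → [100, 30, 10] [0, 70, 90]
```

The 120-token sample silently disappears – the “one sample fewer” effect – while the short samples still carry heavy padding up to the cap.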



[Stage 4] Packing Smarter with Knapsacks

Now we’re ready to rethink batching entirely. Padding is the enemy, and we need a way to minimize it while maximizing how much data we can fit into each batch. Enter the knapsack problem, a classic from computer science that’s perfect for this.

Imagine you’re packing a backpack for a hike. It can only hold so much weight, and you want to cram in as many useful items as possible. In our case:

  • The backpack is a training batch with a maximum token limit (max_length).
  • Each item is a sequence (a tokenized prompt-response pair), and its weight is the number of tokens.
  • Our goal is to pack as many sequences as possible into the batch without going over the token limit, minimizing wasted space.

To test this concept, we start with a toy dataset: just a list of numbers from 1 to 25, each representing a sequence length. This lets us experiment without the complexity of images and text.



Switching to an Iterable Dataset

Most PyTorch datasets are map-style (you access them with dataset[i]). But for dynamic batching, we need something more flexible. So, we built an iterable-style dataset by subclassing torch.utils.data.IterableDataset. This lets us generate batches on the fly and handle tricks like sharding data across multiple workers:

import math

from torch.utils.data import get_worker_info

def _get_data_range(self):
    worker_info = get_worker_info()
    if worker_info is None:  # single-process loading: use the full range
        return self.start, self.end
    else:  # split the range evenly across workers
        per_worker = int(
            math.ceil((self.end - self.start) / worker_info.num_workers)
        )
        worker_id = worker_info.id
        iter_start = self.start + worker_id * per_worker
        iter_end = min(iter_start + per_worker, self.end)
        return iter_start, iter_end
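The arithmetic in _get_data_range can be checked on its own. A pure-Python sketch of the same split (function name is ours, not from the repo):

```python
import math

def shard_range(start, end, num_workers, worker_id):
    """Give each worker a contiguous, non-overlapping slice of [start, end)."""
    per_worker = int(math.ceil((end - start) / num_workers))
    iter_start = start + worker_id * per_worker
    iter_end = min(iter_start + per_worker, end)
    return iter_start, iter_end

# 10 samples across 3 workers: every index is covered exactly once,
# with the last worker picking up the shorter remainder.
shards = [shard_range(0, 10, 3, w) for w in range(3)]
print(shards)  # → [(0, 4), (4, 8), (8, 10)]
```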



Producer-Consumer Magic

Packing sequences can be slow, especially if we’re sorting or shuffling. To keep things moving, we use a producer-consumer pattern built on Python queues:

def _producer(self, data_iter, queue, stop_signal):
    if self.strategy == "greedy":
        for pack in self._greedy_packing(data_iter):
            queue.put(pack)
    elif self.strategy == "binpack":
        while True:
            # pull a buffer of samples, pack it, push the packs downstream
            buffer = list(itertools.islice(data_iter, self.buffer_size))
            if not buffer:
                break
            knapsacks = self._bin_packing(buffer)
            for pack in knapsacks:
                queue.put(pack)
    queue.put(stop_signal)  # tell the consumer the stream is finished

The producer thread packs batches and puts them on a queue, while the main thread pulls them off as needed. This overlap keeps the pipeline flowing smoothly.
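A minimal, self-contained version of that pattern – plain threading.Thread and queue.Queue with a sentinel object as the stop signal (the names here are ours; the repo wires the same idea into its dataset class):

```python
import queue
import threading

STOP = object()  # sentinel: tells the consumer the stream is exhausted

def producer(packs, q):
    for pack in packs:
        q.put(pack)  # blocks when the queue is full (natural backpressure)
    q.put(STOP)

q = queue.Queue(maxsize=2)  # bounded, so the producer can't run far ahead
t = threading.Thread(target=producer, args=([[1, 2], [3, 4, 5], [6]], q))
t.start()

batches = []
while (item := q.get()) is not STOP:
    batches.append(item)  # a training step would consume the batch here
t.join()
print(batches)  # → [[1, 2], [3, 4, 5], [6]]
```

The bounded queue is the key design choice: it lets packing and training overlap without letting the producer buffer unbounded amounts of data.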



Greedy Packing

First, we try a simple greedy packing strategy:

def _greedy_packing(self, iterator):
    pack, pack_sum = [], 0
    for item in iterator:
        if item > self.max_length:
            continue  # item can never fit in any pack: drop it
        if pack_sum + item <= self.max_length:
            pack.append(item)
            pack_sum += item
        else:
            yield pack
            pack = [item]
            pack_sum = item
    if pack:
        yield pack

This walks through the data sequentially, adding items to a pack until it’s full, then starting a new one. It’s fast but not perfect. Here’s what the batches look like:

=== Strategy: GREEDY ===
[tensor([1]), tensor([2]), tensor([3]), tensor([4]), tensor([5]), tensor([6]), tensor([7]), tensor([8]), tensor([9]), tensor([10]), tensor([11]), tensor([12]), tensor([13])]
[tensor([14]), tensor([15]), tensor([16]), tensor([17]), tensor([18]), tensor([19])]
[tensor([20]), tensor([21]), tensor([22]), tensor([23])]
[tensor([24])]

Greedy Knapsack

Notice how later batches get sparse? We’re leaving gaps.
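The same greedy walk, written as a standalone function so the gaps can be measured (a max_length of 100 is inferred from the pack sums above):

```python
def greedy_packing(items, max_length):
    """Fill a pack left to right; start a new one when the next item overflows."""
    pack, pack_sum = [], 0
    for item in items:
        if item > max_length:
            continue  # an item that can never fit is dropped
        if pack_sum + item <= max_length:
            pack.append(item)
            pack_sum += item
        else:
            yield pack
            pack, pack_sum = [item], item
    if pack:
        yield pack

packs = list(greedy_packing(range(1, 25), max_length=100))
sums = [sum(p) for p in packs]
print(sums)  # → [91, 99, 86, 24] — four packs, the last one nearly empty
```

Overall utilization is 300 / 400 = 75%: the long tail items each strand a pack.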



Bin-Packing for Tighter Matches

Let’s try a smarter approach: bin-packing (specifically, First Fit Decreasing):

from typing import List

def _bin_packing(self, buffer: List[int]):
    buffer = sorted(buffer, reverse=True)
    knapsacks = []
    for item in buffer:
        for pack in knapsacks:
            if sum(pack) + item <= self.max_length:
                pack.append(item)
                break
        else:  # no existing pack has room: open a new one
            knapsacks.append([item])
    return knapsacks

This sorts sequences by length (longest first) and tries to fit each one into the first pack that has room. If none fits, it starts a new pack. The result?

=== Strategy: BINPACK ===
[tensor([24]), tensor([23]), tensor([22]), tensor([21]), tensor([10])]
[tensor([20]), tensor([19]), tensor([18]), tensor([17]), tensor([16]), tensor([9]), tensor([1])]
[tensor([15]), tensor([14]), tensor([13]), tensor([12]), tensor([11]), tensor([8]), tensor([7]), tensor([6]), tensor([5]), tensor([4]), tensor([3]), tensor([2])]

Tight

These batches are much tighter, with less wasted space. It’s like playing Tetris with your data, fitting pieces together snugly.
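Running First Fit Decreasing over the same 1–24 toy data (as a standalone function returning its knapsacks) confirms the difference: three perfectly full packs instead of four ragged ones:

```python
def bin_packing(buffer, max_length):
    """First Fit Decreasing: sort longest-first, drop each item into the
    first pack with room, otherwise open a new pack."""
    knapsacks = []
    for item in sorted(buffer, reverse=True):
        for pack in knapsacks:
            if sum(pack) + item <= max_length:
                pack.append(item)
                break
        else:
            knapsacks.append([item])
    return knapsacks

packs = bin_packing(list(range(1, 25)), max_length=100)
print([sum(p) for p in packs])  # → [100, 100, 100]
```

100% utilization on this toy data versus 75% for greedy: sorting longest-first lets the small items fill the leftover gaps.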



[Stage 5] Knapsacks for Multimodal Data

Now for the actual deal, applying knapsack packing to our multimodal dataset.

We’re back to images, prompts, and responses, and we need to pack them efficiently while respecting both token limits and image budgets. Image budgeting is done so that images per sample are balanced; we would like to avoid the case where one GPU has to process far more images than another.

Our new ConstantLengthDataset class handles the heavy lifting. Here’s how it works, compared to Stage 4:

| Concept | Stage 4 (Toy Data) | Stage 5 (Multimodal Data) | Function(s) |
| --- | --- | --- | --- |
| Item | Integer (sequence length) | Full sample (image, prompt, response) | VQADataset.__getitem__ |
| Weight | The integer itself | Number of tokens (len(input_ids)) | — |
| Knapsack | Batch of integers ≤ max_length | Batch of samples ≤ seq_length and image limit | _balanced_greedy_knapsack |
| Packing Strategy | Greedy or Binpack | Greedy packing with token and image constraints | _balanced_greedy_knapsack |
| Producer-Consumer | Producer fills queue | Same as the toy example, but with multimodal samples | _producer, __iter__ |
| Sample Filtering | Skip integers > max_length | Skip samples with too many tokens or images | _producer |
| Sharding | Split integer range | Shard dataset indices | make_base_iterator() |
| Batching | Group integers | Concatenate and align tokens/images | _pack_one_group |
| Output | List of integers | Dict with input_ids, labels, attention_mask, images | yield from __iter__ |

The ConstantLengthDataset does it all:

  • Reads samples (images and text).
  • Filters out samples that are too long or have too many images.
  • Packs samples into batches using a greedy knapsack strategy, balancing token count and image count.
  • Pads the final batches to a fixed length, but with far less padding than before.
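The dual-budget packing can be sketched as follows. This is our own simplified stand-in for the repo’s _balanced_greedy_knapsack, with (tokens, images) pairs standing in for real samples; both budgets must hold for an item to join the current pack:

```python
def balanced_greedy_knapsack(samples, max_tokens, max_images):
    """Greedy packing under two budgets at once: tokens and images."""
    packs, pack, tok_sum, img_sum = [], [], 0, 0
    for tokens, images in samples:
        if tokens > max_tokens or images > max_images:
            continue  # filter: this sample can never fit in any pack
        if tok_sum + tokens <= max_tokens and img_sum + images <= max_images:
            pack.append((tokens, images))
            tok_sum += tokens
            img_sum += images
        else:
            packs.append(pack)
            pack, tok_sum, img_sum = [(tokens, images)], tokens, images
    if pack:
        packs.append(pack)
    return packs

# (50, 2) busts the first pack's token budget; (40, 1) then fits the second
# pack's token budget but busts its image budget, so it opens a third pack —
# keeping images per pack balanced across GPUs.
packs = balanced_greedy_knapsack(
    [(60, 1), (30, 1), (50, 2), (40, 1)], max_tokens=100, max_images=2
)
print(packs)  # → [[(60, 1), (30, 1)], [(50, 2)], [(40, 1)]]
```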

Here’s the result:

Knapsack Padding

Look at that! The gray (padding) is minimal, and the batches are dense with useful data. It’s like packing a suitcase so well you can still zip it up without sitting on it.

The image might seem unintuitive at first glance, but let us look at it side by side with constrained padding.

Knapsack Constrained
Knapsack Padding Constrained Padding

Here you’ll notice that the samples in the knapsack version are more evenly distributed. We also don’t run into the problem of having fewer samples in the batch due to filtering.



Conclusion

What began as a simple “why is training so slow?” investigation led to a complete rethink of how we handle multimodal data.

The balanced knapsack strategy for the data pipeline comes from the Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models paper from NVIDIA.

The key lessons:

  • Padding everything to the longest sequence is a fine first approach (but wasteful)
  • Think of batching as a packing problem
  • Consider all your constraints (text length, image memory, etc.)
  • Test with toy data first to validate your approach

Want to dig deeper? Check out the mmdp repository and the nanoVLM project.

Happy training (and may your GPUs stay busy)!


