Efficient MultiModal Data Pipeline




You’ve got everything ready – data, model, a beefy GPU setup. You hit “run” and… wait. And wait some more. Your GPUs are barely breaking a sweat while your wallet’s getting lighter by the hour.

Sound familiar? We have been there. After some detective work on our nanoVLM project, we discovered the real culprit wasn’t our model or hardware, it was our data pipeline being incredibly wasteful.

Here’s what we found:

  1. Idle GPUs: Our model was literally waiting around for data to show up
  2. Padding hell: Every batch was full of useless padding tokens that contributed nothing to training

In this post we build an efficient pipeline in five stages. In each stage we add to or remove from the previous step and comment on what went right and what didn’t.






[Stage 0] Preparation

To make it easier to follow the data preparation steps, we created a separate repo laser-focused on the data pipeline only. We hope this will be much easier to understand than reading the code once integrated with the nanoVLM repository. In addition, this could be useful to bootstrap other data pipelines!

Repository: https://github.com/ariG23498/mmdp

To follow along, all you need to do is clone the repository. It contains the final data preparation pipeline, but it’s designed to showcase each step of the way.

$ git clone https://github.com/ariG23498/mmdp.git



[Stage 1] Visualising the Dataset

Before optimizing anything, we need to understand what we’re working with. Our multimodal dataset has images, text prompts, and responses.

$ uv run 01_check_dataset.py

Dataset Sample

Getting acquainted with your training data is crucial for success. The previous script shows a random sample every time you run it; you may want to copy the snippet to a notebook and run it multiple times to get a feel for the data.



[Stage 2] Naive Padding

Our first training attempt used the obvious (and very common) approach:

  • Tokenize everything
  • Find the longest sequence in each batch
  • Pad everything else to match
$ uv run 02_naive_pad_dataloader.py

The results were painful. Look at this visualization:

Naive Padding Waste

See all that gray? That’s padding. That’s the GPU processing absolutely nothing while you pay for compute time. We were wasting roughly 60% of our batch on empty tokens.
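To put a number on that waste, here is a back-of-the-envelope sketch (toy sequence lengths of our own invention, not actual tokenizer output) that pads to the longest sequence in the batch and measures the padding fraction:

```python
def pad_to_longest(lengths):
    """Naive batching: every sequence is padded to the longest one."""
    max_len = max(lengths)
    total_slots = max_len * len(lengths)  # tokens the GPU actually processes
    real_tokens = sum(lengths)            # tokens that contribute to training
    padding_ratio = 1 - real_tokens / total_slots
    return max_len, padding_ratio

# One long outlier drags the whole batch up to its length.
max_len, waste = pad_to_longest([100, 20, 30, 10])
print(max_len, round(waste, 2))  # → 100 0.6
```

A single long sample is enough to push the waste past 50% – exactly the ~60% we were seeing.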



[Stage 3] Constrained Padding

Our next move was simple: set a global maximum length and stick to it. If a sample was too long, we would just drop it.

Constrained Padding

As you may have noticed, the batch now has one sample fewer. That is because of the filtering process. This helped, but we were still padding everything to the same fixed length regardless of actual content. Better than before, but still wasteful.
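A sketch of that idea (the cap of 100 and the integer lengths are our own toy values standing in for tokenized samples):

```python
MAX_LENGTH = 100  # global cap, chosen here for illustration

def constrained_pad(lengths, max_length=MAX_LENGTH):
    """Drop over-long samples, then pad every survivor up to the fixed cap."""
    kept = [n for n in lengths if n <= max_length]
    padding = [max_length - n for n in kept]
    return kept, padding

kept, padding = constrained_pad([100, 120, 30, 10])
print(kept, padding)  # → [100, 30, 10] [0, 70, 90]
```

The 120-token sample silently disappears – the “one sample fewer” effect – while the short samples still carry heavy padding up to the cap.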



[Stage 4] Packing Smarter with Knapsacks

Now we’re ready to rethink batching entirely. Padding is the enemy, and we need a way to minimize it while maximizing how much data we can fit into each batch. Enter the knapsack problem, a classic from computer science that’s perfect for this.

Imagine you’re packing a backpack for a hike. It can only hold so much weight, and you want to cram in as many useful items as possible. In our case:

  • The backpack is a training batch with a maximum token limit (max_length).
  • Each item is a sequence (a tokenized prompt-response pair), and its weight is the number of tokens.
  • Our goal is to pack as many sequences as possible into the batch without going over the token limit, minimizing wasted space.

To test this concept, we start with a toy dataset: just a list of numbers from 1 to 25, each representing a sequence length. This lets us experiment without the complexity of images and text.



Switching to an Iterable Dataset

Most PyTorch datasets are map-style (you access them with dataset[i]). But for dynamic batching, we need something more flexible. So, we built an iterable-style dataset by subclassing torch.utils.data.IterableDataset. This lets us generate batches on the fly and handle tricks like sharding data across multiple workers:

import math

from torch.utils.data import get_worker_info

def _get_data_range(self):
    worker_info = get_worker_info()
    if worker_info is None:  # single-process loading: use the full range
        return self.start, self.end
    else:  # split the range evenly across workers
        per_worker = int(
            math.ceil((self.end - self.start) / worker_info.num_workers)
        )
        worker_id = worker_info.id
        iter_start = self.start + worker_id * per_worker
        iter_end = min(iter_start + per_worker, self.end)
        return iter_start, iter_end
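The arithmetic in _get_data_range can be checked on its own. A pure-Python sketch of the same split (function name is ours, not from the repo):

```python
import math

def shard_range(start, end, num_workers, worker_id):
    """Give each worker a contiguous, non-overlapping slice of [start, end)."""
    per_worker = int(math.ceil((end - start) / num_workers))
    iter_start = start + worker_id * per_worker
    iter_end = min(iter_start + per_worker, end)
    return iter_start, iter_end

# 10 samples across 3 workers: every index is covered exactly once,
# with the last worker picking up the shorter remainder.
shards = [shard_range(0, 10, 3, w) for w in range(3)]
print(shards)  # → [(0, 4), (4, 8), (8, 10)]
```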



Producer-Consumer Magic

Packing sequences can be slow, especially if we’re sorting or shuffling. To keep things moving, we use a producer-consumer pattern built on Python queues:

def _producer(self, data_iter, queue, stop_signal):
    if self.strategy == "greedy":
        for pack in self._greedy_packing(data_iter):
            queue.put(pack)
    elif self.strategy == "binpack":
        while True:
            # pull a buffer of samples, pack it, push the packs downstream
            buffer = list(itertools.islice(data_iter, self.buffer_size))
            if not buffer:
                break
            knapsacks = self._bin_packing(buffer)
            for pack in knapsacks:
                queue.put(pack)
    queue.put(stop_signal)  # tell the consumer the stream is finished

The producer thread packs batches and puts them on a queue, while the main thread pulls them off as needed. This overlap keeps the pipeline flowing smoothly.
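A minimal, self-contained version of that pattern – plain threading.Thread and queue.Queue with a sentinel object as the stop signal (the names here are ours; the repo wires the same idea into its dataset class):

```python
import queue
import threading

STOP = object()  # sentinel: tells the consumer the stream is exhausted

def producer(packs, q):
    for pack in packs:
        q.put(pack)  # blocks when the queue is full (natural backpressure)
    q.put(STOP)

q = queue.Queue(maxsize=2)  # bounded, so the producer can't run far ahead
t = threading.Thread(target=producer, args=([[1, 2], [3, 4, 5], [6]], q))
t.start()

batches = []
while (item := q.get()) is not STOP:
    batches.append(item)  # a training step would consume the batch here
t.join()
print(batches)  # → [[1, 2], [3, 4, 5], [6]]
```

The bounded queue is the key design choice: it lets packing and training overlap without letting the producer buffer unbounded amounts of data.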



Greedy Packing

First, we try a simple greedy packing strategy:

def _greedy_packing(self, iterator):
    pack, pack_sum = [], 0
    for item in iterator:
        if item > self.max_length:
            continue  # item can never fit in any pack: drop it
        if pack_sum + item <= self.max_length:
            pack.append(item)
            pack_sum += item
        else:
            yield pack
            pack = [item]
            pack_sum = item
    if pack:
        yield pack

This walks through the data sequentially, adding items to a pack until it’s full, then starting a new one. It’s fast but not perfect. Here’s what the batches look like:

=== Strategy: GREEDY ===
[tensor([1]), tensor([2]), tensor([3]), tensor([4]), tensor([5]), tensor([6]), tensor([7]), tensor([8]), tensor([9]), tensor([10]), tensor([11]), tensor([12]), tensor([13])]
[tensor([14]), tensor([15]), tensor([16]), tensor([17]), tensor([18]), tensor([19])]
[tensor([20]), tensor([21]), tensor([22]), tensor([23])]
[tensor([24])]

Greedy Knapsack

Notice how later batches get sparse? We’re leaving gaps.
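The same greedy walk, written as a standalone function so the gaps can be measured (a max_length of 100 is inferred from the pack sums above):

```python
def greedy_packing(items, max_length):
    """Fill a pack left to right; start a new one when the next item overflows."""
    pack, pack_sum = [], 0
    for item in items:
        if item > max_length:
            continue  # an item that can never fit is dropped
        if pack_sum + item <= max_length:
            pack.append(item)
            pack_sum += item
        else:
            yield pack
            pack, pack_sum = [item], item
    if pack:
        yield pack

packs = list(greedy_packing(range(1, 25), max_length=100))
sums = [sum(p) for p in packs]
print(sums)  # → [91, 99, 86, 24] — four packs, the last one nearly empty
```

Overall utilization is 300 / 400 = 75%: the long tail items each strand a pack.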



Bin-Packing for Tighter Matches

Let’s try a smarter approach: bin-packing (specifically, First Fit Decreasing):

from typing import List

def _bin_packing(self, buffer: List[int]):
    buffer = sorted(buffer, reverse=True)
    knapsacks = []
    for item in buffer:
        for pack in knapsacks:
            if sum(pack) + item <= self.max_length:
                pack.append(item)
                break
        else:  # no existing pack has room: open a new one
            knapsacks.append([item])
    return knapsacks

This sorts sequences by length (longest first) and tries to fit each one into the first pack that has room. If none fits, it starts a new pack. The result?

=== Strategy: BINPACK ===
[tensor([24]), tensor([23]), tensor([22]), tensor([21]), tensor([10])]
[tensor([20]), tensor([19]), tensor([18]), tensor([17]), tensor([16]), tensor([9]), tensor([1])]
[tensor([15]), tensor([14]), tensor([13]), tensor([12]), tensor([11]), tensor([8]), tensor([7]), tensor([6]), tensor([5]), tensor([4]), tensor([3]), tensor([2])]

Tight

These batches are much tighter, with less wasted space. It’s like playing Tetris with your data, fitting pieces together snugly.
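Running First Fit Decreasing over the same 1–24 toy data (as a standalone function returning its knapsacks) confirms the difference: three perfectly full packs instead of four ragged ones:

```python
def bin_packing(buffer, max_length):
    """First Fit Decreasing: sort longest-first, drop each item into the
    first pack with room, otherwise open a new pack."""
    knapsacks = []
    for item in sorted(buffer, reverse=True):
        for pack in knapsacks:
            if sum(pack) + item <= max_length:
                pack.append(item)
                break
        else:
            knapsacks.append([item])
    return knapsacks

packs = bin_packing(list(range(1, 25)), max_length=100)
print([sum(p) for p in packs])  # → [100, 100, 100]
```

100% utilization on this toy data versus 75% for greedy: sorting longest-first lets the small items fill the leftover gaps.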



[Stage 5] Knapsacks for Multimodal Data

Now for the actual deal, applying knapsack packing to our multimodal dataset.

We’re back to images, prompts, and responses, and we need to pack them efficiently while respecting both token limits and image budgets. Image budgeting is done so that images per sample are balanced; we would like to avoid the case where one GPU has to process far more images than another.

Our new ConstantLengthDataset class handles the heavy lifting. Here’s how it works, compared to Stage 4:

| Concept | Stage 4 (Toy Data) | Stage 5 (Multimodal Data) | Function(s) |
| --- | --- | --- | --- |
| Item | Integer (sequence length) | Full sample (image, prompt, response) | VQADataset.__getitem__ |
| Weight | The integer itself | Number of tokens (len(input_ids)) | — |
| Knapsack | Batch of integers ≤ max_length | Batch of samples ≤ seq_length and image limit | _balanced_greedy_knapsack |
| Packing Strategy | Greedy or Binpack | Greedy packing with token and image constraints | _balanced_greedy_knapsack |
| Producer-Consumer | Producer fills queue | Same as the toy example, but with multimodal samples | _producer, __iter__ |
| Sample Filtering | Skip integers > max_length | Skip samples with too many tokens or images | _producer |
| Sharding | Split integer range | Shard dataset indices | make_base_iterator() |
| Batching | Group integers | Concatenate and align tokens/images | _pack_one_group |
| Output | List of integers | Dict with input_ids, labels, attention_mask, images | yield from __iter__ |

The ConstantLengthDataset does it all:

  • Reads samples (images and text).
  • Filters out samples that are too long or have too many images.
  • Packs samples into batches using a greedy knapsack strategy, balancing token count and image count.
  • Pads the final batches to a fixed length, but with far less padding than before.
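The dual-budget packing can be sketched as follows. This is our own simplified stand-in for the repo’s _balanced_greedy_knapsack, with (tokens, images) pairs standing in for real samples; both budgets must hold for an item to join the current pack:

```python
def balanced_greedy_knapsack(samples, max_tokens, max_images):
    """Greedy packing under two budgets at once: tokens and images."""
    packs, pack, tok_sum, img_sum = [], [], 0, 0
    for tokens, images in samples:
        if tokens > max_tokens or images > max_images:
            continue  # filter: this sample can never fit in any pack
        if tok_sum + tokens <= max_tokens and img_sum + images <= max_images:
            pack.append((tokens, images))
            tok_sum += tokens
            img_sum += images
        else:
            packs.append(pack)
            pack, tok_sum, img_sum = [(tokens, images)], tokens, images
    if pack:
        packs.append(pack)
    return packs

# (50, 2) busts the first pack's token budget; (40, 1) then fits the second
# pack's token budget but busts its image budget, so it opens a third pack —
# keeping images per pack balanced across GPUs.
packs = balanced_greedy_knapsack(
    [(60, 1), (30, 1), (50, 2), (40, 1)], max_tokens=100, max_images=2
)
print(packs)  # → [[(60, 1), (30, 1)], [(50, 2)], [(40, 1)]]
```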

Here’s the result:

Knapsack Padding

Look at that! The gray (padding) is minimal, and the batches are dense with useful data. It’s like packing a suitcase so well you can still zip it up without sitting on it.

The image might seem unintuitive at first glance, but let us look at it side by side with constrained padding.

Knapsack Constrained
Knapsack Padding Constrained Padding

Here you’ll notice that the samples in the knapsack version are more evenly distributed. We also don’t run into the problem of having fewer samples in the batch due to filtering.



Conclusion

What began as a simple “why is training so slow?” investigation led to a complete rethink of how we handle multimodal data.

The balanced knapsack strategy for the data pipeline comes from the Eagle 2: Building Post-Training Data Strategies from Scratch for Frontier Vision-Language Models paper from NVIDIA.

The key lessons:

  • Padding everything to the longest sequence is a fine first approach (but wasteful)
  • Think of batching as a packing problem
  • Consider all your constraints (text length, image memory, etc.)
  • Test with toy data first to validate your approach

Want to dig deeper? Check out the mmdp repository and the nanoVLM project.

Happy training (and may your GPUs stay busy)!


