I participated in the Mostly AI Prize and won both the FLAT and SEQUENTIAL data challenges. The competition was a fantastic learning experience, and in this post, I want to share some insights into my winning solution.
The Competition
The goal of the competition was to generate a synthetic dataset with the same statistical properties as a source dataset, without copying the data.
The competition was split into two independent challenges:
- FLAT Data Challenge: Generate 100,000 records with 80 columns.
- SEQUENTIAL Data Challenge: Generate 20,000 sequences (groups) of records.
To measure the quality of the synthetic data, the competition used an Overall Accuracy metric. This score measures the similarity between the synthetic and source distributions for single columns (univariates), pairs of columns (bivariates), and triples of columns (trivariates) using the L1 distance. In addition, privacy metrics like DCR (Distance to Closest Record) and NNDR (Nearest Neighbor Distance Ratio) were used to ensure submissions weren't just overfitting on or copying the training data.
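To make the metric concrete, here is a small illustrative sketch (not the official implementation) of how an L1-based similarity between two bivariate distributions could be computed; the real metric differs in details such as the binning of numeric columns and how the uni-, bi-, and trivariate scores are aggregated.

import numpy as np
import pandas as pd

def bivariate_l1_accuracy(real: pd.DataFrame, synth: pd.DataFrame, col_a: str, col_b: str) -> float:
    """Toy L1-based similarity between the joint distributions of two columns:
    1.0 means identical, 0.0 means completely disjoint."""
    # Joint relative frequencies over all category combinations of the two columns
    p = real.groupby([col_a, col_b]).size() / len(real)
    q = synth.groupby([col_a, col_b]).size() / len(synth)
    p, q = p.align(q, fill_value=0.0)        # align on the union of category pairs
    return 1.0 - 0.5 * np.abs(p - q).sum()   # half the L1 distance, flipped into a similarity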

Solution Design
Initially, my goal was to create an ensemble of multiple different state-of-the-art models and combine their generated data. I experimented quite a bit with different models, but the results didn't improve as much as I had hoped.
I pivoted my approach and focused on post-processing. First, I trained a single generative model with the Mostly AI SDK, and instead of generating the required number of samples for the submission, I oversampled to create a large pool of candidate samples. From this pool, I then selected the final output in a way that matches the statistical properties of the source dataset much more closely.
This approach led to a considerable jump in the leaderboard score. For the FLAT data challenge, the raw synthetic data from the model scored around 0.96, but after post-processing, the score jumped to 0.992. I used a modified version of this approach for the SEQUENTIAL data challenge, which yielded a similar improvement.
My final pipeline for the FLAT challenge consisted of three main steps:
- Iterative Proportional Fitting (IPF) to select an oversized, high-quality subset.
- Greedy Trimming to reduce the subset to the target size by removing the worst-fitting samples.
- Iterative Refinement to polish the final dataset by swapping samples for better-fitting ones.

Step 1: Iterative Proportional Fitting (IPF)
The first step in my post-processing pipeline was to get a strong initial subset from the oversampled pool (2.5 million generated rows). For this, I used Iterative Proportional Fitting (IPF).
IPF is a classical statistical algorithm used to adjust a sample distribution to match a known set of marginals. In this case, I wanted the synthetic data's bivariate (2-column) distributions to match those of the original data. I also tested uni- and trivariate distributions, but I found that focusing on the bivariate relationships yielded the best performance while being computationally fast.
Here's how it worked:
- I identified the 5,000 most correlated column pairs in the training data using mutual information. These are the most important relationships to preserve.
- IPF then calculated fractional weights for each of the 2.5 million synthetic rows. The weights were adjusted iteratively so that the weighted bivariate distributions in the synthetic pool matched the target distributions from the training data.
- Finally, I used an expectation-rounding approach to convert these fractional weights into an integer count of how many times each row should be selected. This resulted in an oversized subset of 125,000 rows (1.25x the required size) that already had very strong bivariate accuracy.
The IPF step provided a high-quality starting point for the next phase.
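The snippet below is a minimal sketch of the weighting idea, not my exact competition code: in each sweep, the row weights are rescaled per tracked column pair so that the weighted cell distribution of the pool moves toward the training target, and the fractional weights are finally rounded into integer copy counts. The data layout and all names are assumptions made for illustration.

import numpy as np

def ipf_weights(pool_cells, target_probs, n_rows, n_iters=50):
    """Illustrative IPF over bivariate cells.

    pool_cells:   list of 1-D int arrays; pool_cells[k][i] is the joint category
                  ("cell") of pool row i for column pair k.
    target_probs: list of 1-D float arrays; target_probs[k][c] is the training
                  probability of cell c for column pair k.
    n_rows:       number of rows to select (e.g. 125_000).
    Returns fractional row weights that approximately reproduce the targets.
    """
    n_pool = len(pool_cells[0])
    w = np.full(n_pool, n_rows / n_pool)                 # start from uniform weights
    for _ in range(n_iters):
        for cells, target in zip(pool_cells, target_probs):
            # Current weighted distribution of this column pair in the pool
            current = np.bincount(cells, weights=w, minlength=len(target))
            current /= current.sum()
            # Rescale each row so this pair's marginal moves toward the target
            ratio = np.where(current > 0, target / np.maximum(current, 1e-12), 1.0)
            w *= ratio[cells]
        w *= n_rows / w.sum()                            # keep the total mass fixed
    return w

# Expectation rounding (sketch): take floor(w) copies of each row, then round the
# fractional remainders up at random with probability equal to the remainder,
# so the expected total matches the desired subset size.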
Step 2: Trimming
Generating an oversized subset of 125,000 rows from IPF was a deliberate choice that enabled this additional trimming step to remove samples that didn't fit well.
I used a greedy approach that iteratively calculates the "error contribution" of each row in the current subset. The rows that contribute the most to the statistical distance from the target distribution are identified and removed. This process repeats until only 100,000 rows remain, ensuring that the worst 25,000 rows are discarded.
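A toy version of this greedy loop, simplified to a single discretized statistic and with hypothetical names, could look like this; the real pipeline aggregates the error contributions over many uni-, bi-, and trivariate statistics.

import numpy as np

def greedy_trim(cells, target_counts, keep_n, batch=1000):
    """Illustrative greedy trimming on one discretized statistic.

    cells:         1-D int array; cells[i] is the cell (joint category) of row i.
    target_counts: desired count per cell at the final subset size.
    keep_n:        number of rows to keep (e.g. 100_000).
    Repeatedly drops the rows whose cells are most over-represented, i.e. the
    rows that contribute most to the L1 error.
    """
    keep = np.ones(len(cells), dtype=bool)
    while keep.sum() > keep_n:
        current = np.bincount(cells[keep], minlength=len(target_counts))
        surplus = current - target_counts                    # positive = over-represented cell
        contrib = surplus[cells].astype(np.float64)          # each row's error contribution
        contrib[~keep] = -np.inf                             # ignore rows already removed
        n_drop = int(min(batch, keep.sum() - keep_n))        # remove in batches for speed
        worst = np.argpartition(contrib, -n_drop)[-n_drop:]
        keep[worst] = False
    return np.flatnonzero(keep)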
Step 3: Refinement (Swapping)
The final step was an iterative refinement process to swap rows from the subset with better rows from the much larger, unused data pool (the remaining 2.4 million rows).
In each iteration, the algorithm:
- Identifies the worst rows within the current 100k subset (those contributing most to the L1 error).
- Searches for the best replacement candidates from the outside pool that would reduce the L1 error if swapped in.
- Performs the swap if it leads to a better overall score.
Because the accuracy of the synthetic sample is already quite high at this point, the additional gain from this process is fairly small.
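To give an idea of the swap test itself, here is a toy check for a single statistic (hypothetical names, counts changing by ±1); the actual refinement evaluates the change across all tracked uni-, bi-, and trivariate statistics before committing a swap.

def swap_improves(counts, target, cell_out, cell_in):
    """Would replacing a subset row in cell `cell_out` with a pool row in
    cell `cell_in` reduce the L1 error of this one statistic?"""
    if cell_out == cell_in:
        return False
    before = abs(counts[cell_out] - target[cell_out]) + abs(counts[cell_in] - target[cell_in])
    after = abs(counts[cell_out] - 1 - target[cell_out]) + abs(counts[cell_in] + 1 - target[cell_in])
    return after < before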
Adapting for the Sequential Challenge
The SEQUENTIAL challenge required a similar approach, but with two changes. First, a sample consists of several rows, connected by a group ID. Second, the competition metric adds a measure of coherence. This means that not only do the statistical distributions have to match, but the sequences of events also need to be similar to the source dataset.

My post-processing pipeline was adapted to handle groups and also optimize for coherence:
- Coherence-Based Pre-selection: Before optimizing for statistical accuracy, I ran a specialized refinement step. This algorithm iteratively swapped entire groups (sequences) to specifically match the coherence metrics of the original data, such as the distribution of “unique categories per sequence” and “sequences per category”. This ensured that the post-processing continued from a sound sequential structure.
- Refinement (Swapping): The 20,000 groups selected for coherence then went through the same statistical refinement process as the flat data. The algorithm swapped entire groups with better ones from the pool to minimize the L1 error of the uni-, bi-, and trivariate distributions. A secret ingredient was to include the “Sequence Length” as a feature, so the group lengths are also considered in the swapping.
This two-stage approach ensured the final dataset was strong in both statistical accuracy and sequential coherence. Interestingly, the IPF-based approach that worked so well for the flat data was less effective for the sequential challenge. Therefore, I removed it to focus computing time on the coherence and swapping algorithms, which yielded better results.
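To illustrate the kind of group-level statistics involved, the sketch below computes per-sequence features such as the sequence length and the number of unique categories per sequence (the column names are assumptions); matching the distributions of such per-group features against the training data is what the coherence pre-selection and the length-aware swapping aim for.

import pandas as pd

def sequence_features(df: pd.DataFrame, group_col: str = "group_id", cat_col: str = "category") -> pd.DataFrame:
    """Illustrative per-sequence coherence features (assumed column names)."""
    g = df.groupby(group_col)
    return pd.DataFrame({
        "seq_len": g.size(),                 # "Sequence Length", also used as a swap feature
        "unique_cats": g[cat_col].nunique(), # "unique categories per sequence"
    })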
Making It Fast: Key Optimizations
The post-processing strategy by itself was computationally expensive, and making it run within the competition time limit was a challenge in itself. To succeed, I relied on a few key optimizations.
First, I reduced the data types wherever possible to handle the huge sample data pool without running out of memory. Changing the numerical type of a large matrix from 64-bit to 32- or 16-bit greatly reduces the memory footprint.
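As a quick illustration of the effect (array shapes are made up for the example), downcasting a NumPy array halves or quarters its memory use, as long as the values still fit into the smaller type:

import numpy as np

# A 2,500,000 x 5,000 int64 matrix would occupy roughly 100 GB;
# int32 halves that, and int16 halves it again.
x64 = np.zeros((1_000, 1_000), dtype=np.int64)   # small toy shape for the demo
x32 = x64.astype(np.int32)
x16 = x64.astype(np.int16)
print(x64.nbytes, x32.nbytes, x16.nbytes)        # 8000000 4000000 2000000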
Second, when changing the data type was not enough, I used sparse matrices from SciPy. This technique allowed me to store the statistical contributions of each sample in an incredibly memory-efficient way.
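A sketch of that idea, assuming a layout with one row per pool sample and one column per tracked statistic cell (mostly zeros), using SciPy's CSR format:

import numpy as np
from scipy import sparse

# Toy contribution matrix: 5 pool rows, 6 statistic cells, two non-zeros per row.
rows = np.repeat(np.arange(5), 2)                    # row index of each non-zero entry
cols = np.array([0, 3, 1, 3, 0, 4, 2, 5, 1, 4])      # which cell each entry contributes to
data = np.ones(len(rows), dtype=np.int8)             # 1 byte per stored value
contrib = sparse.csr_matrix((data, (rows, cols)), shape=(5, 6))

# Cell counts of a candidate subset, without ever densifying the full pool:
subset_counts = np.asarray(contrib[[0, 2, 4]].sum(axis=0)).ravel()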
Lastly, the core refinement loop involved a number of specialized calculations, some of which were very slow with NumPy. To overcome this, I used Numba. After I extracted the bottlenecks in my code into dedicated functions decorated with @numba.njit, Numba automatically translated them into highly optimized machine code that runs at speeds comparable to C.
Here is an example of how I had to speed up the summation of rows in sparse matrices, which was a major bottleneck in the original NumPy version.
import numpy as np
import numba

# This can make the logic run hundreds of times faster.
@numba.njit
def _rows_sum_csr_int32(data, indices, indptr, rows, K):
    """
    Sum CSR rows into a dense 1-D vector without creating
    intermediate SciPy / NumPy objects.
    """
    out = np.zeros(K, dtype=np.int32)
    for r in rows:
        start = indptr[r]
        end = indptr[r + 1]
        for p in range(start, end):
            out[indices[p]] += data[p]
    return out
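For context, a possible way to call this function, assuming the per-sample statistics are stored in a SciPy CSR matrix with int32 data: pass the raw CSR buffers together with the row indices to sum.

import numpy as np
from scipy import sparse

mat = sparse.csr_matrix(np.array([[1, 0, 2],
                                  [0, 3, 0],
                                  [4, 0, 5]], dtype=np.int32))
rows_to_sum = np.array([0, 2], dtype=np.int64)
totals = _rows_sum_csr_int32(mat.data, mat.indices, mat.indptr, rows_to_sum, mat.shape[1])
# totals -> array([5, 0, 7], dtype=int32)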
However, Numba isn't a silver bullet; it's helpful for numerical, loop-heavy code, but for most calculations, it is faster and easier to stick with vectorized NumPy operations. I advise you to only try it when a NumPy approach doesn't reach the required speed.
Final Thoughts

Even though ML models are getting ever stronger, I believe that for many problems that data scientists try to solve, the secret ingredient is often not in the model. Of course, a strong model is an integral part of a solution, but the pre- and post-processing are equally important. For these challenges, a post-processing pipeline targeted specifically at the evaluation metric led me to the winning solution, without any additional ML.
I learned a lot in this challenge, and I want to thank Mostly AI and the jury for their great job in organizing this fantastic competition.
My code and solutions for both challenges are open-source and can be found here: