How NuminaMath Won the first AIMO Progress Prize




This year, Numina and Hugging Face collaborated to compete in the 1st Progress Prize of the AI Math Olympiad (AIMO). This competition involved fine-tuning open LLMs to solve difficult math problems that high school students use to train for the International Math Olympiad. We’re excited to share that our model — NuminaMath 7B TIR — was the winner and managed to solve 29 out of 50 problems on the private test set 🥳!

kaggle.png

In this blog post, we introduce the Numina initiative and the technical details behind our winning solution. If you want to skip straight to testing out the model with your hardest math problems, check out our demo.

Let’s dive in!

  1. Introducing Numina – an open AI4Maths initiative
  2. The AI Math Olympiad (AIMO) prize
  3. Our winning solution for the 1st Progress Prize
  4. The training recipe
  5. Good data is all you need
  6. Taming the variance with Self-Consistency and Tool Integrated Reasoning (SC-TIR)
  7. Avoiding the curse of overfitting
  8. Other ideas we tried
  9. Numina’s future – looking for contributors and partners!
  10. Acknowledgements



Introducing Numina – an open AI4Maths initiative

There’s something very special about mathematics.

Mathematics is a domain accessible to everyone, even to children long before they can read. One of the greatest mathematicians of all time, Srinivasa Ramanujan, was self-taught, born into a modest family in India in 1887. For others, mathematics ranges from recreation to occupation and everything in between.

Mathematics is essential to humanity, being the backbone upon which we have built everything from commerce to iPhones and nuclear power plants. Yet even solving maths problems for a critical application can be a playful experience.

Pure mathematics transcends intelligence like an endless ocean only the mind can sail.

This is why, when we started Numina, going open-source and open-dataset was the natural option. As with human intelligence, we believe that progress in artificial intelligence for maths should be universal. If the computer is a bicycle for the mind, artificial intelligence is its engine – opening new horizons for the Ramanujans of our time.

With the initial support of Mistral AI, Numina was founded in late 2023 by a collective passionate about AI and mathematics (Jia Li, Yann Fleureau, Guillaume Lample, Stan Polu and Hélène Evain), inspired by the AI Math Olympiad (AIMO) competition initiated by Alex Gerko and XTX Markets.

In early 2024, the Numina team was reinforced by two LLM fine-tuning experts from Hugging Face (👋 Lewis Tunstall and Ed Beeching) to tackle the 2024 AIMO Progress Prize. We also received additional support from General Catalyst and Answer.ai, and by March 2024, Numina had gathered a team of top talent from all around the world.

With the team in place, it was time to tackle the AIMO challenge!



The AI Math Olympiad (AIMO) prize

Every year, high school students from all around the world compete in the International Math Olympiad – a competition to solve six challenging problems across domains like algebra, geometry, and number theory. To give you a sense of the difficulty involved, here is one of last year’s problems:

imo-problem.png

In November 2023, the AIMO Prize was launched to drive the open development of AI models that excel in mathematical reasoning. A grand prize of $5M will be awarded to whoever can create an AI model that can win a gold medal in the IMO. Alongside the grand prize, AIMO has introduced a series of progress prizes to mark milestones toward this ultimate goal. The first progress prize was held as a Kaggle competition, with problems that are less difficult than those in the IMO but at the level of IMO preselection. Here’s an example problem, which is somewhat easier to solve than the IMO example above, but still tricky for LLMs:

Let k, l > 0 be parameters. The parabola y = kx^2 - 2kx + l intersects the line y = 4 at two points A and B. These points are distance 6 apart. What is the sum of the squares of the distances from A and B to the origin?

The competition featured two sets of 50 problems, forming the public and private leaderboards, with the problems hidden from competitors. The problems, comparable in difficulty to AMC12 and AIME exams, require integer outputs for verification. The private leaderboard determined the final rankings. Competitors could submit solutions twice per day, using only open-weight models released before February 23. Each submission was allocated either a P100 GPU or 2xT4 GPUs and up to 9 hours to solve the 50 problems.

Given these constraints and rules, strategic choices were essential to develop our winning solution.



Our winning solution for the 1st Progress Prize

After much iteration throughout the competition, our solution to the 1st Progress Prize consisted of three main components:

  • A recipe to fine-tune DeepSeekMath-Base 7B to act as a “reasoning agent” that can solve mathematical problems via a mix of natural language reasoning and the use of the Python REPL to compute intermediate results.
  • A novel decoding algorithm for tool-integrated reasoning (TIR) with code execution feedback to generate solution candidates during inference.
  • A variety of internal validation sets that we used to guide model selection and avoid overfitting to the public leaderboard.

We used a combination of open-source libraries to train our models, notably TRL, PyTorch, vLLM, and DeepSpeed. On one node of 8 x H100 GPUs, our models took 10 hours to train.



The training recipe

Our fine-tuning recipe was largely based on the MuMath-Code paper, which involves training the model in two stages:

mumath.png

Two-stage training method from the MuMath-Code paper

  • Stage 1: Fine-tune the base model on a large, diverse dataset of natural language math problems and solutions, where each solution is templated with Chain of Thought (CoT) to facilitate reasoning.

  • Stage 2: Fine-tune the model from Stage 1 on a synthetic dataset of tool-integrated reasoning, where each math problem is decomposed into a sequence of rationales, Python programs, and their outputs. Here, we followed Microsoft’s ToRA paper and prompted GPT-4 to produce solutions in the ToRA format with code execution feedback. Fine-tuning on this data produces a reasoning agent that can solve mathematical problems via a mix of natural language reasoning and the use of the Python REPL to compute intermediate results (see screenshot below).

    tora.png

    Figure from the ToRA paper showing the tool-integrated reasoning format we trained our models with.

We performed “full fine-tuning” in both stages, where all model weights were updated during backpropagation. In other words, we did not use parameter-efficient techniques like LoRA or DoRA because we were not confident they would match the performance of full fine-tuning without significant experimentation. We used the “packing” feature from TRL’s SFTTrainer to concatenate multiple samples into a single chunk of 2048 tokens. All models were trained with gradient checkpointing and sharded with the DeepSpeed ZeRO-3 protocol to ensure the weights, gradients, and optimizer states could fit within the available VRAM. See below for the main hyperparameters we used in each stage:

                   Stage 1   Stage 2
learning rate      2.0e-5    2.0e-5
total batch size   32        32
block size         2048      1024
num epochs         3         4
lr scheduler       cosine    cosine
warmup ratio       0.0       0.1
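For readers who want a concrete starting point, here is a minimal sketch of what the Stage 1 fine-tuning could look like with TRL’s SFTTrainer and the hyperparameters above. The dataset id, text column, and per-device batch size are illustrative assumptions, not our exact training script, and the DeepSpeed ZeRO-3 sharding would be supplied via an accelerate config; argument names may differ slightly across TRL versions.

```python
# Minimal Stage 1 SFT sketch with TRL (packing + the hyperparameters above).
# Dataset id and per-device batch size are illustrative; ZeRO-3 comes from an accelerate config.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset id; assumes a "text" (or chat "messages") column.
dataset = load_dataset("AI-MO/numina-cot-dataset", split="train")

config = SFTConfig(
    output_dir="numina-math-7b-cot",
    learning_rate=2e-5,
    num_train_epochs=3,
    per_device_train_batch_size=4,   # 8 GPUs x 4 = total batch size of 32
    lr_scheduler_type="cosine",
    warmup_ratio=0.0,
    bf16=True,
    gradient_checkpointing=True,
    packing=True,                    # concatenate samples into fixed-length blocks
    max_seq_length=2048,
)

trainer = SFTTrainer(
    model="deepseek-ai/deepseek-math-7b-base",
    args=config,
    train_dataset=dataset,
)
trainer.train()
```

Stage 2 follows the same pattern on the tool-integrated reasoning dataset, with the block size, epoch count, and warmup ratio from the second column of the table.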

Our initial submissions used DeepSeek 7B models that were only fine-tuned on Stage 1, but we found the performance was quite limited, with 8/50 being our best score on the public leaderboard using maj@32. It was Abdur Rafae’s public prize notebook that prompted us to try integrating code execution into the training recipe. Initially, we focused on the Mixture of Minimal Optimal Sets (MMOS) dataset, as described in the notebook’s title. We found that using MMOS improved performance but was still capped at 16/50 on the public leaderboard with maj@32, likely because MMOS only consists of single-turn solutions (i.e., the model only generates a single Python program, which is insufficient for hard problems). We later realized that MMOS was a misnomer and that the Kaggle notebooks were actually running the DeepSeekMath 7B RL model, which is capable of multi-step reasoning and code execution.

At this point, we focused our efforts on producing a dataset similar to the one used by the DeepSeekMath Instruct / RL models, and this, along with the MuMath-Code recipe, led to significant improvements.

Let’s take a look at how we built these datasets.



Good data is all you need

In terms of the dataset, we extensively referred to DeepSeekMath and other scholars’ approaches, scaling them up significantly. This resulted in a fine-tuning dataset of several hundred thousand problem-solution pairs, covering topics from high school mathematics to competition-level mathematics. This dataset will be fully open-sourced over the next few weeks, potentially with larger models to see how well our recipe scales. Please refer to our upcoming dataset technical report for details on the dataset construction.

For the progress prize, we built two datasets to fine-tune our model.



Chain of Thought

This dataset consists of several hundred thousand problems, each with a solution written in a Chain of Thought manner. The sources range from Chinese high school math exercises to US and international mathematics olympiad competition problems. The data were primarily collected from online exam paper PDFs and mathematics discussion forums.

The processing steps include:

  1. OCR from the original PDFs.
  2. Segmentation into problem-solution pairs.
  3. Translation into English.
  4. Realignment to produce a Chain of Thought reasoning format.
  5. Final answer formatting.



Tool-integrated reasoning

Tool-integrated reasoning (TIR) plays a crucial role in this competition. However, collecting and annotating such data is both costly and time-consuming. To address this, we selected approximately 60,000 problems from the Numina dataset, focusing on those with numerical outputs, most of which are integers.

We then utilized a pipeline leveraging GPT-4 to generate TORA-like reasoning paths, executing the code and producing results until the solution was complete. We filtered out solutions where the final answer did not match the reference and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TORA data efficiently; a minimal sketch of one generation round is shown below.
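To make the generate-execute-filter loop concrete, here is a hedged sketch of one round. The `call_gpt4`, `extract_python_block`, `run_python`, and `extract_final_answer` helpers are hypothetical placeholders for the actual prompting, sandboxed execution, and parsing logic; only the overall structure reflects the description above.

```python
# Sketch of one round of TORA-style data generation with code execution feedback.
# Helper functions are hypothetical placeholders, not our production code.
MAX_TURNS = 4

def generate_tir_solution(problem: str, reference_answer: str):
    """Returns the full reasoning trace only if its final answer matches the reference."""
    messages = [{"role": "user", "content": problem}]
    reply = ""
    for _ in range(MAX_TURNS):
        reply = call_gpt4(messages)                      # rationale + ```python ...``` code block
        messages.append({"role": "assistant", "content": reply})
        code = extract_python_block(reply)
        if code is None:                                 # no more code -> model has stated a final answer
            break
        output = run_python(code)                        # sandboxed execution: stdout or traceback
        messages.append({"role": "user", "content": f"```output\n{output}\n```"})
    # Keep the trace only when the final answer matches the reference; otherwise discard it.
    return messages if extract_final_answer(reply) == reference_answer else None
```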

As a point of reference, here is the performance of our Stage 1 model NuminaMath-7B-CoT and final Stage 2 model NuminaMath-7B-TIR on the MATH benchmark compared to other open and proprietary models:

Model                         MATH (%)

Chain of Thought Reasoning
GPT-4 (2023)                  42.5
GPT-4o                        76.6
Claude 3.5 Sonnet             71.1
DeepSeekMath-7B-Instruct      46.8
DeepSeekMath-7B-RL            51.7
NuminaMath-7B-CoT             56.3

Tool-Integrated Reasoning
DeepSeekMath-7B-Instruct      57.4
DeepSeekMath-7B-RL            58.8
NuminaMath-7B-TIR             68.2

Performance on the MATH benchmark. All numbers, unless explicitly stated, are obtained with zero-shot greedy decoding.



Taming the variance with Self-Consistency and Tool Integrated Reasoning (SC-TIR)

As other competitors noted, this competition posed several challenges with respect to model submission and evaluation:

  • The evaluation API provides problems in random order, so tactics like early stopping produce high variance because one run might have more hard problems at the start, leaving less time for the remainder (and vice versa).
  • Most innovations in LLM inference require access to modern GPUs, so standard methods like Flash Attention 2 or torch.compile do not work on T4 GPUs. Similarly, modern data types like bfloat16 are not supported, which prompted us to explore post-training quantization methods like AWQ and GPTQ.

Initially, we used Abdur Rafae’s public notebook for our submissions, but found the high variance to be problematic. To address this, we took a different approach based on tool-integrated reasoning:

sc-tir.png

  1. For each problem, copy the input N times to define the initial batch of prompts to feed vLLM. This effectively defines the number of candidates used for majority voting.
  2. Sample N diverse completions until a complete block of Python code is produced.
  3. Execute each Python block and concatenate the outputs, including tracebacks if they appear.
  4. Repeat M times to produce a batch of generations of size N and depth M, allowing the model to self-correct code errors using the traceback. If a sample fails to produce sensible outputs (e.g., incomplete code blocks), prune that result.
  5. Postprocess the solution candidates and then apply majority voting to select the final answer (a minimal sketch of this loop follows the list).
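The sketch below shows the shape of this decoding loop with vLLM. The `extract_python_block`, `run_python`, and `extract_answer` helpers are hypothetical placeholders for the code-execution sandbox and answer parsing, and the stop sequence and prompt formatting are illustrative rather than our exact submission code.

```python
# SC-TIR decoding sketch: N candidates, depth M, code execution feedback, majority vote.
# extract_python_block(), run_python(), and extract_answer() are hypothetical placeholders.
from collections import Counter
from vllm import LLM, SamplingParams

llm = LLM(model="AI-MO/NuminaMath-7B-TIR", dtype="half")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=1024, stop=["```output"])

def sc_tir(problem: str, N: int = 48, M: int = 4):
    candidates = [problem] * N                     # N copies of the prompt
    answers = []
    for _ in range(M):                             # up to M reasoning/code turns
        outputs = llm.generate(candidates, params)
        next_round = []
        for prompt, out in zip(candidates, outputs):
            text = out.outputs[0].text
            code = extract_python_block(text)
            if code is None:                       # no code block -> candidate has a final answer
                answers.append(extract_answer(text))
                continue
            result = run_python(code)              # stdout or traceback fed back to the model
            next_round.append(prompt + text + f"```output\n{result}\n```\n")
        candidates = next_round
        if not candidates:
            break
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```

Candidates that never produce a sensible final answer simply drop out of the vote, which is how the pruning in step 4 is realized here.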

For our winning submission, we generated N=48 candidates with a depth of M=4. Increasing either parameter did not improve performance, so we took a conservative approach to stay within the time limit. In effect, this algorithm augments Self-Consistency with CoT (shown below) with tool-integrated reasoning.

(Figure: Self-Consistency with CoT and majority voting)

We found that our SC-TIR algorithm produced more robust results with significantly less variance on both our internal evaluations and the public leaderboard.

One technical detail worth mentioning is that we found it helpful to quantize the models to 8-bit precision. This was for three reasons:

  • It is very slow to upload models to the Kaggle Hub, and compressing the model made this step twice as fast.
  • T4 GPUs do not support bfloat16, and casting to float16 leads to a degradation in model performance. Casting to float32 was not possible as that exceeded the available GPU memory.
  • In addition, a 16-bit model consumes roughly 32GB of VRAM just to load the weights. With 2xT4s, this would have required manipulating the KV cache to run fast, and we found it beneficial to trade off model precision for speed.

We quantized our models using AutoGPTQ together with the training datasets for calibration. In practice, this led to a small drop in accuracy but provided the best compromise to accommodate the constraints imposed by evaluation on the Kaggle platform.
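For reference, 8-bit GPTQ quantization with AutoGPTQ looks roughly like the sketch below. The calibration dataset id, text field, group size, and sample count are illustrative assumptions rather than our exact settings.

```python
# Rough sketch of 8-bit GPTQ quantization with AutoGPTQ, calibrated on training samples.
# Dataset id, text field, group size, and sample count are illustrative assumptions.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from datasets import load_dataset
from transformers import AutoTokenizer

model_id = "AI-MO/NuminaMath-7B-TIR"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build calibration examples from the training data (a few hundred samples is typical).
calib = load_dataset("AI-MO/numina-tir-dataset", split="train[:256]")  # placeholder dataset id
examples = [
    tokenizer(sample["text"], truncation=True, max_length=2048, return_tensors="pt")
    for sample in calib
]
examples = [{"input_ids": e.input_ids, "attention_mask": e.attention_mask} for e in examples]

quantize_config = BaseQuantizeConfig(bits=8, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("numina-math-7b-tir-gptq-8bit")
```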



Avoiding the curse of overfitting

Overfitting to the public leaderboard is a common risk in Kaggle competitions, and even more so when the test set is just 50 problems. In addition, the rules allowed at most two submissions per day, making a robust internal validation dataset crucial for pacing our development. As specified by the AIMO team, the test problems are of intermediate difficulty, between AMC12 and AIME levels, with integer outputs.

To guide model selection, we used four internal validation sets to gauge the performance of our models on math problems of varying difficulty. To avoid potential contamination in the base model, we selected problems from AMC12 (2022, 2023) and AIME (2022, 2023, 2024) to create two internal validation datasets:

  • AMC (83 problems): We picked all the problems from AMC12 22 and AMC12 23 and kept those that can be converted to integer outputs. This results in a dataset of 83 problems. This validation set was designed to be representative of the private test set on Kaggle, since we knew from the competition description that the problems were of this level or harder. We found our models could solve about 60-65% of these problems. To measure the variance, we ran each evaluation with 5-10 different seeds and typically saw variations of around 1-3% with our SC-TIR algorithm.
  • AIME (90 problems): We picked all the problems from AIME 22, AIME 23, and AIME 24 to measure how well our models could perform on difficult problems, as well as to gauge the most common failure modes. As above, we ran each evaluation with 5-10 seeds to measure variation.

Due to the small size of the AMC/AIME validation sets, model performance on these datasets was susceptible to noise, much like the public leaderboard. To better assess our model’s performance, we also evaluated it on a subset of the MATH test set, which contains 5,000 problems. We retained only the problems with integer outputs, to simplify majority voting and mimic the competition evaluation (a filtering sketch follows the list below). This resulted in two additional validation sets:

  • MATH level 4 (754 problems)
  • MATH level 5 (721 problems)
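As a rough illustration, here is how one could filter the MATH test set down to integer-answer problems at levels 4 and 5. The dataset id and field names follow the common Hugging Face release of MATH, and the boxed-answer extraction is a simplified assumption rather than our exact preprocessing code.

```python
# Sketch: keep only MATH level 4/5 test problems whose boxed final answer is an integer.
# Dataset id and field names are assumptions based on the common HF release of MATH.
import re
from datasets import load_dataset

def boxed_answer(solution: str):
    """Return the contents of the last \\boxed{...} in a MATH solution, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def is_integer(answer) -> bool:
    if answer is None:
        return False
    try:
        int(answer.replace(",", ""))
        return True
    except ValueError:
        return False

math_test = load_dataset("hendrycks/competition_math", split="test")
for level in ("Level 4", "Level 5"):
    subset = math_test.filter(
        lambda ex: ex["level"] == level and is_integer(boxed_answer(ex["solution"]))
    )
    print(level, len(subset))   # roughly 754 and 721 problems, respectively
```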

By using these four validation sets, we were able to pick the most promising models across different training stages and narrow down the choice of hyperparameters. We found that combining small but representative validation sets with larger ones was useful in this particular competition, where each submission is subject to some stochasticity from sampling.



Other ideas we tried

As mentioned above, we tried a few approaches that were ultimately discarded in favor of the MuMath-Code recipe:

  • Training a pure CoT model and using majority voting for evaluation
  • Training an MMOS model to solve problems with Python in a single step

Another technique we tried was applying Kahneman-Tversky Optimisation (KTO) to new completions sampled from the SFT model. Here the approach was similar to OrcaMath, namely:

  • Sample 4 completions per problem from the SFT model, using interleaved rationales and code execution. We used the SFT dataset from Stage 2 as the source of prompts.
  • Extract the answer and compare it with the ground truth. If correct, label the sample as positive, else negative (see the labeling sketch after this list).
  • Apply KTO to the SFT model on this dataset.
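A minimal sketch of the labeling step is shown below, producing the prompt / completion / label format that TRL’s KTOTrainer expects. The `sample_completions` and `extract_answer` helpers are hypothetical placeholders for the sampling and answer-parsing logic described earlier.

```python
# Sketch: build a KTO dataset by sampling completions and labeling them against the ground truth.
# sample_completions() and extract_answer() are hypothetical placeholders.
from datasets import Dataset

def build_kto_dataset(problems: list, num_samples: int = 4) -> Dataset:
    rows = []
    for item in problems:                                  # item: {"prompt": ..., "answer": ...}
        for completion in sample_completions(item["prompt"], n=num_samples):
            rows.append(
                {
                    "prompt": item["prompt"],
                    "completion": completion,
                    # Positive if the extracted answer matches the reference, else negative.
                    "label": extract_answer(completion) == item["answer"],
                }
            )
    return Dataset.from_list(rows)
```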

We found this form of on-policy KTO produced a slightly better model than the SFT one (a few percentage points on our internal evaluations) and scored 27/50 on the public leaderboard.

One nice feature of KTO is that you can track the implicit reward during training, which really helps with debugging a run. For instance, here is one of our successful training logs, where one sees the chosen (i.e., correct solution) rewards increase over training while the rejected ones are suppressed.

kto.png

Unfortunately, we ran out of time to apply this method to our final SFT model, so it is possible we could have solved 1-2 more problems!

We also experimented with applying our SFT recipe to larger models like InternLM-20B, CodeLlama-33B, and Mixtral-8x7B but found that (a) the DeepSeek 7B model is very hard to beat due to its continued pretraining on math, and (b) inference is very slow on 2xT4 GPUs, and we experienced a number of mysterious timeouts whose root cause we could not trace.

Another failed experiment was trying to use reinforcement learning (specifically the Proximal Policy Optimization (PPO) and REINFORCE-leave-one-out (RLOO) algorithms) with code execution feedback and shaped rewards for writing code and getting correct/incorrect solutions. We applied this to the DeepSeekMath 7B RL model. While we saw some promising reward curves, we did not see any significant gains in performance. Given that online methods like RLOO are bottlenecked by text generation and slow to iterate with, we abandoned reinforcement learning in favor of experimenting with KTO.

rloo.png

On the inference side, we also experimented with:

  • Using a static KV cache and torch compilation. We found we were able to speed up generation in native transformers code by 2-3x on an H100, but hit a variety of cryptic errors on the Kaggle T4s, mostly due to the lack of support for model sharding with torch compilation in Accelerate (a short sketch of this setup follows the list).

  • A variety of model merging techniques like DARE, TIES, and WARP. Here we used mergekit to merge the SFT and KTO models, or the SFT models with the public DeepSeekMath ones. Overall, we found these merges led to significant regressions on our internal evaluations, and we ran out of time to explore this more deeply.
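For completeness, the static-cache plus compilation experiment looks roughly like the sketch below in recent transformers releases; the model id is the public DeepSeekMath checkpoint and the generation settings are illustrative.

```python
# Sketch: static KV cache + torch.compile for faster generation in native transformers.
# Requires a recent transformers release; generation settings here are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-math-7b-rl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Fix the cache shapes so the compiled graph does not re-specialize at every decoding step.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("What is the remainder when 2^10 is divided by 7?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```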



Numina’s future – searching for contributors and partners!

Following the initial success of Numina in winning the AIMO 2024 Progress Prize, we now aim to pursue our mission of fostering the development of artificial and human intelligence in the field of mathematics. You can visit our website to learn more about our projects, and please feel free to drop us a note at contact@projectnumina.ai.

Numina, like mathematics, is meant to be open to talents and supporters from all around the world who are willing to take mathematics further with AI!



Acknowledgements

We thank Thomas Wolf and Leandro von Werra for enabling the Numina and Hugging Face collaboration. We also thank Hugo Larcher for helping make the GPUs go brrrr on the Hugging Face cluster, Colin Raffel for his advice on model merging methods, and Omar Sanseviero for feedback on the blog post.

We also want to express our gratitude to Mistral AI, General Catalyst, Answer.AI, and the Beijing International Center for Mathematical Research @ Peking University, who supported the project from the beginning.

Finally, we thank the AIMO Prize team for launching such an exciting and provoking competition!


