New training method boosts AI multimodal reasoning with smaller, smarter datasets




Researchers at MiroMind AI and several Chinese universities have released OpenMMReasoner, a new training framework that improves the multimodal reasoning capabilities of language models.

The framework uses a two-stage process. It first refines a base model with a curated dataset in a supervised fine-tuning (SFT) stage. Then, a reinforcement learning (RL) stage guides the model to reason more effectively in tasks that involve both text and visual data.

Experiments show that models trained with OpenMMReasoner outperform other leading visual reasoning models, often while being trained on a smaller, higher-quality dataset. The framework and all its assets, including a trained 7B model, are fully open source, providing a reliable foundation for building applications that require traceability and robustness.

According to Kaichen Zhang, co-author of a research paper that outlines the new method, OpenMMReasoner offers significant advantages for businesses looking beyond large, closed systems. "A smaller open-source reasoning model has practical benefits: Enterprises can deploy it locally, reduce latency, lower token costs related to long chains of thought, maintain full control over their data and [it is] fine-tunable to adapt to their specific downstream task," he told VentureBeat.

The challenge of transparent multimodal reasoning

Recent advances in reinforcement learning with verifiable rewards (RLVR) have significantly improved the reasoning abilities of large language models (LLMs). RLVR trains LLMs to generate chain-of-thought (CoT) tokens (which mimic the reasoning processes humans use) before producing the final answer. This improves the model's ability to solve complex reasoning tasks such as math and coding.

Motivated by this success, researchers have applied similar RL-based methods to large multimodal models (LMMs), showing that the advantages can extend beyond text to enhance visual understanding and problem-solving across different modalities.

However, a lack of transparency in the training pipeline has been a significant barrier. Many studies on multimodal reasoning don't provide detailed information about their data curation and training processes, making it difficult to reproduce their results or understand what makes these models work.

“This lack of openness restricts reproducibility and obscures a deeper understanding of how reasoning-capable LMMs are actually built and how their training dynamics evolve,” the researchers note.

The OpenMMReasoner recipe

OpenMMReasoner addresses this gap with a fully transparent and scalable training recipe built on open-source LMMs. The researchers found it was critical to curate high-quality datasets by scaling data diversity. Although using diverse data sources is important, increasing the variety of correct answers for the same question proved to be an essential axis of improvement.

The first stage of the recipe is a three-step supervised fine-tuning (SFT) pipeline. It begins with data sourcing, where the team collected roughly 103,000 raw question-answer pairs from public datasets covering general visual Q&A and reasoning tasks. Next, they added a data distillation step, using a strong model (Qwen3-VL-235B-Instruct) to generate new, high-quality reasoning traces for selected questions. (This data is then used to train a smaller model.)

To increase answer diversity, the team generated multiple verified reasoning traces for each question. This expanded the dataset to 583,000 samples. Finally, they implemented a "domain mixing" phase, adding data from mathematical reasoning domains to further generalize the model's capabilities, resulting in a final SFT dataset of 874,000 examples.
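The answer-diversity step can be pictured with a short sketch. The helpers below (`teacher_generate`, `verify`) are hypothetical stand-ins rather than the paper's code; the idea is simply to sample several teacher traces per question and keep only those whose final answer checks out.

```python
def teacher_generate(question: str) -> tuple[str, str]:
    """Hypothetical teacher-model call: returns (reasoning_trace, final_answer)."""
    # In practice this would query a large model such as Qwen3-VL-235B-Instruct.
    return f"Step-by-step reasoning for: {question}", "4"

def verify(predicted: str, gold: str) -> bool:
    """Keep a trace only if its final answer matches the gold answer."""
    return predicted.strip() == gold.strip()

def expand_with_verified_traces(dataset, k: int = 8):
    """For each (question, gold) pair, collect up to k verified reasoning traces."""
    expanded = []
    for question, gold in dataset:
        for _ in range(k):
            trace, answer = teacher_generate(question)
            if verify(answer, gold):
                expanded.append({"question": question,
                                 "trace": trace,
                                 "answer": answer})
    return expanded

samples = expand_with_verified_traces([("What is 2 + 2?", "4")], k=3)
```

With a real (stochastic) teacher, each kept trace would reach the same verified answer by a different reasoning path, which is what expands the dataset several-fold.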

The second stage is an RL recipe that uses a smaller, 74,000-sample dataset curated from domains like science, math and puzzles. The model is trained with a composite reward function that considers both the correctness of the final answer and the consistency of the output format. To improve efficiency, the method includes a penalty for "overthinking," discouraging the model from generating excessively long answers (a problem with many reasoning models trained through RL, which mistakenly learn to generate overly long reasoning sequences, leading to excess cost and slower answers).
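A composite reward of this shape might look like the following. This is a hedged sketch, not the authors' implementation: it assumes the final answer is wrapped in `\boxed{...}`, scores exact-match correctness, adds a small format bonus, and subtracts a length penalty for tokens beyond a budget.

```python
import re

def composite_reward(response: str, gold_answer: str,
                     max_tokens: int = 2048,
                     length_penalty: float = 0.001) -> float:
    """Illustrative reward = correctness + format bonus - overthinking penalty."""
    # Format check: expect the final answer inside \boxed{...}
    match = re.search(r"\\boxed\{(.+?)\}", response)
    predicted = match.group(1).strip() if match else ""

    correct = 1.0 if predicted == gold_answer.strip() else 0.0
    format_bonus = 0.1 if match else 0.0

    # Penalize only the tokens that exceed the budget, discouraging overthinking
    n_tokens = len(response.split())  # crude whitespace proxy for token count
    overflow = max(0, n_tokens - max_tokens)
    penalty = length_penalty * overflow

    return correct + format_bonus - penalty
```

Because the penalty applies only past the budget, a concise correct answer keeps the full reward, while an equally correct but rambling one scores strictly lower.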

This recipe can provide a blueprint for enterprises training their own models. "For companies with limited domain-specific data, a feasible strategy is to first increase answer diversity for their existing dataset, then use domain mixing to integrate this domain data into a general reasoning recipe like ours," Zhang explained. "This allows the model to acquire strong general-purpose reasoning skills while also adapting to industry-specific tasks, without needing millions of samples."

A more efficient and capable reasoning model

According to Zhang, the step-by-step process fundamentally changes the reliability of the model's outputs. "Traditional models often 'jump' directly to an answer, which means they explore only a narrow portion of the reasoning space," he said. "In contrast, a reasoning-first approach forces the model to explicitly examine multiple intermediate steps… [allowing it] to traverse much deeper paths and arrive at answers with far more internal consistency."

The researchers used the OpenMMReasoner recipe to generate data to fine-tune the Qwen2.5-VL-7B-Instruct open-source vision-language model. The result is a highly capable LMM that consistently outperforms state-of-the-art methods, such as Open Vision Reasoner (OVR), across a wide range of multimodal reasoning benchmarks. The SFT stage alone creates a strong baseline model that achieves superior performance and data efficiency compared to other SFT approaches, despite using a significantly smaller training dataset.

The subsequent RL phase further sharpens and stabilizes these abilities, resulting in more consistent and improved performance. After RL, the final model achieves state-of-the-art results on several benchmarks, including WeMath, MathVerse and MathVista.

One of the key findings was that, as the model improved at multimodal reasoning, it also showed a "gradual emergence of textual reasoning behaviors, suggesting a transfer of reasoning competence from multimodal to purely linguistic domains," the researchers note. This indicates that skills learned in one modality can strengthen performance in another.

"Our results show that strengthening multimodal reasoning can even improve text-only mathematical skills—evidence that core logical abilities can transfer across modalities," Zhang said. "Looking ahead, we do expect these methods to extend to video and audio."

The researchers also found that token efficiency is crucial. While allowing a model to generate longer reasoning steps can improve performance, excessive tokens reduce efficiency. Their results show that setting a smaller "reasoning budget" can achieve comparable or even higher accuracy, an important consideration for deploying cost-effective enterprise applications.
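In deployment terms, a reasoning budget is simply a cap on decode length. The sketch below is illustrative only; `next_token` is a hypothetical stand-in for one decoding step of any model, and the `</answer>` stop token is an assumption.

```python
def decode_with_budget(next_token, budget: int = 256, stop: str = "</answer>"):
    """Decode until the stop token appears or the token budget is exhausted."""
    tokens = []
    while len(tokens) < budget:
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == stop:  # model finished its answer within budget
            break
    return tokens

# A fake one-step decoder that emits the stop token after five reasoning tokens
def fake_step(tokens):
    return "</answer>" if len(tokens) == 5 else "think"

full = decode_with_budget(fake_step, budget=256)  # stops naturally
capped = decode_with_budget(fake_step, budget=4)  # budget exhausted first
```

The finding above suggests that tightening `budget` trades away little or no accuracy while directly cutting per-request token cost and latency.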

By open-sourcing all components of their workflow, the researchers provide a reproducible view of the entire process. For enterprise teams, this transparency is invaluable. "For business leaders concerned about vendor lock-in, hidden biases or opaque data sources, this level of transparency is essential," Zhang stated. "It empowers teams to validate the data, customize the pipeline for new domains and maintain long-term independence from any single provider."


