Apriel-1.6-15b-Thinker: Cost-efficient Frontier Multimodal Performance




We release Apriel-1.6-15b-Thinker, a 15-billion-parameter multimodal reasoning model in ServiceNow’s Apriel SLM series that achieves SOTA performance against models 10 times its size. Apriel-1.6 builds on top of Apriel-1.5-15b-Thinker with a deep focus on improving text and vision reasoning while also improving token efficiency. This version was trained on NVIDIA DGX™ Cloud with GB200 Grace™ Blackwell Superchips.

Apriel-1.6 scores 57 on the Artificial Analysis Intelligence Index, outperforming models like Gemini 2.5 Flash, Claude Haiku 4.5, and GPT OSS 20b. It obtains a rating on par with Qwen3 235B A22B while being significantly more efficient. This latest release improves or maintains task performance compared with the previous Apriel-1.5-15B-Thinker [1], while reducing reasoning token usage by more than 30%.

Artificial Analysis Intelligence Index (30 Nov '25)



Mid-Training

We follow the same overall training process used for Apriel-1.5-15B-Thinker, which consists of a depth-upscaling phase followed by two Continual Pretraining (CPT) stages (detailed in [1]). The depth-upscaling corpus consists of 35% data from diverse sources, including high-quality web content, scientific and technical literature, mathematical problem sets, and programming code; 15% high-quality datasets from NVIDIA Nemotron™; and the remaining 50% pretraining-style data serving as replay.

For Apriel-1.6-15B-Thinker, we expand the Stage-1 CPT mixture, which focuses on strengthening textual reasoning and image understanding, with additional text-only samples and image-text pairs. The new text data is fully synthetic, covering general reasoning, knowledge, coding, and creative writing, while the multimodal portion spans document and chart understanding, OCR, visual-reasoning tasks, and SVG/web-code synthesis.

Following Stage-1, we perform a text-only CPT run at an extended 49K sequence length and then run Stage 2 to further refine the model’s visual-reasoning capabilities. This combination produced a strong base model that provided a solid foundation for subsequent post-training. Training for this mid-training pipeline required roughly 10,000 GPU hours on NVIDIA’s GB200s, a small compute footprint enabled by their high throughput and aligned with our goal of building strong models with limited resources through careful data strategy and training methodology.



Post-Training

Using the midtrained model, we perform post-training following a pipeline that consists of large-scale Supervised Finetuning (SFT) and Reinforcement Learning (RL) targeting both vision and text abilities.



Supervised Finetuning (SFT)

Our Supervised Fine-Tuning (SFT) stage focuses on improving the reasoning quality of Apriel-1.6 by training on a meticulously curated dataset of 2.4 million high-signal text samples. Each example includes explicit, step-by-step reasoning traces, enabling the model to internalize transparent reasoning processes rather than merely reproducing final answers.

To construct this dataset, we combined execution-verifiable synthetic samples for math, coding, and scientific problem-solving with a broad mixture of instruction-following, conversational, API/function-calling, creative writing, safety, and other knowledge-intensive samples. Data quality was treated as a first-class priority: every sample passed through multi-stage de-duplication, content filtering, heuristic quality pruning, LLM-as-Judge validation, execution-based verification (where applicable), and strict decontamination against evaluation benchmarks.
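
As an illustration of the last step, a minimal n-gram-overlap decontamination filter might look like the sketch below; the 13-gram window and any-overlap criterion are our assumptions for illustration, not the exact production filter:

```python
# Minimal sketch of n-gram decontamination against evaluation benchmarks.
# The 13-gram window and any-overlap criterion are illustrative assumptions;
# the actual pipeline may use different heuristics.
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(sample: str, benchmark_ngrams: Set[Tuple[str, ...]], n: int = 13) -> bool:
    # Flag a training sample if it shares any n-gram with any benchmark item.
    return bool(ngrams(sample, n) & benchmark_ngrams)


def decontaminate(samples: Iterable[str], benchmarks: Iterable[str]) -> list:
    benchmark_ngrams = set().union(*(ngrams(b) for b in benchmarks))
    return [s for s in samples if not is_contaminated(s, benchmark_ngrams)]
```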

SFT was carried out in two phases, each trained at a 32K context length. In the first phase, we ran a large-scale text-only training run on the 2.4M samples for 4 epochs. Compared with Apriel-1.5-15b-Thinker, we simplified the chat template by removing redundant tags and introduced four special tokens to the tokenizer, including [BEGIN FINAL RESPONSE] and <|end|>, for easier output parsing.
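
As an illustration, parsing a generation with the new delimiters can be as simple as the sketch below; only [BEGIN FINAL RESPONSE] and <|end|> are named in this post, so treating everything before the final-response marker as the reasoning trace is our assumption:

```python
# Minimal sketch: splitting a generation into reasoning trace and final answer
# using the special tokens named above. Handling of the other two special
# tokens (not named in this post) is deliberately omitted.
def parse_output(generation: str) -> tuple:
    marker, end = "[BEGIN FINAL RESPONSE]", "<|end|>"
    reasoning, _, rest = generation.partition(marker)
    answer = rest.split(end, 1)[0]
    return reasoning.strip(), answer.strip()


reasoning, answer = parse_output(
    "Let me verify: 2 + 2 = 4. [BEGIN FINAL RESPONSE] 4 <|end|>"
)
assert answer == "4"
```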

The second phase was a lightweight multimodal run trained for 3 epochs, using rejection-sampled data from Apriel-1.5-15b-Thinker to ensure the model maintained strong performance on image inputs after the introduction of these special tokens, while also preparing it for downstream RL stages.
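
A minimal sketch of such a rejection-sampling loop is shown below; `generate` and `verify` are stand-ins for the actual teacher inference and verification code, and the per-prompt sampling count is an assumption:

```python
# Hypothetical sketch of rejection sampling from a teacher model:
# sample several candidates per prompt and keep only verified ones.
def rejection_sample(prompts, generate, verify, n_samples: int = 8):
    # generate(prompt) -> candidate response (e.g., from Apriel-1.5-15b-Thinker)
    # verify(prompt, response) -> bool (e.g., answer matching on image tasks)
    kept = []
    for prompt in prompts:
        for _ in range(n_samples):
            response = generate(prompt)
            if verify(prompt, response):
                kept.append({"prompt": prompt, "response": response})
                break  # keep the first verified candidate per prompt
    return kept
```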

This approach provided us with a strong, high-quality SFT foundation on top of which our RL pipeline could operate effectively. The resulting model exhibits strong multimodal understanding, improved text reasoning capabilities, and enhanced agentic behavior.



Reinforcement Learning (RL)

We adopt a multi-stage RL setup that focuses on simultaneously improving reasoning capability and efficiency.
We train the model on image domains such as visual reasoning, general visual question answering (VQA), and optical character recognition (OCR). Our training data also covers text domains such as simple questions (to encourage short, direct answers on easy queries), math (numerical reasoning), STEM (multiple-choice scientific questions), and function calling (structured tool use).

Rewards are given for correctness of the response, together with penalties for undesirable behavior such as verbosity and incorrect formats. Overall, our setup is designed to improve the model’s reasoning ability while using fewer reasoning tokens, encouraging it to avoid unnecessary intermediate steps, stop earlier when confident, and answer more directly for simpler queries.
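
A minimal sketch of such a rule-based reward is below; the specific penalty weights and the token budget are illustrative assumptions, not the values used in training:

```python
# Illustrative sketch of the reward shaping described above: reward
# correctness, penalize bad formatting and excessive reasoning length.
# The weights (0.5, 0.1) and the 4096-token budget are assumptions.
def reward(is_correct: bool, well_formatted: bool,
           n_reasoning_tokens: int, token_budget: int = 4096) -> float:
    r = 1.0 if is_correct else 0.0
    if not well_formatted:
        r -= 0.5  # format penalty, e.g., missing response delimiters
    if n_reasoning_tokens > token_budget:
        # Verbosity penalty grows with the overage (capped), nudging the
        # model to stop early when confident and answer simple queries directly.
        overage = (n_reasoning_tokens - token_budget) / token_budget
        r -= min(0.5, 0.1 * overage)
    return r
```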

Training is done with the Group Sequence Policy Optimization (GSPO) loss [2] using the VeRL framework and rule-based verification.
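
For reference, GSPO [2] optimizes a clipped, group-relative objective in which the PPO-style token-level importance ratio is replaced by a length-normalized, sequence-level ratio:

```latex
% GSPO objective as defined in [2]: clipped policy gradient over groups of G responses.
\[
J_{\mathrm{GSPO}}(\theta) =
\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}
\min\Bigl(s_i(\theta)\,\hat{A}_i,\;
\operatorname{clip}\bigl(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\bigr)\,\hat{A}_i\Bigr)\right]
\]
% Length-normalized sequence-level importance ratio and group-normalized advantage:
\[
s_i(\theta) = \left(\frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)}\right)^{1/|y_i|},
\qquad
\hat{A}_i = \frac{r(x, y_i) - \operatorname{mean}\bigl(\{r(x, y_j)\}_{j=1}^{G}\bigr)}
{\operatorname{std}\bigl(\{r(x, y_j)\}_{j=1}^{G}\bigr)}
\]
```

Here, G responses y_i are sampled per prompt x from the old policy, r is the rule-based reward described above, and the group-normalized advantage removes the need for a learned value function.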



Evaluation



Text Evaluation

We evaluate Apriel-1.6 on various domains such as tool use, math, coding, instruction following, and long context.

* This score is with DCA enabled. Without it, the model scores 36.

** The average score is calculated using all benchmarks except BFCL v3 Only and DeepResearchBench, since some models do not have scores for these two benchmarks.

*** The AA LCR score for o3-mini-high is a projected score based on its AA Index score.



Image Evaluation

We evaluate the Apriel-1.6 model on a representative set of evaluations with a primary focus on mathematical reasoning, visual question answering, logical reasoning, STEM-related tasks, and chart-based reasoning. All evaluations are done using VLMEvalKit. Apriel-1.6 improves on its predecessor by 4 points on the average of the 13 benchmarks of the Image Index, comprising the following benchmarks: MathVision, MathVista, MMMU (validation), MMMU-Pro (10-choice CoT), MMMU-Pro (Vision-only CoT), MathVerse (Vision Dominant), MathVerse (Text Dominant), MMStar, BLINK, LogicVista, CharXiV (descriptive), CharXiV (reasoning), AI2D (test).

Performance on the Image Index
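
As a usage sketch, VLMEvalKit evaluations are typically launched through the toolkit's run.py entry point, as below; the dataset identifiers are standard VLMEvalKit names, while the model name is a hypothetical registry entry:

```python
# Sketch of launching VLMEvalKit on a subset of the Image Index benchmarks.
# "Apriel-1.6-15B-Thinker" is a hypothetical model name; it must match an
# entry registered in VLMEvalKit's model config.
import subprocess

datasets = ["MathVista_MINI", "MMStar", "MMMU_DEV_VAL", "AI2D_TEST"]
subprocess.run(
    ["python", "run.py", "--data", *datasets, "--model", "Apriel-1.6-15B-Thinker"],
    check=True,
)
```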



Cost-Efficient Frontier Performance

Intelligence vs Total Parameters (30 Nov '25)

Apriel-1.6-15B-Thinker sits in the sweet spot of the cost-efficient frontier. It delivers intelligence scores that rival or surpass much larger models while using only 15B parameters. On the chart, it sits firmly inside the most attractive quadrant, balancing efficiency with top-tier reasoning. In practice, this means Apriel-1.6-15B-Thinker offers strong performance and deep reasoning at a fraction of the compute and deployment cost of heavyweight competitors, making it an exceptionally efficient choice for real-world use, especially in enterprise applications.

Intelligence vs Output Tokens Used in Artificial Analysis Intelligence Index (30 Nov '25)

Our post-training focuses heavily on improving reasoning-token efficiency. The chart above, showing intelligence score against token usage, highlights the effectiveness of our post-training. Apriel-1.6-15B-Thinker again lands in the most attractive quadrant. The model reaches a high Artificial Analysis Intelligence Index score while using far fewer tokens than many similarly capable or larger models. Compared to Apriel-1.5-15b-Thinker [1], we reduce token usage by over 30%.

Overall, Apriel-1.6 is a highly capable reasoner that maintains the memory and efficiency characteristics required for enterprise deployment.



Acknowledgements

We gratefully acknowledge the following people for their contributions: Varun Pandey, Shashank Maiya, Dhruv Jhamb, Massimo Caccia, Dheeraj Vattikonda, Nicolas Gontier, Patrice Bechard, Tayfun Tuna, Kavya Sriram, Denis Akhiyarov, Hari Subramani, Tara Bogavelli.



Notes and Limitations

We’re a small lab with big goals. While we are not GPU poor, our lab has, by comparison, a tiny fraction of the compute available to other frontier labs. Our goal with this work is to show that a SOTA model can be built with limited resources if you have the right data, design, and solid methodology.

We set out to build a small but powerful model, aiming for capabilities on par with frontier models. Developing a 15B model with this level of performance requires tradeoffs, so we prioritized SOTA-level performance and improved reasoning-token efficiency.

This model is trained to perform extensive reasoning for difficult questions and to spend less reasoning effort on simpler questions. We are always actively working to make our models more efficient and concise in future releases.

The model has a few vision-related limitations to be aware of. Complex or low-quality images can reduce OCR accuracy; dense scenes (like crowds or many similar objects) can make subtle details and counting more difficult; and highly detailed or unusually formatted charts may occasionally result in imperfect interpretations. It may also be less precise with fine-grained visual grounding, so bounding-box predictions can sometimes be approximate or inconsistent.



References

[1] Radhakrishna, S., Tiwari, A., Shukla, A., Hashemi, M., Maheshwary, R., Malay, S.K.R., Mehta, J., Pattnaik, P., Mittal, S., Slimi, K., Ogueji, K., Oladipo, A., Parikh, S., Bamgbose, O., Liang, T., Masry, A., Mahajan, K., Mudumba, S.R., Yadav, V., Madhusudhan, S.T., Scholak, T., Davasam, S., Sunkara, S. and Chapados, N., 2025. Apriel-1.5-15b-Thinker. arXiv preprint arXiv:2510.01141.

[2] Zheng, C., Liu, S., Li, M., Chen, X.-H., Yu, B., Gao, C., Dang, K., Liu, Y., Men, R., Yang, A., Zhou, J. and Lin, J., 2025. Group Sequence Policy Optimization. arXiv preprint arXiv:2507.18071.


