A slimmed-down training pipeline from Kimina Prover, with core features and full compatibility with verl.
We're happy to introduce kimina-prover-rl, an open-source training pipeline for formal theorem proving in Lean 4, based on a structured reasoning-then-generation paradigm inspired by DeepSeek-R1.
This training pipeline is a simplified version of the system we used to train Kimina Prover, preserving its key components and offering full compatibility with the open-source Verl framework.
It's released as part of a fork of Verl containing the entire training recipe in recipe/kimina-prover-rl, allowing anyone to reproduce our experiments or adapt the setup to their own models and datasets. All the information needed to set up and launch the pipeline can be found in the recipe's README.
As a result of this training pipeline, we're releasing two models:
- AI-MO/Kimina-Prover-RL-1.7B, a 1.7B-parameter model that achieves 76.63% Pass@32 on the MiniF2F benchmark, setting a new state of the art for open-source models in this size category
- AI-MO/Kimina-Prover-RL-0.6B, a 0.6B-parameter model that achieves 71.30% Pass@32 on the MiniF2F benchmark, also setting a new state of the art for open-source models in its size category.
Introduction
kimina-prover-rl is a training pipeline designed to teach large language models to solve formal proof goals in Lean 4, using a two-stage output structure: a natural language reasoning trace followed by the corresponding Lean code.
This paradigm, inspired by DeepSeek-R1, enables the model to separate planning from execution, promoting explainability, error recovery, and stronger generalization.
To train models under this reasoning framework, we apply GRPO, a reinforcement learning approach tailored for LLMs. This open-source version of the Kimina Prover training pipeline is implemented using the RL library Verl.
During the rollout phase of GRPO, the model generates N outputs for each prompt. A reward of 1 is assigned to any output whose Lean code is successfully verified by Lean using our kimina-lean-server.
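As a minimal sketch of this reward, assuming a hypothetical `verify_lean` helper that wraps a call to kimina-lean-server (the helper's name and signature are illustrative, not the recipe's actual API), the per-rollout reward can be computed like this:

```python
import re

def extract_lean_code(response: str) -> str | None:
    """Return the contents of the first ```lean4 ...``` block, if any."""
    match = re.search(r"```lean4\n(.*?)```", response, re.DOTALL)
    return match.group(1) if match else None

def compute_reward(response: str, verify_lean) -> float:
    """Binary reward: 1.0 if the extracted Lean code verifies, else 0.0.

    `verify_lean` is a hypothetical callable (e.g., wrapping kimina-lean-server)
    that returns True when Lean 4 accepts the proof.
    """
    code = extract_lean_code(response)
    if code is None:
        return 0.0  # no Lean code block found
    return 1.0 if verify_lean(code) else 0.0
```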
Two main features are added to this framework:
- A format checking reward to teach the model to structure its outputs
- An error correction turn to encourage the model to learn from failure signals
Kimina-Client
During training, a large number of Lean 4 proof candidates have to be verified concurrently. To handle this efficiently, we need a high-throughput verification system.
To meet this need, Numina and Kimi have developed an open-source server called kimina-lean-server, which supports parallel proof checking at scale using Lean 4.
To simplify integration, we also provide kimina-client, a lightweight Python package (available on PyPI) that offers a clean interface for interacting with the server's API.
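As an illustration only, verifying a batch of candidates might look like the sketch below; the class and method names (`KiminaClient`, `check`) are assumptions on our part, so refer to the kimina-client README for the actual interface:

```python
# Illustrative sketch: the client class and method names below are assumptions,
# not a guaranteed API; see the kimina-client / kimina-lean-server docs.
from kimina_client import KiminaClient

client = KiminaClient()  # assumed to point at a local kimina-lean-server

candidates = [
    "import Mathlib\n\ntheorem two_add_two : 2 + 2 = 4 := by norm_num",
    "import Mathlib\n\ntheorem broken : 1 + 1 = 3 := by rfl",
]

# Assumed batch entry point: submit all snippets and get one result each.
results = client.check(candidates)
for snippet, result in zip(candidates, results):
    print(snippet.splitlines()[-1], "->", result)
```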
Dataset
We train using Kimina-Prover-Promptset, a curated subset of the NuminaMath-LEAN dataset.
For this training setup, we filter and preprocess the dataset as follows:
- Remove easy problems with a historical win rate above 0.5, keeping only difficult statements in the dataset.
- Generate variants of existing problems using Gemini to increase diversity.
- Duplicate hard problems to give them more weight during training.
The resulting dataset contains difficult, high-value problems for improving Lean 4 theorem proving models.
NuminaMath-LEAN-RL is also the dataset used to train AI-MO/Kimina-Prover-RL-1.7B and AI-MO/Kimina-Prover-RL-0.6B.
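As a rough sketch of the filtering and reweighting steps above, using the Hugging Face datasets library (the dataset ID and column names such as `win_rate` are assumptions for illustration and may not match the actual schema):

```python
from datasets import load_dataset, concatenate_datasets

# Dataset ID and column names are illustrative assumptions.
ds = load_dataset("AI-MO/NuminaMath-LEAN", split="train")

# Keep only difficult statements: drop problems with a historical win rate above 0.5.
hard = ds.filter(lambda ex: ex["win_rate"] <= 0.5)

# Problem variants generated with Gemini would be merged in at this point.

# Give the hardest problems more weight by duplicating them.
hardest = hard.filter(lambda ex: ex["win_rate"] == 0.0)
promptset = concatenate_datasets([hard, hardest]).shuffle(seed=42)
```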
Example input format:
Think about and solve the following problem step by step in Lean 4.
# Problem:
Find all primes that are the difference of the fourth powers of two integers.
# Formal Statement:
```lean4
import Mathlib

theorem number_theory_4487 : {p : ℕ | p.Prime ∧ ∃ a b, p = a ^ 4 - b ^ 4} = ∅ := by
```
Format reward
The core idea of our reasoning training pipeline is to structure the LLM output into two stages, one thinking block followed by one lean4 code block:
- A reasoning block (`<think> ... </think>`)
- A Lean 4 code block
For example:
<think>
To prove the statement, we use induction on n.
The base case is trivial, and the inductive step follows by applying the hypothesis.
</think>

```lean4
theorem my_thm : ∀ n, f n = g n := by
  intro n
  induction n with
  | zero => simp
  | succ n ih => simp [ih]
```
Each rollout is checked to ensure that this format is respected. If the output is malformed (e.g., missing the `<think>` block or misplacing the code), the model receives a zero reward, regardless of whether the proof is actually valid.
This enforces consistency and teaches the model to structure its outputs reliably.
In kimina-prover, these checks go beyond simply verifying the presence of `<think>` and `lean4` blocks:
- Ensuring there is exactly one `<think>` block and one `lean4` code block per output.
- Rejecting outputs with repetitive reasoning lines, which often indicate hallucinated or degenerate generations.
- Checking that tactic blocks inside the thinking section are present in sufficient number and contain enough non-comment lines.
- Applying thresholds on comment density (in both the reasoning and the Lean code) to penalize overly verbose or boilerplate outputs.
- Comparing the semantic alignment between tactics described in the thinking blocks and the final Lean code using a matching score (e.g., Intersection-over-Union or subcode coverage).
- Penalizing unnecessarily long responses, encouraging the model to use tokens more efficiently while still producing complete replies.
Only generations that pass all these checks are considered well-formatted and can receive a reward. This structured filtering improves training stability and encourages clean reasoning.
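A heavily simplified version of this format gate is sketched below; it covers only the first two checks (a single `<think>` block and a single `lean4` block) plus a naive repetition filter, and the threshold value is a placeholder rather than the one used in the recipe:

```python
import re
from collections import Counter

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
LEAN_RE = re.compile(r"```lean4\n(.*?)```", re.DOTALL)

def is_well_formatted(response: str, max_repeat: int = 5) -> bool:
    """Sketch of a format gate: exactly one <think> block, exactly one
    lean4 block, and no reasoning line repeated more than `max_repeat`
    times (the threshold is a placeholder, not the recipe's actual value)."""
    thinks = THINK_RE.findall(response)
    leans = LEAN_RE.findall(response)
    if len(thinks) != 1 or len(leans) != 1:
        return False
    lines = [l.strip() for l in thinks[0].splitlines() if l.strip()]
    if lines and Counter(lines).most_common(1)[0][1] > max_repeat:
        return False  # degenerate, repetitive reasoning
    return True
```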
Error correction
To make training more informative, we have added an error correction mechanism that gives the model a chance to fix its own failed proofs.
When a rollout fails (e.g., due to a Lean error or an incorrect proof), we:
- Store the full prompt, response, and Lean feedback.
- Create a new training sample where the model is explicitly prompted to revise its previous reasoning/code.
This encourages the model to learn from failure signals, since Lean feedback is provided during training.
It also enables multi-turn interaction chains, where feedback from Lean is injected as part of the prompt, and the model is rewarded for successfully debugging its own output.
Because multi-turn responses can get long, we allow only one error-fix turn and cap the error message at a fixed number of tokens.
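A sketch of how such a second-turn sample might be assembled is shown below; the prompt wording and the feedback token cap are placeholders, since the exact template used in the recipe is not reproduced here:

```python
def build_error_correction_prompt(prompt, response, lean_feedback,
                                  tokenizer, max_feedback_tokens=1000):
    """Sketch: turn a failed rollout into a second-turn training sample.

    The template text and the feedback token cap are illustrative; the
    recipe defines its own prompt format and limit.
    """
    # Truncate Lean's error output to a fixed token budget.
    ids = tokenizer.encode(lean_feedback)[:max_feedback_tokens]
    truncated_feedback = tokenizer.decode(ids)

    return (
        f"{prompt}\n\n"
        f"# Previous attempt:\n{response}\n\n"
        f"# Lean feedback:\n{truncated_feedback}\n\n"
        "The proof above failed to verify. "
        "Revise your reasoning and provide a corrected Lean 4 proof."
    )
```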
Overview of the pipeline
The work Understanding R1-Zero-Like Training: A Critical Perspective claims there is an optimization bias in GRPO that results in artificially longer responses, especially for incorrect outputs.
We also noticed this behaviour during our experiments, so we used DrGRPO for our optimization. DrGRPO aggregates token-level losses by normalizing with a global constant to eliminate length bias.
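The difference in aggregation can be illustrated with a small masked-loss sketch: the GRPO-style mean divides each response's token losses by its own length, while the DrGRPO-style mean divides every response by the same global constant (e.g., the maximum generation length), removing the incentive toward longer incorrect responses. This is a conceptual sketch, not verl's actual loss code:

```python
import torch

def grpo_token_mean(token_loss: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Per-response length normalization (length-biased).

    token_loss, mask: [batch, seq_len]; mask is 1 on response tokens."""
    per_seq = (token_loss * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return per_seq.mean()

def drgrpo_token_mean(token_loss: torch.Tensor, mask: torch.Tensor,
                      max_len: int) -> torch.Tensor:
    """Normalize every response by the same global constant (e.g., max length)."""
    per_seq = (token_loss * mask).sum(dim=-1) / max_len
    return per_seq.mean()
```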
The configuration file provided in the repository is for an 8-GPU setup.
The model we are finetuning is AI-MO/Kimina-Prover-Distill-1.7B, a finetuned version of Qwen/Qwen3-1.7B trained on cold-start data generated from our AI-MO/Kimina-Prover-72B model.
At every step, 256 samples are fetched from the training dataset; one out of two is an error correction sample. We generate 8 rollouts per sample, for a total of 2048 generations. You can increase this to 16 or 32 rollouts if you are using more than one node.
We evaluate the model every 5 training steps, using the best@8 metric from verl to keep validation steps fast. You can increase this to best@16 or best@32 if you are using more than one node. We evaluate performance before and after the error correction turn: for each failed response, we allow the model one more attempt to fix its proof.
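For reference, the settings described above, gathered into a plain Python dict (the key names are ours for readability, not the field names of the recipe's actual verl config):

```python
# Key names are illustrative; the recipe's YAML config uses verl's own schema.
train_config = {
    "n_gpus": 8,                    # single-node, 8-GPU setup
    "samples_per_step": 256,        # half are error-correction samples
    "rollouts_per_sample": 8,       # 256 * 8 = 2048 generations per step
    "eval_every_n_steps": 5,
    "eval_metric": "best@8",        # raise to best@16/32 on multi-node setups
    "error_fix_turns": 1,           # one retry per failed response
    "base_model": "AI-MO/Kimina-Prover-Distill-1.7B",
}
```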
Results
After just a few training steps, we observe a consistent improvement in performance. In this section we discuss the training metrics after 48 hours of training on 8 H100 GPUs.
By step 85, the pipeline improves the model's accuracy by 4 points, reaching 70% for the best@8 metric and 74% after the error correction turn.
In parallel, we observe that the number of format errors steadily decreases over the course of training, indicating that the model is learning to produce structurally valid outputs.
Finally, and as expected under the DeepSeek-R1-style training setup, the average token length of the model's outputs increases with training, a signal that the model is learning to reason in longer, more structured traces.
After training, we evaluated the model using pass@32 with and without error fixing. We were able to improve the performance of our 1.7B model by more than 3% at pass@32 on MiniF2F:
| Model | Pass@32 | Pass@32 with error fixing |
|---|---|---|
| AI-MO/Kimina-Prover-Distill-1.7B | 72.95% | 75.41% |
| AI-MO/Kimina-Prover-RL-1.7B | 76.23% | 77.87% |
Using this training pipeline, we also finetuned a 0.6B model, improving its performance by more than 2%.
| Model | Pass@32 |
|---|---|
| AI-MO/Kimina-Prover-Distill-0.6B | 68.85% |
| AI-MO/Kimina-Prover-RL-0.6B | 71.30% |
Conclusion
With Kimina-Prover-RL, we provide a lightweight yet powerful reinforcement learning pipeline for training Lean 4 theorem provers.
By combining structured reasoning, format rewards, and error correction, we achieve state-of-the-art results for open-source models in the 0.6B–1.7B parameter range.
Alongside the models, we are also releasing a fork of Verl containing the complete training recipe in recipe/kimina-prover-rl, so the community can reproduce our results or adapt the pipeline to their own datasets and models.
We hope this release will serve as a solid foundation for the community to experiment with RL training in formal reasoning, and to push the boundaries of open-source automated theorem proving in Lean 4.




