“The development of mathematics toward greater precision has led, as is well known, to the formalization of large tracts of it, so that one can prove any theorem using nothing but a few mechanical rules.”
— K. Gödel
In Part 1, we built a proof checker and developed a mental model for why we can trust proofs that come out of an LLM: as long as we have formalized reasoning and a sound verifier, a “few mechanical rules” are all we need. So how do we get an LLM to generate valid proofs?
As DeepSeek beautifully showed, the same intuition behind AI learning the game of Go works for AI learning how to reason, as long as reasoning can be checked (and we now know it can). In this second part we put our verifier to good use and build an end-to-end RL training loop to fine-tune an open-source model to produce proofs in the language we introduced in Part 1: at a glance, the following figure shows the basic ingredients of the flow.
TL;DR: after some machine-human collaboration to generate a dataset (leveraging our checker as a sanity check on LLM-generated examples), we run an RL loop on Tinker to do LoRA-style fine-tuning of open-source models. We prompt the model with (1) how our language works, (2) how to apply rules to construct proofs, and (3) how to format answers so that they are easy to parse. Every proof is then run through the proof checker, and the reward gets propagated back to improve the model's abilities: ideally, the model will start with mostly failing proof attempts, and then get progressively better as training progresses.
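To make (1)-(3) concrete, here is a minimal sketch of what such a prompt could look like; the actual wording, rule names and output format live in the companion repo, so everything below is an illustrative assumption:

```python
# Hypothetical prompt template: the real prompt in the repo may differ.
PROMPT_TEMPLATE = """You are a theorem prover for a simple natural-deduction language.
Available rules: and_intro, and_elim, or_intro, or_elim, not_intro, not_elim, ...

Premises: {premises}
Conclusion: {conclusion}

Write one proof step per line as: <line_no>. <formula> (<rule>, <cited_lines>)
End your answer with QED on its own line."""


def build_prompt(premises: list[str], conclusion: str) -> str:
    # Premises and conclusion come straight from a dataset example (see below).
    return PROMPT_TEMPLATE.format(premises=", ".join(premises), conclusion=conclusion)
```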
Note that while the series specifically targets mathematical reasoning, verifiable proofs are fundamental in building confidence in distributed software systems. As some experts have argued, AI may well be the missing ingredient for formal verification at scale!
Buckle up, clone the repo, and code along. If you skipped the first part, you can read it here!
Dataset generation
“People think mathematics is complicated. Mathematics is the simple bit. It’s the stuff we can understand. It’s cats that are complicated.” — J. Conway
To get a reward to improve our model, we need examples of proofs in the first place: ideally, we would like a mix of easy and hard proofs, written in our own reasoning language. We can't just generate random strings in our alphabet, because we'd like the model to try to prove things that we know are provable in the first place! How do we bootstrap the process?
Our training mix is a combination of three sources:
- A manual translation of exercises (premises -> conclusion) taken from forallx, which we assume to be provable;
- A manual translation of exercises (premises -> conclusion) taken from Language, Proof and Logic, which we assume to be provable;
- A corpus of proofs generated by a strong LLM (Sonnet by Anthropic). Since we cannot assume that LLM-generated premises -> conclusion tuples are correct, we prompt the LLM to also generate a full proof, which (you guessed it!) gets checked by our proof checker before being added to the training set.
A single observation in the dataset looks like the following object:
{"premises": ["P", "Q"], "conclusion": "P and Q", "num_steps": 1}
i.e., a set of premises, a conclusion and how many steps Sonnet took to generate a valid proof: premises and conclusion will end up in the prompt during RL (as we will ask the model to find a proof of the conclusion from the premises), and num_steps is a convenient value to print out some statistics on the perceived difficulty of the training set (assuming for simplicity that the length of a proof loosely correlates with its difficulty).
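As a concrete sketch of how the third source gets filtered, this is how one could validate LLM-generated candidates with the Part 1 checker and bucket the result by difficulty; the module name `proof_checker` and the `check_proof` helper are assumptions for illustration, not the repo's actual API:

```python
import json
from collections import Counter

# Hypothetical import: the checker from Part 1 may expose a different interface.
from proof_checker import check_proof


def load_examples(path: str) -> list[dict]:
    """Keep only candidates whose accompanying proof passes the checker."""
    kept = []
    with open(path) as f:
        for line in f:
            ex = json.loads(line)
            # LLM-generated examples carry a candidate proof to validate;
            # textbook exercises are trusted and have no proof attached.
            proof = ex.get("proof")
            if proof is None or check_proof(ex["premises"], ex["conclusion"], proof):
                kept.append(ex)
    return kept


def difficulty_histogram(examples: list[dict]) -> Counter:
    """Bucket examples by num_steps, a rough proxy for difficulty."""
    return Counter(ex.get("num_steps", 0) for ex in examples)
```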
Reinforcement Learning on Tinker
“The best way to have a good idea is to have a lot of ideas.”
— attributed to L. Pauling
We are now ready to get our own, smaller, open-source LLM for Vibe Proving. There are many recipes and services online to perform RL on open-source models, but we picked Tinker because it promises to abstract away the infrastructure and most of the boilerplate required (it's also the new kid on the block, so it's a chance to try it out!).
The training loop itself doesn’t have many surprises:
- Sample: given the prompt and a tuple (premises->conclusion), we ask the model to generate multiple proof attempts.
- Verify: we run each attempt through the proof checker.
- Reward: valid proofs (i.e. proofs that are fully parseable and logically correct) get reward 1, any other outcome gets 0 (‘Do or do not’, indeed). Note that we also check that the generated proof has the same (premises -> conclusion) as our request, to keep the LLM from gaming the system by always producing a trivially correct proof (see the sketch after this list).
- Update: we adjust the model weights to make successful proofs more likely.
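To make the Verify and Reward steps concrete, here is a minimal sketch of the reward logic; the `parse_proof` / `check_proof` interface to the Part 1 checker is an assumption, and the actual code in the repo may be shaped differently:

```python
# Hypothetical interface to the Part 1 checker: names and signatures are assumptions.
from proof_checker import parse_proof, check_proof


def proof_reward(completion: str, premises: list[str], conclusion: str) -> float:
    """Binary reward: 1.0 for a parseable, valid proof of the requested statement,
    0.0 for anything else (parse errors, invalid steps, wrong goal)."""
    proof = parse_proof(completion)  # assume it returns None on malformed output
    if proof is None:
        return 0.0
    # Guard against reward hacking: the proof must target the *requested*
    # premises and conclusion, not some other trivially provable statement.
    if proof.premises != premises or proof.conclusion != conclusion:
        return 0.0
    return 1.0 if check_proof(premises, conclusion, proof.steps) else 0.0
```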
Following Tinker’s own guidelines, we chose to experiment with MoE models in a few sizes: gpt-oss-20b, gpt-oss-120b and Qwen3-30B-A3B-Instruct-2507. During training, logs and proofs are stored in the training_logs folder: at the end, our (vibe coded!) app can be used to visualize the metric trends and inspect the generated proofs.

If you are using an AI assistant to monitor the training (which I experimented with for the first time with this project), an interesting data slice to track is the proofs from textbooks, since they tend to be tricky. For example, the following is a status update from Claude Code:

How good is our vibe proving?
Across a few runs and a bit of tinkering with the parameters, we always end up with models that can prove the vast majority of the generated examples, but struggle on some textbook proofs. It's instructive and slightly amusing to inspect the generated proofs.
On the success side, this is an attempt at proving De Morgan's law, i.e. showing how to go from ['not A or not B'] to 'not (A and B)', by first assuming 'A and B' and proving a contradiction:
- not A or not B (premise)
- | A and B (subproof)
- | A (2)
- | B (2)
- || not A (nested subproof, from 1)
- || ~ (3,5)
- || not B (nested subproof)
- || ~ (4,7)
- | ~ (1, 5-6, 7-8)
- QED
On the failure side, no model successfully proved 'C or D' from 'A or B', 'not A or C', 'not B or D', struggling to properly manage nested subproofs and apply the rule of explosion, as shown in this trace:
- A or B (premise)
- not A or C (premise)
- not B or D (premise)
- | A (subproof)
- || not A (nested subproof)
- || ~ (4,5)
- | C (5-6) ← ERROR
- ….
How easy was Tinker?
Our small proof of concept is hardly a stress test for a training service at scale, but it was enough to get some grounded impressions of the system.
The combination of good public examples, Claude-friendly documentation and hardware abstraction made for a pleasant, gentle introduction to RL, at a reasonable cost (all the experiments for this blog post cost $60 or so, including initial runs that – in hindsight! – were obviously a waste of time and money!).
Once you get the hang of it and start to run a few jobs in parallel, the lack of monitoring and observability becomes an issue: sometimes my runs slowed down significantly (getting try_again responses for a long time, as if the system was overloaded), and some jobs failed at some point for unclear reasons (but, sure enough, you can restart from a previous checkpoint). Considering the reasonable price and the prototype nature of my workloads, none of these issues outweighed the pros, and I walked away with a positive enough Tinker experience that I would definitely use it again for a future project.
See you, RL cowboys!
“We do these things not because they are easy, but because we thought they were going to be easy.” — Anonymous
While Tinker indeed makes the training process (mostly) seamless, the devil is still in the (RL) details: we barely scratched the surface so far, as our goal was to go from zero to a Vibe Proving stack, not to optimize the RL recipe.
The good news is that the flow is fairly modular, so that all components could be improved and tinkered with (sort of) independently:
- model choice: model type, model size, provider …
- training parameters: pick learning rate, batch size, LoRA rank …
- code abstractions: re-write the code with RL Envs …
- prompt optimization: better instructions, simpler formatting, useful in-context examples, …
- dataset optimization: more diverse examples, curriculum learning (not only varying the proof difficulty, but for instance starting with proofs that are complete apart from one missing step, then proofs with two missing steps, etc., until the model must fill in the entire proof; see the sketch after this list) …
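As an illustration of that last idea, here is a minimal sketch of how one could build curriculum examples by masking the final steps of a known-valid proof; the `steps` representation and the helper name are assumptions, not the repo's actual code:

```python
# Hypothetical curriculum builder: reuse known-valid proofs (lists of step strings)
# and hide the last k steps, asking the model to complete them.
def make_curriculum_example(premises: list[str], conclusion: str,
                            steps: list[str], missing: int) -> dict:
    """Return a training example where the last `missing` steps must be filled in."""
    missing = min(missing, len(steps))
    return {
        "premises": premises,
        "conclusion": conclusion,
        "partial_proof": steps[: len(steps) - missing],  # shown in the prompt
        "num_missing_steps": missing,  # the curriculum difficulty knob
    }

# Stage 1 of the curriculum: every proof has exactly one missing step;
# later stages increase `missing` until the model writes the whole proof.
```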
In the same vein, our own custom proof language is certainly not enough to get truly interesting results: we could improve on it, but getting to something actually usable would require an astounding amount of work. For these reasons, you're better off migrating to a purpose-built language, such as Lean: importantly, now that you know about proofs-as-formalized-reasoning, the same mental model carries over to a language that's (way) more expressive. Moreover, Lean has pretty much the same style for writing down proofs, i.e. rules for introducing and eliminating logical operators.
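As a taste of what that looks like, here is the same De Morgan direction we proved above, written as a short Lean 4 proof: the structure mirrors our trace (assume 'A and B', case-split on the disjunction, derive a contradiction in each branch). Treat it as a sketch rather than idiomatic Lean:

```lean
-- From ¬A ∨ ¬B we derive ¬(A ∧ B): assume A ∧ B and refute it with each disjunct.
example (A B : Prop) (h : ¬A ∨ ¬B) : ¬(A ∧ B) := by
  intro hab                       -- assume A ∧ B (our subproof at line 2)
  cases h with
  | inl hna => exact hna hab.1    -- ¬A together with A yields a contradiction
  | inr hnb => exact hnb hab.2    -- ¬B together with B yields a contradiction
```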
In other words, once we nail the math behind Vibe Proving and build an initial RL harness, what's left is good ol' engineering.
Acknowledgements
Thanks to Patrick John Chia, Federico Bianchi, Ethan Rosenthal, Ryan Vilim, and Davis Treybig for precious feedback on previous versions of this draft.
If you like the intersection of genAI, reasoning about distributed systems and verification, you can also check out our research at Bauplan.
AI coding assistants were used to write the companion repository, but no assistant was used to write the text (except for proof-reading and typo correction).
