“The development of mathematics toward greater precision has led, as is well known, to the formalization of large tracts of it, so that one can prove any theorem using nothing but a few mechanical rules.”
— K. Gödel
In Part 1, we built a proof checker and developed a mental model for why we should trust proofs that come out of an LLM: as long as we have formalized reasoning and a sound verifier, a “few mechanical rules” are all we need. So how do we train an LLM to generate valid proofs?
As DeepSeek beautifully showed, the same intuition behind AI learning the game of Go works for AI learning how to reason, as long as reasoning can be checked (and now we know it can). In this second part we put our verifier to good use and build an end-to-end RL training loop to fine-tune an open-source model to produce proofs in the language we introduced in Part 1: at a glance, the following figure shows the basic ingredients of the flow.
The full implementation: from dataset generation with Sonnet to the RL loop on Tinker. [ Image by the author ]
TL;DR: after some machine-human collaboration to generate a dataset (leveraging our checker as a sanity check on LLM-generated examples), we run an RL loop on Tinker to do LoRA-style fine-tuning of open-source models. We prompt the model with (1) how our language works, (2) how to apply rules to build proofs, and (3) how to format answers so they’re easy to parse. Every proof is then run through the proof checker, and the reward gets propagated back to improve the model’s abilities: ideally, the model starts out with mostly failing proof attempts, then gets progressively better as training progresses.
Note that while the series specifically targets mathematical reasoning, verifiable proofs are fundamental in building confidence in distributed software systems. As some experts argued, AI may well be the missing ingredient for proving software correctness at scale!
Buckle up, clone the repo, and code along. If you skipped the first part, you can read it here!
Dataset generation
“People think mathematics is complicated. Mathematics is the simple bit. It’s the stuff we can understand. It’s cats that are complicated.” — J. Conway
To get a reward signal to improve our model, we need examples of proofs in the first place: ideally, a mix of easy and hard proofs, written in our own reasoning language. We can’t just generate random strings in our alphabet, because we want the model to try to prove things that we know are provable! How do we bootstrap the process?
Our training mixture is a combination of three sources:
- A manual translation of exercises (premises->conclusion) taken from forallx, which we assume are solvable proofs;
- A manual translation of exercises (premises->conclusion) taken from Language, Proof and Logic, which we assume are solvable proofs;
- A corpus of proofs generated by a powerful LLM (Sonnet by Anthropic). Since we cannot assume that LLM-generated premises->conclusion tuples are correct, we prompt the LLM to generate a full proof, which (you guessed it!) gets checked by our proof checker before being added to the training set.
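The filtering step for the third source can be sketched as follows (a minimal sketch: `build_llm_dataset` and the `check_proof` signature are illustrative stand-ins for the repo’s actual helpers, not its real API):

```python
def build_llm_dataset(candidate_proofs, check_proof):
    """Keep only LLM-generated candidates whose proof passes the checker.

    `candidate_proofs` is a list of dicts with "premises", "conclusion"
    and "proof" (the full proof text generated by Sonnet); `check_proof`
    is assumed to return (is_valid, num_steps).
    """
    dataset = []
    for cand in candidate_proofs:
        ok, num_steps = check_proof(
            cand["premises"], cand["conclusion"], cand["proof"]
        )
        if ok:  # only verified proofs make it into the training set
            dataset.append({
                "premises": cand["premises"],
                "conclusion": cand["conclusion"],
                "num_steps": num_steps,
            })
    return dataset
```

Since an invalid candidate is silently dropped, the worst a hallucinating LLM can do is waste some generation budget: nothing unverified ever reaches the training set.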
A single observation in the dataset looks like the following object:
{"premises": ["P", "Q"], "conclusion": "P and Q", "num_steps": 1}
i.e., a set of premises, a conclusion, and the number of steps Sonnet took to generate a valid proof. The premises and conclusion end up in the prompt during RL (as we ask the model to find a proof of the conclusion from the premises), while num_steps is a convenient value for printing statistics on the perceived difficulty of the training set (assuming, for simplicity, that the length of a proof loosely correlates with its difficulty).
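A quick way to eyeball the difficulty mix is to bucket the dataset by num_steps (a toy sketch, assuming the dataset is a list of dicts shaped like the observation above):

```python
from collections import Counter

def difficulty_histogram(dataset):
    """Count examples per proof length, as a rough proxy for difficulty."""
    return Counter(obs["num_steps"] for obs in dataset)

# example usage: two one-step proofs and one longer proof
dataset = [
    {"premises": ["P", "Q"], "conclusion": "P and Q", "num_steps": 1},
    {"premises": ["P"], "conclusion": "P or Q", "num_steps": 1},
    {"premises": ["not A or not B"], "conclusion": "not (A and B)", "num_steps": 9},
]
print(difficulty_histogram(dataset))
```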
Reinforcement Learning on Tinker
“The best way to have a good idea is to have lots of ideas.”
— attributed to L. Pauling
We are now ready to get our own, smaller, open-source LLM for Vibe Proving. There are many recipes and services online to perform RL on open-source models, but we picked Tinker as it promises to abstract away the infrastructure and most of the boilerplate required (it is also the new kid on the block, so it’s a chance to test it out!).
The training loop itself doesn’t have many surprises:
- Sample: given the prompt and a tuple (premises->conclusion), we ask the model to generate multiple proof attempts.
- Verify: we run each attempt through the proof checker.
- Reward: valid proofs (i.e. proofs that are fully parseable and logically correct) get reward 1; every other outcome gets 0 (‘Do or do not‘, indeed). Note that we also check that the generated proof has the same (premises->conclusion) as our request, to prevent the LLM from easily gaming the system by always producing a trivially correct proof.
- Update: we adjust the model weights to make successful proofs more likely.
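The four steps above can be sketched as follows (a minimal sketch: `model.sample`, `parse_proof` and `check_proof` are hypothetical stand-ins for the repo’s actual helpers, and the Update step is delegated to Tinker):

```python
def rl_step(batch, model, parse_proof, check_proof):
    """One Sample -> Verify -> Reward iteration over a batch of tasks.

    `batch` is a list of (premises, conclusion) tuples; `parse_proof`
    is assumed to return a dict (or None if the attempt is unparseable).
    """
    rewards = []
    for premises, conclusion in batch:
        attempts = model.sample(premises, conclusion, n=8)  # Sample
        for attempt in attempts:
            parsed = parse_proof(attempt)                   # Verify: parse...
            ok = (
                parsed is not None
                and parsed["premises"] == premises          # ...block the
                and parsed["conclusion"] == conclusion      # trivial-proof hack
                and check_proof(parsed)                     # ...check the logic
            )
            rewards.append(1.0 if ok else 0.0)              # Reward: 1 or 0
    return rewards
```

Note how the premises/conclusion equality check comes before the logical check: a proof that is perfectly valid but answers the wrong question still earns zero reward.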
Following Tinker’s own guidelines, we choose to experiment with MoE reasoning models in a few sizes: gpt-oss-20b, gpt-oss-120b and Qwen3-30B-A3B-Instruct-2507. During training, logs and proofs are stored in the training_logs folder: at the end, our (vibe coded!) app can be used to visualize the metric trends and inspect the generated proofs.
Displaying training metrics from a 20b model using the vibe coded app. [ Screenshot from the author ]
If you are using an AI assistant to monitor the training (which I tried for the first time on this project), an interesting data slice to track is the proofs from textbooks, since they are designed to be tricky. As an example, the following is a status update from Claude Code:
AI-assisted monitoring, with a breakdown of performance over textbook examples. [ Screenshot from the author ]
How good is our vibe proving?
Across a few runs and a bit of tinkering with the parameters, we always end up with models that can prove the majority of the generated examples, but struggle on some textbook proofs. It is instructive and slightly amusing to inspect the generated proofs.
On the success side, this is an attempt at proving De Morgan’s law, i.e. showing how to go from ‘not A or not B’ to ‘not (A and B)’, by first assuming A and B and proving a contradiction:
- not A or not B (premise)
- | A and B (subproof)
- | A (2)
- | B (2)
- || not A (nested subproof, from 1)
- || ~ (3,5)
- || not B (nested subproof)
- || ~ (4,7)
- | (1, 5-6, 7-8)
- QED
On the failure side, no model successfully proved ‘C or D’ from ‘A or B’, ‘not A or C’, ‘not B or D’, struggling to properly manage nested subproofs and apply the rule of explosion, as shown by this trace:
- A or B (premise)
- not A or C (premise)
- not B or D (premise)
- | A (subproof)
- || not A (nested subproof)
- || ~ (4,5)
- | C (5-6) ← ERROR
- …
How easy was Tinker?
Our small proof of concept is hardly a stress test for a training service at scale, but it was enough to get some grounded impressions of the system.
The combination of good public examples, Claude-friendly documentation and hardware abstraction made for a pleasant, gentle introduction to RL, at a reasonable cost (all the experiments for the blog post cost $60 or so, including initial runs that – in hindsight! – were obviously a waste of time and money!).
When you get the hang of it and start to run a few jobs in parallel, the lack of monitoring and observability becomes an issue: sometimes my runs slowed down significantly (getting try_again responses for a long time, as if the system was overloaded), and some jobs failed at some point for unclear reasons (but, sure enough, you can restart from a previous checkpoint). Considering the reasonable price and the prototype nature of my workloads, none of these issues outweighed the pros, and I walked away with a positive enough Tinker experience that I would definitely use it again for a future project.
See you, RL cowboys!
“We do these things not because they are easy, but because we thought they were going to be easy.” — Anonymous
While Tinker indeed makes the training process (mostly) seamless, the devil is still in the (RL) details: we barely scratched the surface so far, as our goal was to go from zero to a Vibe Proving stack, not to optimize RL per se.
The good news is that the flow is fairly modular, so that all components could be improved and tinkered with (sort of) independently:
- model choice: model type, model size, provider …
- training parameters: pick learning rate, batch size, LoRA rank …
- code abstractions: re-write the code with RL Envs …
- prompt optimization: better instructions, easier formatting, useful in-context examples, …
- dataset optimization: more diverse examples, curriculum learning (not just varying the proof difficulty, but for example starting with proofs that are done except for one missing step, then proofs with two missing steps etc. until the model needs to fill the entire proof) …
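The curriculum idea in the last bullet can be sketched by hiding the tail of a known-valid proof (a toy sketch; `proof_lines` is any list of proof steps, and the helper name is illustrative):

```python
def curriculum_example(proof_lines, missing_steps):
    """Split a complete proof into a given prefix and a hidden suffix:
    the model only has to fill in the last `missing_steps` lines."""
    if missing_steps >= len(proof_lines):
        return [], proof_lines  # hardest case: prove everything from scratch
    cut = len(proof_lines) - missing_steps
    return proof_lines[:cut], proof_lines[cut:]

# start easy (one missing step), then ramp up the difficulty over training
proof = ["P (premise)", "Q (premise)", "P and Q (and-intro 1,2)"]
prefix, target = curriculum_example(proof, 1)
```

The prefix goes into the prompt alongside the premises and conclusion; as training progresses, `missing_steps` grows until the model writes the whole proof unaided.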
In the same vein, our own custom proof language is definitely not enough to get interesting results: we could improve on it, but getting to something actually usable would require an astounding amount of work. For these reasons, you are better off migrating to a purpose-built language, such as Lean: importantly, now that you know about proofs-as-formalized-reasoning, the same mental model carries over to a language that is (way) more expressive. Moreover, Lean has pretty much the same style for writing down proofs, i.e. rules for introducing and eliminating logical operators.
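To make that last point concrete, here is a sketch of the De Morgan proof from the success example, written in Lean 4: `Or.elim` plays the role of our or-elimination rule, and the anonymous function discharges the "assume A and B" subproof.

```lean
-- from "not A or not B", derive "not (A and B)"
example (A B : Prop) (h : ¬A ∨ ¬B) : ¬(A ∧ B) :=
  fun hab => h.elim (fun hna => hna hab.left) (fun hnb => hnb hab.right)
```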
In other words, once we nail the math behind Vibe Proving and build an initial RL harness, what is left is good ol’ engineering.
Acknowledgements
Thanks to Patrick John Chia, Federico Bianchi, Ethan Rosenthal, Ryan Vilim, Davis Treybig for precious feedback over previous versions of this draft.
If you like the intersection of genAI, reasoning about distributed systems and verification, you can also check out our research at Bauplan.
AI coding assistants were used to write the companion repository, but no assistant was used to write the text (if not for proof-reading and typo correction).

