How did we go from “stochastic parrots” to AI models winning math contests? While there is certainly doubt that LLMs are the PhD-level thinkers they are advertised to be, the progress on complex reasoning tasks is undeniable.
A popular trick has been to mix and match LLM generative capabilities with formal verifiers, i.e. purpose-built software that provides guaranteed solutions to certain problems, when stated precisely. The key insight is that LLMs can be good at translating messy, ambiguous requirements into precise formal specifications, while formal verifiers excel at finding solutions that satisfy those specifications. By combining them, we get a system that can understand what you want and guarantee it delivers exactly that: AWS has recently been using this very trick to build “guardrails” for real-time chats.
How does this work in practice? Unfortunately, the explanation of these basic dynamics often happens inside larger, complex contexts, like reinforcement learning or mathematical proofs. Today, we’ll showcase this hybrid approach using Alloy, a lightweight language that is easy to read, even for beginners. Instead of the usual math-heavy papers and hard-to-grasp benchmarks, we are going to solve a far more relatable challenge, inspired by a weekly crossword publication:
We have 5 cars (1-5) parked in front of 5 girls (A-E), and 5 names (Laura, Giovanna, Bianca, Franca, Marta); we don’t know which car was parked by which girl, but the girls each say something about the situation. Our task is to answer this deceptively simple question: which girl is named Marta, and what is her car?
While more beach-level than PhD-level thinking, the puzzle sits at a sweet spot of complexity. It can provide a primer on LLMs and formal methods that is not polluted by other themes and does not require extensive domain knowledge: we keep all the essential ingredients of real-world problems, but simplify the setup.
Prompts, screenshots, and Alloy code are available in this open source repo (all tests were done in the week of August 2025; the main reasoning loop was done with Opus 4.1 on Claude Desktop).
AIs and humans struggle on their own
A fun fact about our puzzle is that, although it requires only “beach-level thinking”, top models are not obviously good at it. Uploading the original picture and prompting Opus 4.1 for an answer, the model incorrectly assumed C is wearing pants: how can we then trust its conclusion – that Marta is Girl A, and her car is number 5?
Things get interesting when we try to compare models. We abstract away the puzzle in a textual description, but LLMs still cannot find consensus: DeepSeek 4.1’s answer (A and 2) is different from the one given by Opus; Opus’s own answer with textual prompting (A and 2) is different from the Opus answer above, and ChatGPT5 has yet another opinion (A and 5).
This is what makes the puzzle a great motivating example: humans struggle with this combinatorial reasoning (homework question: how long did it take you to solve it?), but it is unclear how much better frontier models are. How do we build confidence in any of the answers above? How can we reason, instead of entirely delegating the process?
Reasoning as “eliminating possibilities”
Complex reasoning challenges can often be solved following the advice of that famous detective: ‘When you have eliminated the impossible, whatever remains, however improbable, must be the truth’. Instead of trying to solve the problem all at once, we can think of our puzzle as the combination of three main things:
- An initial situation, randomly mapping girls to cars and labels.
- A set of constraints, in the form of statements by the very same girls: these statements will make some mappings impossible.
- A final situation, in which girls are re-mapped to names and cars.
Our initial knowledge is compatible with this reality:

But also with this one (and many more):

We can imagine that every time we add a girl’s statement, we eliminate some arrangements from possibly being the final one. In other words, we increase our knowledge about the situation as we progressively restrict the set of feasible solutions (this basic insight is the same one underlying epistemic logic and information theory). In fact, the very first statement, “Girl A states that Laura is not next to her, and A’s car is now in front of Bianca”, rules out our first scenario, because Laura is next to Girl A there.
Enumerating scenarios is a tedious and error-prone task, even for LLMs. The magic of Alloy is its declarative nature. Instead of writing down the reasoning code ourselves, we state what we know (premises in a traditional proof, statements in this case) and what we want to verify (a theorem, Marta’s car), and let Alloy do the rest: exploring an enormous conceptual space is done by tried and tested methods, so that we can focus on the faithful translation of the puzzle and (important!) the interpretation of the instances Alloy finds.
The division of labor should now be clear: instead of an LLM (or us) directly solving the problem, we translate the English requirements into Alloy code with Claude, then use Alloy to generate solutions, and finally we, as humans, check them.
From LLM to Alloy and back: the reasoning loop
Our prompting strategy is now more subtle. We no longer ask Claude for a direct solution; instead, our prompt guides it to generate Alloy code based on our initial scenario. Instead of “one-shotting” the solution, we are now in a virtuous loop, generating increasingly complex code and verifying that we are getting closer based on the Alloy output:

The result is our starting code, which contains the main ingredients but no constraints yet. It is easy to scroll through the definitions now that the tedious translation has been done: Girl, Car, and Name are our main “signatures” (i.e. sets of objects), and the initial position for Girls A-E is the mapping to Cars 1-5. We don’t yet know who owns what, except that nobody owns the car in front of them now:
// No girl is initially standing in front of her own car
// Girl A (position 1) doesn't own Car1, B doesn't own Car2, etc.
A.owns != Car1
B.owns != Car2
C.owns != Car3
D.owns != Car4
E.owns != Car5
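For completeness, the declarations behind these constraints can be sketched as follows (a minimal sketch consistent with the snippets in this post – fields like `owns` and `name` do appear in the generated code, but the repo’s full model may differ in details):

abstract sig Name {}
one sig Laura, Giovanna, Bianca, Franca, Marta extends Name {}

abstract sig Car {}
one sig Car1, Car2, Car3, Car4, Car5 extends Car {}

abstract sig Girl {
  owns: one Car,  // the car this girl owns
  name: one Name  // this girl's name
}
one sig A, B, C, D, E extends Girl {}

// Ownership and names are one-to-one: no two girls share a car or a name
fact Bijections {
  all disj g1, g2: Girl | g1.owns != g2.owns and g1.name != g2.name
}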
We pause here to highlight two great Alloy features: first, the code maps clearly to logical statements, quite like the ones found in mathematical proofs and informal reasoning – even if you have never seen Alloy’s syntax before, the statements should be obvious (code comments are your friend!). Second, the built-in UI is helpful to visualise our progress, as it depicts an instance chosen among all the possible realities that satisfy the constraints: for example, this is a possible assignment (Giovanna is C):

Executing it again, we could get another one, and then another one: as our knowledge is limited at this stage, many arrangements are all possible: it’s time to start eliminating some!
Let’s ask Claude to modify our initial code and add the statement from Girl A. The beauty of this loop is that we can also encode “sanity checks” based on incomplete but sound reasoning. Not only LLMs, but also human intelligence benefits from this kind of “progressive enhancement”: being able to incorporate “local” constraints both unit-tests the Alloy model and engages us directly with the puzzle.
Let’s now add the statement by Girl A as a constraint, plus a check to confirm that the following mapping is not allowed anymore: Franca (A, 1), Laura (B, 2). If we now run the code, no counterexample is found, proving we successfully excluded the undesired configuration:
pred InvalidConfiguration {
  // Girl A is named Franca and owns Car1
  A.name = Franca
  A.owns = Car1
  // Girl B is named Laura and owns Car2
  B.name = Laura
  B.owns = Car2
}
check { not InvalidConfiguration } for 5 Int
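To make the loop concrete, Girl A’s statement itself could be encoded along these lines (a hypothetical sketch building on the declarations above: the helpers `neighbors` and `carInFront` are our own illustration, not necessarily the repo’s encoding):

fun neighbors: Girl -> Girl {
  // girls A-E stand in a row at positions 1-5, so adjacency is explicit
  A->B + B->A + B->C + C->B + C->D + D->C + D->E + E->D
}

fun carInFront: Girl -> Car {
  // which car is parked in front of each girl's position
  A->Car1 + B->Car2 + C->Car3 + D->Car4 + E->Car5
}

fact GirlAStatement {
  // "Laura is not next to me": no neighbor of A is named Laura
  no g: A.neighbors | g.name = Laura
  // "my car is in front of Bianca": A owns the car parked
  // in front of the girl whose name is Bianca
  A.owns = carInFront[name.Bianca]
}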
Now that we know the trick, our AI assistant can generate the script with all the statements by the girls. When we run it, this is the instance that we get:

Thanks to a few iterations and interpretable, provably correct reasoning, we can now establish that ChatGPT5 got this right: Marta is Girl A in Car 5, and the mapping provided by ChatGPT is correct (you can verify it yourself by comparing the chat result with the instance above – incidentally, this also proves another interesting fact: not just Marta’s mapping, but the other girls’ mappings are uniquely determined as well).
Reasoning out of the box
A great side-product of having independently computable representations of the concepts at hand is that we can now explore the underlying mechanics of the puzzle in the space of Alloy, instead of relying entirely on opaque mappings in latent space.
For example, we can easily verify that the solution is unique: in the Alloy UI, if you try to get a new instance, a warning says that no other instance is available. But we could also explore outside the current boundaries and remove all the Clothing information: does the solution change? (Try to answer before running it!) It turns out the correct solution is still a valid instance (homework question: why must this be the case?), but this time the UI can indeed produce multiple valid instances: as expected, fewer constraints, (likely) more solutions.
A symbolic space that we can easily manipulate is also great for checking the work of AI, which should never be taken at face value. A first case in point is checking Opus’ solution from the beginning, obtained by parsing the image incorrectly. We can easily change Girl C’s clothing (i.e. `C.wears = Trousers`) and try again: since there is no solution, the (sad) conclusion is that Opus’ original reasoning was incorrect – it was “right” but for the “wrong” reasons, so to speak.
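In the model, the tweak is a one-line change (a sketch assuming a `wears` field, as in the snippet above, and a `Skirt` atom for the original outfit):

// What Opus "saw" in the picture: Girl C wearing trousers.
// With this fact added, the analyzer finds no satisfying instance,
// so the premises Opus reasoned from admit no solution at all.
fact OpusMisread {
  C.wears = Trousers // the original puzzle has C in a skirt
}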
A second example comes from what Claude added to check for uniqueness (i.e.: Marta is A and 5 in all valid configurations). In theory, that’s a nice addition, but in practice this check doesn’t do the job:
assert MartaUniqueSolution {
  all g1, g2: Girl |
    (g1.name = Marta and g2.name = Marta) implies
      (g1 = g2) // Marta is always at the same position
}
The mismatch is clear, and easy to spot thanks to Alloy’s clear syntax: “in all valid configurations” is a quantifier over all instances (in the “meta-language”, so to speak), while “all g1…” quantifies over girls within an instance.
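A check that does capture the intent quantifies over instances, which is exactly what an assert/check pair does in Alloy (a sketch, assuming the girls’ statements are encoded as facts of the model):

// If every instance satisfying the girls' statements maps Marta to
// Girl A and Car5, the assertion holds; a counterexample would be a
// valid configuration where Marta is someone else or owns another car
assert MartaUniqueAcrossInstances {
  A.name = Marta and A.owns = Car5
}
check MartaUniqueAcrossInstances for 5 Int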
See you, space cowboys
Similarly to cutting-edge systems like AlphaGeometry, we solved a deductive problem (effectively, a proof) by reasoning with Claude, instead of delegating the process entirely.
The LLM does the mapping between English and a formal language: Alloy is easy to read, but sometimes tedious to write, so the code generation capabilities of Claude come in handy. Humans, on the other hand, can focus on checking that the formal setup is correct (checking is often easier than doing in the first place!). Both Claude and humans then delegate combinatorial reasoning to a powerful, verified solver for the actual deduction.
While our beach-level proof seems unimportant, and the copy-paste from Claude gets tedious quickly, this simple example is a glimpse of the power of formal methods when combined with code generation and some (human or agentic) supervision. Real-world systems use more expressive languages, run tighter, self-improving loops, and target less frivolous proofs, but many of the intuitions from today carry over to them.
Of course, solving beach-or-PhD logic puzzles is not the only use case for hybrid systems such as this one. Languages like Alloy are very popular for modelling software systems, and as such, they open the door to a future in which distributed systems can be cheaply designed and verified at scale before any implementation work even begins. As very practical examples, AWS famously invests in verifying their cloud products, and Bauplan provides an Alloy model of their own data catalog primitives.
Taking a very different path than what many could have predicted even just 50 years ago, it seems, day by day, that we are finally getting closer to Leibniz’s dream:
Acknowledgments
Thanks to Federico Bianchi, Aldrin Montana, and Patrick John Chia for preliminary feedback on a previous draft of this article. No LLM was used or harmed to write the English parts of this blog post.
If you care about verification, simulations, and AI in system and infrastructure design, you’ll love working at Bauplan: we’re hiring!