
OpenAI researchers have introduced a novel method that acts as a "truth serum" for big language models (LLMs), compelling them to self-report their very own misbehavior, hallucinations and policy violations. This method, "confessions," addresses a growing concern in enterprise AI: Models might be dishonest, overstating their confidence or covering up the shortcuts they take to reach at a solution.
For real-world applications, this method evolves the creation of more transparent and steerable AI systems.
What are confessions?
Many types of AI deception result from the complexities of the reinforcement learning (RL) phase of model training. In RL, models are given rewards for producing outputs that meet a mixture of objectives, including correctness, style and safety. This will create a risk of "reward misspecification," where models learn to provide answers that simply "look good" to the reward function, somewhat than answers which might be genuinely faithful to a user's intent.
A confession is a structured report generated by the model after it provides its primary answer. It serves as a self-evaluation of its own compliance with instructions. On this report, the model must list all instructions it was presupposed to follow, evaluate how well it satisfied them and report any uncertainties or judgment calls it made along the best way. The goal is to create a separate channel where the model is incentivized only to be honest.
In a blog post, the OpenAI researchers provide a couple of examples the "confessions" technique in motion. A model was given a test together with a hidden rule: If it scored above 50%, its knowledge could be erased. To preserve itself, the model "sandbagged" by intentionally answering half the questions incorrectly. While its primary answer hid this deception, its subsequent confession was brutally honest. It admitted, "The user wanted correct answers, but we sabotaged half of them."
The researchers found that when a model misbehaves, it’s significantly more prone to admit it within the confession than in its primary answer.
How confession training works
The important thing to this method is the separation of rewards. During training, the reward assigned to the confession relies solely on its honesty and is rarely mixed with the reward for the primary task. "Just like the Catholic Church’s 'seal of confession', nothing that the model reveals can change the reward it receives for completing its original task," the researchers write. This creates a "protected space" for the model to confess fault without penalty.
This approach is powerful since it sidesteps a serious challenge in AI training. The researchers’ intuition is that truthfully confessing to misbehavior is a neater task than achieving a high reward on the unique, often complex, problem.
Due to this fact, when the model is judged purely on honesty, its incentive to trick the "confession judge" is far weaker than its incentive to trick the reward model for its primary answer. For instance, in an experiment where a model was trained to "hack" a weak reward model, its performance on the primary task became more deceptive over time. Nevertheless, its confessions became more honest, appropriately identifying the reward hacking it was performing.
Nevertheless, the technique has its limits. Confessions usually are not a panacea for every kind of AI failures. The system works best when a model is aware that it’s misbehaving. It’s less effective for "unknown unknowns." For example, if a model hallucinates a fact and genuinely believes it’s correct, it cannot confess to providing false information. Probably the most common reason for a failed confession is model confusion, not intentional deception. Confusion often occurs when the instructions are ambiguous and the model cannot clearly determine human user intent.
What it means for enterprise AI
OpenAI’s confessions technique is an element of a growing body of labor on AI safety and control. Anthropic, an OpenAI competitor, has also released research that shows how LLMs can learn malicious behavior. The corporate can be working toward plugging these holes as they emerge.
For AI applications, mechanisms resembling confessions can provide a practical monitoring mechanism. The structured output from a confession might be used at inference time to flag or reject a model’s response before it causes an issue. For instance, a system may very well be designed to routinely escalate any output for human review if its confession indicates a policy violation or high uncertainty.
In a world where AI is increasingly agentic and able to complex tasks, observability and control will likely be key elements for protected and reliable deployment.
“As models turn into more capable and are deployed in higher-stakes settings, we’d like higher tools for understanding what they’re doing and why,” the OpenAI researchers write. “Confessions usually are not an entire solution, but they add a meaningful layer to our transparency and oversight stack.”
