The rise of AI agents has taken the world by storm. Agents can interact with the world around them, write articles (not this one though), take actions on your behalf, and often make the difficult parts of automating a task easy and approachable.
Agents take aim at the most difficult parts of processes and churn through problems quickly. Sometimes too quickly: if your agentic process requires a human in the loop to decide on the outcome, the human review stage can become the bottleneck of the process.
Consider an example agentic process that handles customer phone calls and categorizes them. Even a 99.95% accurate agent will make 5 mistakes while listening to 10,000 calls. Despite knowing this, the agent can't tell you which 5 of the 10,000 calls are miscategorized.
LLM-as-a-Judge is a technique where you feed each input to another LLM process and have it judge whether the output produced from that input is correct. However, because this is yet another LLM process, it can also be inaccurate. These two probabilistic processes create a confusion matrix of true positives, false positives, false negatives, and true negatives.
In other words, an input correctly categorized by an LLM process can be judged as incorrect by its judge LLM, or vice versa.
Because of this "known unknown", for a sensitive workload a human still needs to review and understand all 10,000 calls. We're right back to the same bottleneck problem.
How can we build more statistical certainty into our agentic processes? In this post, I build a system that lets us be more certain about our agentic processes, generalize it to an arbitrary number of agents, and develop a cost function to help steer future investment in the system. The code I use in this post is available in my repository, ai-decision-circuits.
AI Decision Circuits
Error detection and correction are not new concepts. Error correction is critical in fields like digital and analog electronics. Even developments in quantum computing depend on expanding the capabilities of error detection and correction. We can take inspiration from these systems and implement something similar with AI agents.

In Boolean logic, NAND gates are the holy grail of computation because they can perform any operation. They are functionally complete, meaning any logical operation can be constructed using only NAND gates. This principle can be applied to AI systems to create robust decision-making architectures with built-in error correction.
From Electronic Circuits to AI Decision Circuits
Just as electronic circuits use redundancy and validation to ensure reliable computation, AI decision circuits can employ multiple agents with different perspectives to arrive at more accurate outcomes. These circuits can be constructed using principles from information theory and Boolean logic:
- Redundant Processing: Multiple AI agents process the same input independently, similar to how modern CPUs use redundant circuits to detect hardware errors.
- Consensus Mechanisms: Decision outputs are combined using voting systems or weighted averages, analogous to majority logic gates in fault-tolerant electronics.
- Validator Agents: Specialized AI validators check the plausibility of outputs, functioning similarly to error-detecting codes like parity bits or CRC checks.
- Human-in-the-Loop Integration: Strategic human validation at key points in the decision process, similar to how critical systems use human oversight as the final verification layer.
Mathematical Foundations for AI Decision Circuits
The reliability of these systems can be quantified using probability theory.
For a single agent, the probability of failure comes from observed accuracy over time on a test dataset, stored in a system like LangSmith.

For a 90% accurate agent, the probability of failure is p_1 = 1 - 0.9 = 0.1, or 10%.

The probability of two independent agents failing on the same input is the product of their individual failure probabilities:

p_1 × p_2 = 0.1 × 0.1 = 0.01

If we have N executions with those agents, the expected total count of joint failures is

N × p_1 × p_2

So for 10,000 executions between two independent agents, each with 90% accuracy, the expected number of failures is 10,000 × 0.1 × 0.1 = 100.

However, we still don't know which of those 10,000 phone calls are the actual 100 failures.
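As a quick sanity check, here is a minimal sketch of that arithmetic in Python, using the example values above (the helper name is mine):

def expected_joint_failures(n_executions: int, accuracies: list) -> float:
    """Multiply each agent's failure probability together, then scale by the execution count."""
    joint_failure_prob = 1.0
    for accuracy in accuracies:
        joint_failure_prob *= (1 - accuracy)
    return n_executions * joint_failure_prob

print(expected_joint_failures(10_000, [0.9]))       # 1000.0 expected failures for a single agent
print(expected_joint_failures(10_000, [0.9, 0.9]))  # 100.0 expected joint failures for two agents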
We can combine four extensions of this concept to build a more robust solution that provides confidence in any given response:
- A primary categorizer (simple accuracy, as above)
- A backup categorizer (simple accuracy, as above)
- A schema validator (v = 0.7 accuracy, for example)
- And finally, a negative checker (n = 0.6 accuracy, for example)
To put this into code (full repository), we can use simple Python:
def primary_parser(self, customer_input: str) -> Dict[str, str]:
    """
    Primary parser: Direct command with format expectations.
    """
    prompt = f"""
    Extract the category of the customer service call from the following text as a JSON object with key 'call_type'.
    The call type must be one of: {', '.join(self.call_types)}.
    If the category cannot be determined, return {{'call_type': null}}.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    try:
        # Try to parse the response as JSON
        result = json.loads(response.content.strip())
        return result
    except json.JSONDecodeError:
        # If JSON parsing fails, try to extract the call type from the response text
        for call_type in self.call_types:
            if call_type in response.content:
                return {"call_type": call_type}
        return {"call_type": None}
def backup_parser(self, customer_input: str) -> Dict[str, str]:
    """
    Backup parser: Chain-of-thought approach with formatting instructions.
    """
    prompt = f"""
    First, identify the major issue or concern in the customer's message.
    Then, match it to one of the following categories: {', '.join(self.call_types)}.
    Think through each category and determine which one best fits the customer's issue.
    Return your answer as a JSON object with key 'call_type'.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    try:
        # Try to parse the response as JSON
        result = json.loads(response.content.strip())
        return result
    except json.JSONDecodeError:
        # If JSON parsing fails, try to extract the call type from the response text
        for call_type in self.call_types:
            if call_type in response.content:
                return {"call_type": call_type}
        return {"call_type": None}
def negative_checker(self, customer_input: str) -> str:
    """
    Negative checker: Determines if the text contains enough information to categorize.
    """
    prompt = f"""
    Does this customer service call contain enough information to categorize it into one of these types:
    {', '.join(self.call_types)}?
    Answer only 'yes' or 'no'.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    answer = response.content.strip().lower()
    if "yes" in answer:
        return "yes"
    elif "no" in answer:
        return "no"
    else:
        # Default to yes if the answer is unclear
        return "yes"
@staticmethod
def validate_call_type(parsed_output: Dict[str, Any]) -> bool:
    """
    Schema validator: Checks if the output matches the expected schema.
    """
    # Check if the output matches the expected schema
    if not isinstance(parsed_output, dict) or 'call_type' not in parsed_output:
        return False
    # Verify the extracted call type is in our list of known types, or null
    call_type = parsed_output['call_type']
    return call_type is None or call_type in CALL_TYPES
By combining these with simple Boolean logic, we can get similar accuracy along with a confidence level for each answer:
def combine_results(
    primary_result: Dict[str, str],
    backup_result: Dict[str, str],
    negative_check: str,
    validation_result: bool,
    customer_input: str
) -> Dict[str, str]:
    """
    Combiner: Combines the results from the different strategies.
    """
    # If validation failed, use the backup
    if not validation_result:
        if RobustCallClassifier.validate_call_type(backup_result):
            return backup_result
        else:
            return {"call_type": None, "confidence": "low", "needs_human": True}
    # If the negative check says no call type can be determined but we extracted one, double-check
    if negative_check == 'no' and primary_result['call_type'] is not None:
        if backup_result['call_type'] is None:
            return {'call_type': None, "confidence": "low", "needs_human": True}
        elif backup_result['call_type'] == primary_result['call_type']:
            # Both agree despite the negative check, so go with it but mark medium confidence
            return {'call_type': primary_result['call_type'], "confidence": "medium"}
        else:
            return {"call_type": None, "confidence": "low", "needs_human": True}
    # If primary and backup agree, high confidence
    if primary_result['call_type'] == backup_result['call_type'] and primary_result['call_type'] is not None:
        return {'call_type': primary_result['call_type'], "confidence": "high"}
    # Default: use the primary result with medium confidence
    if primary_result['call_type'] is not None:
        return {'call_type': primary_result['call_type'], "confidence": "medium"}
    else:
        return {'call_type': None, "confidence": "low", "needs_human": True}
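For orientation, here is a rough sketch of how these pieces might be wired together into a single call; the classify_with_confidence name is mine, and the self.call_types and self.model attributes are assumed to live on the RobustCallClassifier class shown in the full repository:

def classify_with_confidence(self, customer_input: str) -> Dict[str, str]:
    """Run the redundant parsers and validators, then combine their outputs."""
    primary_result = self.primary_parser(customer_input)
    backup_result = self.backup_parser(customer_input)
    negative_check = self.negative_checker(customer_input)
    validation_result = RobustCallClassifier.validate_call_type(primary_result)
    return combine_results(
        primary_result,
        backup_result,
        negative_check,
        validation_result,
        customer_input,
    )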
The Decision Logic, Step by Step
Step 1: When Quality Control Fails
if not validation_result:
This is saying: "If our quality control expert (the validator) rejects the primary analysis, don't trust it." The system then tries to use the backup opinion instead. If that also fails validation, it flags the case for human review.
In everyday terms: "If something seems off about our first answer, let's try our backup method. If that also seems suspect, let's get a human involved."
Step 2: Handling Contradictions
if negative_check == 'no' and primary_result['call_type'] is not None:
This checks for a specific kind of contradiction: "Our negative checker says there shouldn't be a call type, but our primary analyzer found one anyway."
In such cases, the system looks to the backup analyzer to break the tie:
- If the backup agrees there's no call type → send to a human
- If the backup agrees with the primary → accept, but with medium confidence
- If the backup has a different call type → send to a human
This is like saying: "If one expert says 'this isn't classifiable' but another says it is, we need a tiebreaker or human judgment."
Step 3: When Experts Agree
if primary_result['call_type'] == backup_result['call_type'] and primary_result['call_type'] is not None:
When both the primary and backup analyzers independently reach the same conclusion, the system marks this with "high confidence". This is the best-case scenario.
In everyday terms: "If two different experts using different methods reach the same conclusion independently, we can be pretty confident they're right."
Step 4: Default Handling
If none of the special cases apply, the system defaults to the primary analyzer's result with "medium confidence." If even the primary analyzer couldn't determine a call type, it flags the case for human review.
Why This Approach Matters
This decision logic creates a robust system by:
- Reducing False Positives: The system only gives high confidence when multiple methods agree
- Catching Contradictions: When different parts of the system disagree, it either lowers confidence or escalates to humans
- Intelligent Escalation: Human reviewers only see cases that actually need their expertise
- Confidence Labeling: Results include how confident the system is, allowing downstream processes to treat high- and medium-confidence results differently
This approach mirrors how electronics use redundant circuits and voting mechanisms to prevent errors from causing system failures. In AI systems, this kind of thoughtful combination logic can dramatically reduce error rates while using human reviewers only where they add the most value.
Example
In 2015, the Philadelphia Water Department published counts of customer calls by category. Customer call comprehension is a very common process for agents to tackle. Instead of a human listening to each customer phone call, an agent can listen to the call much more quickly, extract the information, and categorize the call for further data analysis. For the water department, this is especially important because the faster critical issues are identified, the sooner those issues can be resolved.
We can construct an experiment. I used an LLM to generate fake transcripts of the phone calls in question by prompting: "Given the following category, generate a short transcript of that phone call":
{
"calls": [
{
"id": 5,
"type": "ABATEMENT",
"customer_input": "I need to report an abandoned property that has a major leak. Water is pouring out and flooding the sidewalk."
},
{
"id": 7,
"type": "AMR (METERING)",
"customer_input": "Can someone check my water meter? The digital display is completely blank and I can't read it."
},
{
"id": 15,
"type": "BTR/O (BAD TASTE & ODOR)",
"customer_input": "My tap water smells like rotten eggs. Is it safe to drink?"
}
]
}
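For completeness, here is a minimal sketch of how such transcripts could be generated; the generate_transcript helper and the exact prompt wording are illustrative assumptions, not the code I actually used:

from langchain_anthropic import ChatAnthropic

def generate_transcript(category: str) -> str:
    """Ask the model to fabricate a short customer call transcript for a given category."""
    model = ChatAnthropic(model='claude-3-7-sonnet-latest')
    prompt = f"Given the following category, generate a short transcript of that phone call: {category}"
    response = model.invoke(prompt)
    return response.content.strip()

# Example usage for one category from the water department data
print(generate_transcript("BTR/O (BAD TASTE & ODOR)"))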
Now, we can set up the experiment with a more traditional LLM-as-a-judge evaluation (full implementation here):
def classify(customer_input):
    CALL_TYPES = [
        "RESTORE", "ABATEMENT", "AMR (METERING)", "BILLING", "BPCS (BROKEN PIPE)", "BTR/O (BAD TASTE & ODOR)",
        "C/I - DEP (CAVE IN/DEPRESSION)", "CEMENT", "CHOKED DRAIN", "CLAIMS", "COMPOST"
    ]
    model = ChatAnthropic(model='claude-3-7-sonnet-latest')
    prompt = f"""
    You are a customer service AI for a water utility company. Classify the following customer input into one of these categories:
    {', '.join(CALL_TYPES)}
    Customer input: "{customer_input}"
    Respond with just the category name, nothing else.
    """
    # Get the response from Claude
    response = model.invoke(prompt)
    predicted_type = response.content.strip()
    return predicted_type
By passing just the transcript into the LLM, we can isolate knowledge of the actual category from the extracted category that is returned, and compare the two.
def compare(call_id, customer_input, actual_type):
    predicted_type = classify(customer_input)
    result = {
        "id": call_id,
        "customer_input": customer_input,
        "actual_type": actual_type,
        "predicted_type": predicted_type,
        "correct": actual_type == predicted_type
    }
    return result
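Here is a minimal sketch of the evaluation loop, assuming the fabricated calls are stored in a JSON file shaped like the sample above (the calls.json filename is illustrative):

import json

# Load the fabricated dataset: a list of calls with "id", "type", and "customer_input" fields
with open("calls.json") as f:
    calls = json.load(f)["calls"]

results = [compare(call["id"], call["customer_input"], call["type"]) for call in calls]

correct = sum(1 for r in results if r["correct"])
metrics = {"overall_accuracy": correct / len(results), "correct": correct, "total": len(results)}
print(json.dumps({"metrics": metrics}, indent=2))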
Running this against the entire fabricated data set with Claude 3.7 Sonnet (a state-of-the-art model, as of writing) is very performant, with 91% of calls being correctly categorized:
"metrics": {
"overall_accuracy": 0.91,
"correct": 91,
"total": 100
}
If these were real calls and we didn't have prior knowledge of the category, we'd still need to review all 100 phone calls to find the 9 incorrectly categorized calls.
By implementing our robust decision circuit above, we get similar accuracy along with confidence in those answers. In this case, 87% accuracy overall, but 92.5% accuracy in our high-confidence answers.
{
"metrics": {
"overall_accuracy": 0.87,
"correct": 87,
"total": 100
},
"confidence_metrics": {
"high": {
"count": 80,
"correct": 74,
"accuracy": 0.925
},
"medium": {
"count": 18,
"correct": 13,
"accuracy": 0.722
},
"low": {
"count": 2,
"correct": 0,
"accuracy": 0.0
}
}
}
We'd like 100% accuracy in our high-confidence answers, so there's still work to be done. What this approach lets us do is drill into why high-confidence answers were inaccurate. In this case, poor prompting and the simple validation capability don't catch all issues, resulting in classification errors. These capabilities can be improved iteratively to reach 100% accuracy in high-confidence answers.
Enhanced Filtering for High Confidence
The current system marks responses as "high confidence" when the primary and backup analyzers agree. To reach higher accuracy, we must be more selective about what qualifies as "high confidence":
# Modified high-confidence logic
if (primary_result['call_type'] == backup_result['call_type'] and
        primary_result['call_type'] is not None and
        validation_result and
        negative_check == 'yes' and
        additional_validation_metrics > threshold):
    return {'call_type': primary_result['call_type'], "confidence": "high"}
By adding more qualification criteria, we'll have fewer "high confidence" results, but they'll be more accurate.
Additional Validation Techniques
Other ideas include the following:
Tertiary Analyzer: Add a third independent analysis method
# Only mark high confidence if all three agree
if primary_result['call_type'] == backup_result['call_type'] == tertiary_result['call_type']:
Historical Pattern Matching: Compare against historically correct results (think vector search)
if similarity_to_known_correct_cases(primary_result) > 0.95:
Adversarial Testing: Apply small variations to the input and check whether the classification stays stable (a sketch of this idea follows the snippet below)
variations = generate_input_variations(customer_input)
if all(analyze_call_type(var) == primary_result['call_type'] for var in variations):
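To make the adversarial-testing idea concrete, here is a hedged sketch; generate_input_variations and its perturbations are hypothetical stand-ins, and analyze_call_type is whichever classifier you want to stress-test:

def generate_input_variations(customer_input: str) -> list:
    """Hypothetical perturbations: same meaning, slightly different surface form."""
    return [
        customer_input.lower(),
        customer_input.upper(),
        f"Hi, thanks for taking my call. {customer_input}",
        customer_input.replace(".", "!"),
    ]

def classification_is_stable(customer_input: str, analyze_call_type) -> bool:
    """Re-classify each variation and require the predicted call type to never change."""
    baseline = analyze_call_type(customer_input)
    return all(analyze_call_type(variation) == baseline
               for variation in generate_input_variations(customer_input))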
Generic Formula for Human Interventions in an LLM Extraction System
Full derivation available here.
- N = Total number of executions (10,000 in our example)
- p_1 = Primary parser accuracy (0.8 in our example)
- p_2 = Backup parser accuracy (0.8 in our example)
- v = Schema validator effectiveness (0.7 in our example)
- n = Negative checker effectiveness (0.6 in our example)
- H = Number of human interventions required
- E_final = Final undetected errors
- m = Number of independent validators (here the schema validator and the negative checker, so m = 2)

When both parsers fail on the same input, the expected number of parser-level failures is

N × (1 - p_1) × (1 - p_2)

Of those failures, the ones caught by at least one validator are escalated to a human, and the rest slip through undetected:

H = N × (1 - p_1)(1 - p_2) × [1 - (1 - v)(1 - n)]

E_final = N × (1 - p_1)(1 - p_2) × (1 - v)(1 - n)

With m independent validators in general, the (1 - v)(1 - n) term becomes the product of (1 - v_i) across all validators.
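Plugging the example numbers into these formulas, a minimal sketch (the function name is mine, not from the repository):

def expected_interventions_and_errors(N, p1, p2, v, n):
    """Expected human interventions (H) and undetected errors (E_final) under the model above."""
    both_parsers_fail = N * (1 - p1) * (1 - p2)      # cases neither parser gets right
    slip_probability = (1 - v) * (1 - n)             # probability both checks miss a failure
    H = both_parsers_fail * (1 - slip_probability)   # caught failures, escalated to a human
    E_final = both_parsers_fail * slip_probability   # failures nobody catches
    return H, E_final

H, E_final = expected_interventions_and_errors(N=10_000, p1=0.8, p2=0.8, v=0.7, n=0.6)
print(H, E_final, H / 10_000)   # 352.0 human interventions, 48.0 undetected errors, 0.0352 H_rate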
Optimized System Design
The formula reveals key insights:
- Adding parsers has diminishing returns but always improves accuracy
- The system accuracy is bounded by 1 - (1 - p_1)(1 - p_2)(1 - v)(1 - n), which for our example is 1 - 0.0048 = 0.9952
- Human interventions scale linearly with the total number of executions N
For our example:

H = 10,000 × 0.2 × 0.2 × [1 - 0.3 × 0.4] = 10,000 × 0.04 × 0.88 = 352, so H_rate = H / N = 3.52%, and E_final = 10,000 × 0.04 × 0.12 = 48.
We can use this calculated H_rate to track the efficacy of our solution in real time. If our human intervention rate starts creeping above 3.5%, we know that the system is breaking down. If our human intervention rate is steadily decreasing below 3.5%, we know our improvements are working as expected.
Cost Function
We can also establish a cost function to help us tune the system:

C_total = c_p × m + c_h × H + c_e × E_final
where:
- c_p = Cost per parser run ($0.10 in our example)
- m = Number of parser executions (2 × N in our example)
- H = Number of cases requiring human intervention (352 from our example)
- c_h = Cost per human intervention ($200, for example: 4 hours at $50/hour)
- c_e = Cost per undetected error ($1,000, for example)

For our example, C_total = 0.10 × 20,000 + 200 × 352 + 1,000 × 48 = $2,000 + $70,400 + $48,000 = $120,400.
By breaking cost down into the cost of human intervention and the cost of undetected errors, we can tune the system overall. In this example, if the cost of human intervention ($70,400) is too high, we can focus on increasing high-confidence results. If the cost of undetected errors ($48,000) is too high, we can introduce more parsers to lower undetected error rates.
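Here is a minimal sketch of that cost function with the example values plugged in (the function name is mine):

def total_cost(N, parser_runs_per_call, H, E_final, c_p=0.10, c_h=200.0, c_e=1000.0):
    """C_total = parser cost + human intervention cost + undetected error cost."""
    m = parser_runs_per_call * N          # parser executions (2 × N in our example)
    return c_p * m + c_h * H + c_e * E_final

# 352 human interventions and 48 undetected errors from the formulas above
print(total_cost(N=10_000, parser_runs_per_call=2, H=352, E_final=48))
# 2,000 (parsers) + 70,400 (humans) + 48,000 (errors) = 120400.0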
Of course, cost functions are most useful as ways to explore how to optimize the situations they describe.
From our scenario above, to decrease the number of undetected errors, E_final, by 50%, where
- p_1 and p_2 = 0.8,
- v = 0.7, and
- n = 0.6,
we have three options (a quick sketch comparing them follows the list):
- Add a new parser with 50% accuracy and include it as a tertiary analyzer. Note this comes with a trade-off: your cost to run more parsers increases, along with an increase in human intervention cost.
- Improve the two existing parsers by 10% each. That may or may not be possible given the difficulty of the task these parsers are performing.
- Improve the validator process by 15%. Again, this increases cost via human intervention.
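A quick sketch checking two of these options against the E_final formula; option 2's exact effect depends on how the 10% improvement is measured, so it is only noted in a comment:

def undetected_errors(N, parser_accuracies, validator_effectiveness):
    """E_final = N × product of parser failure rates × product of validator miss rates."""
    errors = N
    for p in parser_accuracies:
        errors *= (1 - p)
    for v in validator_effectiveness:
        errors *= (1 - v)
    return errors

baseline = undetected_errors(10_000, [0.8, 0.8], [0.7, 0.6])        # 48.0
option_1 = undetected_errors(10_000, [0.8, 0.8, 0.5], [0.7, 0.6])   # 24.0: add a 50%-accurate tertiary parser
option_3 = undetected_errors(10_000, [0.8, 0.8], [0.85, 0.6])       # 24.0: raise the validator from 0.7 to 0.85
# Option 2 (improving both parsers) shrinks the (1 - p1)(1 - p2) term quadratically,
# so even modest per-parser gains cut E_final substantially.
print(baseline, option_1, option_3)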
The Future of AI Reliability: Building Trust Through Precision
As AI systems become increasingly integrated into critical aspects of business and society, the pursuit of perfect accuracy will become a requirement, especially in sensitive applications. By adopting these circuit-inspired approaches to AI decision-making, we can build systems that not only scale efficiently but also earn the deep trust that comes only from consistent, reliable performance. The future belongs not to the most powerful single models, but to thoughtfully designed systems that combine multiple perspectives with strategic human oversight.
Just as digital electronics evolved from unreliable components into computers we trust with our most important data, AI systems are now on a similar journey. The frameworks described in this article represent early blueprints for what will ultimately become the standard architecture for mission-critical AI: systems that don't just promise reliability, but mathematically guarantee it. The question is no longer whether we can build AI systems with near-perfect accuracy, but how quickly we can implement these principles across our most important applications.
