The rise of AI agents has taken the world by storm. Agents can interact with the world around them, write articles (not this one though), take actions on your behalf, and often make the difficult parts of automating a task easy and approachable.
Agents take aim at the most difficult parts of processes and churn through problems quickly. Sometimes too quickly: if your agentic process requires a human in the loop to decide on the outcome, the human review stage can become the bottleneck of the process.
Consider an example agentic process that handles customer phone calls and categorizes them. Even a 99.95% accurate agent will make 5 mistakes while listening to 10,000 calls. Despite knowing this, the agent can't tell you which 5 of the 10,000 calls are miscategorized.
LLM-as-a-Judge is a technique where you feed each input to another LLM process and have it judge whether the output produced from that input is correct. However, because this is yet another LLM process, it can also be inaccurate. These two probabilistic processes create a confusion matrix of true positives, false positives, false negatives, and true negatives.
In other words, an input correctly categorized by an LLM process can be judged as incorrect by its judge LLM, or vice versa.
Because of this "known unknown", for a sensitive workload a human still needs to review and understand all 10,000 calls. We're right back to the same bottleneck problem.
How can we build more statistical certainty into our agentic processes? In this post, I build a system that lets us be more certain about our agentic processes, generalize it to an arbitrary number of agents, and develop a cost function to help steer future investment in the system. The code I use in this post is available in my repository, ai-decision-circuits.
AI Decision Circuits
Error detection and correction are not new concepts. Error correction is critical in fields like digital and analog electronics. Even developments in quantum computing depend on expanding the capabilities of error detection and correction. We can take inspiration from these systems and implement something similar with AI agents.

In Boolean logic, NAND gates are the holy grail of computation because they can perform any operation. They are functionally complete, meaning any logical operation can be constructed using only NAND gates. This principle can be applied to AI systems to create robust decision-making architectures with built-in error correction.
From Electronic Circuits to AI Decision Circuits
Just as electronic circuits use redundancy and validation to ensure reliable computation, AI decision circuits can employ multiple agents with different perspectives to arrive at more accurate outcomes. These circuits can be constructed using principles from information theory and Boolean logic:
- Redundant Processing: Multiple AI agents process the same input independently, similar to how modern CPUs use redundant circuits to detect hardware errors.
- Consensus Mechanisms: Decision outputs are combined using voting systems or weighted averages, analogous to majority logic gates in fault-tolerant electronics.
- Validator Agents: Specialized AI validators check the plausibility of outputs, functioning similarly to error-detecting codes like parity bits or CRC checks.
- Human-in-the-Loop Integration: Strategic human validation at key points in the decision process, similar to how critical systems use human oversight as the final verification layer.
Mathematical Foundations for AI Decision Circuits
The reliability of these systems can be quantified using probability theory.
For a single agent, the probability of failure comes from observed accuracy over time on a test dataset, stored in a system like LangSmith.

For a 90% accurate agent, the probability of failure is p_1 = 1 - 0.9 = 0.1, or 10%.

The probability of two independent agents failing on the same input is the product of their individual failure probabilities:

p_1 × p_2 = 0.1 × 0.1 = 0.01

If we have N executions with those agents, the expected total count of joint failures is

N × p_1 × p_2

So for 10,000 executions between two independent agents, each with 90% accuracy, the expected number of failures is 10,000 × 0.1 × 0.1 = 100.

However, we still don't know which of those 10,000 phone calls are the actual 100 failures.
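As a quick sanity check, here is a minimal sketch of that arithmetic in Python, using the example values above (the helper name is mine):

def expected_joint_failures(n_executions: int, accuracies: list) -> float:
    """Multiply each agent's failure probability together, then scale by the execution count."""
    joint_failure_prob = 1.0
    for accuracy in accuracies:
        joint_failure_prob *= (1 - accuracy)
    return n_executions * joint_failure_prob

print(expected_joint_failures(10_000, [0.9]))       # 1000.0 expected failures for a single agent
print(expected_joint_failures(10_000, [0.9, 0.9]))  # 100.0 expected joint failures for two agents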
We can combine four extensions of this concept to build a more robust solution that provides confidence in any given response:
- A primary categorizer (simple accuracy, as above)
- A backup categorizer (simple accuracy, as above)
- A schema validator (v = 0.7 accuracy, for example)
- And finally, a negative checker (n = 0.6 accuracy, for example)
To put this into code (full repository), we can use simple Python:
def primary_parser(self, customer_input: str) -> Dict[str, str]:
    """
    Primary parser: Direct command with format expectations.
    """
    prompt = f"""
    Extract the category of the customer service call from the following text as a JSON object with key 'call_type'.
    The call type must be one of: {', '.join(self.call_types)}.
    If the category cannot be determined, return {{'call_type': null}}.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    try:
        # Try to parse the response as JSON
        result = json.loads(response.content.strip())
        return result
    except json.JSONDecodeError:
        # If JSON parsing fails, try to extract the call type from the response text
        for call_type in self.call_types:
            if call_type in response.content:
                return {"call_type": call_type}
        return {"call_type": None}
def backup_parser(self, customer_input: str) -> Dict[str, str]:
    """
    Backup parser: Chain-of-thought approach with formatting instructions.
    """
    prompt = f"""
    First, identify the major issue or concern in the customer's message.
    Then, match it to one of the following categories: {', '.join(self.call_types)}.
    Think through each category and determine which one best fits the customer's issue.
    Return your answer as a JSON object with key 'call_type'.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    try:
        # Try to parse the response as JSON
        result = json.loads(response.content.strip())
        return result
    except json.JSONDecodeError:
        # If JSON parsing fails, try to extract the call type from the response text
        for call_type in self.call_types:
            if call_type in response.content:
                return {"call_type": call_type}
        return {"call_type": None}
def negative_checker(self, customer_input: str) -> str:
    """
    Negative checker: Determines if the text contains enough information to categorize.
    """
    prompt = f"""
    Does this customer service call contain enough information to categorize it into one of these types:
    {', '.join(self.call_types)}?
    Answer only 'yes' or 'no'.
    Customer input: "{customer_input}"
    """
    response = self.model.invoke(prompt)
    answer = response.content.strip().lower()
    if "yes" in answer:
        return "yes"
    elif "no" in answer:
        return "no"
    else:
        # Default to yes if the answer is unclear
        return "yes"
@staticmethod
def validate_call_type(parsed_output: Dict[str, Any]) -> bool:
    """
    Schema validator: Checks if the output matches the expected schema.
    """
    # Check if the output matches the expected schema
    if not isinstance(parsed_output, dict) or 'call_type' not in parsed_output:
        return False
    # Verify the extracted call type is in our list of known types, or null
    call_type = parsed_output['call_type']
    return call_type is None or call_type in CALL_TYPES
By combining these with simple Boolean logic, we can get similar accuracy along with a confidence level for each answer:
def combine_results(
    primary_result: Dict[str, str],
    backup_result: Dict[str, str],
    negative_check: str,
    validation_result: bool,
    customer_input: str
) -> Dict[str, str]:
    """
    Combiner: Combines the results from the different strategies.
    """
    # If validation failed, use the backup
    if not validation_result:
        if RobustCallClassifier.validate_call_type(backup_result):
            return backup_result
        else:
            return {"call_type": None, "confidence": "low", "needs_human": True}
    # If the negative check says no call type can be determined but we extracted one, double-check
    if negative_check == 'no' and primary_result['call_type'] is not None:
        if backup_result['call_type'] is None:
            return {'call_type': None, "confidence": "low", "needs_human": True}
        elif backup_result['call_type'] == primary_result['call_type']:
            # Both agree despite the negative check, so go with it but mark medium confidence
            return {'call_type': primary_result['call_type'], "confidence": "medium"}
        else:
            return {"call_type": None, "confidence": "low", "needs_human": True}
    # If primary and backup agree, high confidence
    if primary_result['call_type'] == backup_result['call_type'] and primary_result['call_type'] is not None:
        return {'call_type': primary_result['call_type'], "confidence": "high"}
    # Default: use the primary result with medium confidence
    if primary_result['call_type'] is not None:
        return {'call_type': primary_result['call_type'], "confidence": "medium"}
    else:
        return {'call_type': None, "confidence": "low", "needs_human": True}
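For orientation, here is a rough sketch of how these pieces might be wired together into a single call; the classify_with_confidence name is mine, and the self.call_types and self.model attributes are assumed to live on the RobustCallClassifier class shown in the full repository:

def classify_with_confidence(self, customer_input: str) -> Dict[str, str]:
    """Run the redundant parsers and validators, then combine their outputs."""
    primary_result = self.primary_parser(customer_input)
    backup_result = self.backup_parser(customer_input)
    negative_check = self.negative_checker(customer_input)
    validation_result = RobustCallClassifier.validate_call_type(primary_result)
    return combine_results(
        primary_result,
        backup_result,
        negative_check,
        validation_result,
        customer_input,
    )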
The Decision Logic, Step by Step
Step 1: When Quality Control Fails
if not validation_result:
This is saying: "If our quality control expert (the validator) rejects the primary analysis, don't trust it." The system then tries to use the backup opinion instead. If that also fails validation, it flags the case for human review.
In everyday terms: "If something seems off about our first answer, let's try our backup method. If that also seems suspect, let's get a human involved."
Step 2: Handling Contradictions
if negative_check == 'no' and primary_result['call_type'] is not None:
This checks for a specific kind of contradiction: "Our negative checker says there shouldn't be a call type, but our primary analyzer found one anyway."
In such cases, the system looks to the backup analyzer to break the tie:
- If the backup agrees there's no call type → send to a human
- If the backup agrees with the primary → accept, but with medium confidence
- If the backup has a different call type → send to a human
This is like saying: "If one expert says 'this isn't classifiable' but another says it is, we need a tiebreaker or human judgment."
Step 3: When Experts Agree
if primary_result['call_type'] == backup_result['call_type'] and primary_result['call_type'] is not None:
When both the primary and backup analyzers independently reach the same conclusion, the system marks this with "high confidence". This is the best-case scenario.
In everyday terms: "If two different experts using different methods reach the same conclusion independently, we can be pretty confident they're right."
Step 4: Default Handling
If none of the special cases apply, the system defaults to the primary analyzer's result with "medium confidence." If even the primary analyzer couldn't determine a call type, it flags the case for human review.
Why This Approach Matters
This decision logic creates a robust system by:
- Reducing False Positives: The system only gives high confidence when multiple methods agree
- Catching Contradictions: When different parts of the system disagree, it either lowers confidence or escalates to humans
- Intelligent Escalation: Human reviewers only see cases that actually need their expertise
- Confidence Labeling: Results include how confident the system is, allowing downstream processes to treat high- and medium-confidence results differently
This approach mirrors how electronics use redundant circuits and voting mechanisms to prevent errors from causing system failures. In AI systems, this kind of thoughtful combination logic can dramatically reduce error rates while using human reviewers only where they add the most value.
Example
In 2015, the Philadelphia Water Department published counts of customer calls by category. Customer call comprehension is a very common process for agents to tackle. Instead of a human listening to each customer phone call, an agent can listen to the call much more quickly, extract the information, and categorize the call for further data analysis. For the water department, this is especially important because the faster critical issues are identified, the sooner those issues can be resolved.
We can construct an experiment. I used an LLM to generate fake transcripts of the phone calls in question by prompting: "Given the following category, generate a short transcript of that phone call":
{
"calls": [
{
"id": 5,
"type": "ABATEMENT",
"customer_input": "I need to report an abandoned property that has a major leak. Water is pouring out and flooding the sidewalk."
},
{
"id": 7,
"type": "AMR (METERING)",
"customer_input": "Can someone check my water meter? The digital display is completely blank and I can't read it."
},
{
"id": 15,
"type": "BTR/O (BAD TASTE & ODOR)",
"customer_input": "My tap water smells like rotten eggs. Is it safe to drink?"
}
]
}
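For completeness, here is a minimal sketch of how such transcripts could be generated; the generate_transcript helper and the exact prompt wording are illustrative assumptions, not the code I actually used:

from langchain_anthropic import ChatAnthropic

def generate_transcript(category: str) -> str:
    """Ask the model to fabricate a short customer call transcript for a given category."""
    model = ChatAnthropic(model='claude-3-7-sonnet-latest')
    prompt = f"Given the following category, generate a short transcript of that phone call: {category}"
    response = model.invoke(prompt)
    return response.content.strip()

# Example usage for one category from the water department data
print(generate_transcript("BTR/O (BAD TASTE & ODOR)"))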
Now, we can set up the experiment with a more traditional LLM-as-a-judge evaluation (full implementation here):
def classify(customer_input):
    CALL_TYPES = [
        "RESTORE", "ABATEMENT", "AMR (METERING)", "BILLING", "BPCS (BROKEN PIPE)", "BTR/O (BAD TASTE & ODOR)",
        "C/I - DEP (CAVE IN/DEPRESSION)", "CEMENT", "CHOKED DRAIN", "CLAIMS", "COMPOST"
    ]
    model = ChatAnthropic(model='claude-3-7-sonnet-latest')
    prompt = f"""
    You are a customer service AI for a water utility company. Classify the following customer input into one of these categories:
    {', '.join(CALL_TYPES)}
    Customer input: "{customer_input}"
    Respond with just the category name, nothing else.
    """
    # Get the response from Claude
    response = model.invoke(prompt)
    predicted_type = response.content.strip()
    return predicted_type
By passing just the transcript into the LLM, we can isolate knowledge of the actual category from the extracted category that is returned, and compare the two.
def compare(call_id, customer_input, actual_type):
    predicted_type = classify(customer_input)
    result = {
        "id": call_id,
        "customer_input": customer_input,
        "actual_type": actual_type,
        "predicted_type": predicted_type,
        "correct": actual_type == predicted_type
    }
    return result
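Here is a minimal sketch of the evaluation loop, assuming the fabricated calls are stored in a JSON file shaped like the sample above (the calls.json filename is illustrative):

import json

# Load the fabricated dataset: a list of calls with "id", "type", and "customer_input" fields
with open("calls.json") as f:
    calls = json.load(f)["calls"]

results = [compare(call["id"], call["customer_input"], call["type"]) for call in calls]

correct = sum(1 for r in results if r["correct"])
metrics = {"overall_accuracy": correct / len(results), "correct": correct, "total": len(results)}
print(json.dumps({"metrics": metrics}, indent=2))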
Running this against the entire fabricated data set with Claude 3.7 Sonnet (a state-of-the-art model, as of writing) is very performant, with 91% of calls being correctly categorized:
"metrics": {
"overall_accuracy": 0.91,
"correct": 91,
"total": 100
}
If these were real calls and we didn't have prior knowledge of the category, we'd still need to review all 100 phone calls to find the 9 incorrectly categorized calls.
By implementing our robust decision circuit above, we get similar accuracy along with confidence in those answers. In this case, 87% accuracy overall, but 92.5% accuracy in our high-confidence answers.
{
"metrics": {
"overall_accuracy": 0.87,
"correct": 87,
"total": 100
},
"confidence_metrics": {
"high": {
"count": 80,
"correct": 74,
"accuracy": 0.925
},
"medium": {
"count": 18,
"correct": 13,
"accuracy": 0.722
},
"low": {
"count": 2,
"correct": 0,
"accuracy": 0.0
}
}
}
We'd like 100% accuracy in our high-confidence answers, so there's still work to be done. What this approach lets us do is drill into why high-confidence answers were inaccurate. In this case, poor prompting and the simple validation capability don't catch all issues, resulting in classification errors. These capabilities can be improved iteratively to reach 100% accuracy in high-confidence answers.
Enhanced Filtering for High Confidence
The current system marks responses as "high confidence" when the primary and backup analyzers agree. To reach higher accuracy, we must be more selective about what qualifies as "high confidence":
# Modified high-confidence logic
if (primary_result['call_type'] == backup_result['call_type'] and
        primary_result['call_type'] is not None and
        validation_result and
        negative_check == 'yes' and
        additional_validation_metrics > threshold):
    return {'call_type': primary_result['call_type'], "confidence": "high"}
By adding more qualification criteria, we'll have fewer "high confidence" results, but they'll be more accurate.
Additional Validation Techniques
Other ideas include the following:
Tertiary Analyzer: Add a third independent analysis method
# Only mark high confidence if all three agree
if primary_result['call_type'] == backup_result['call_type'] == tertiary_result['call_type']:
Historical Pattern Matching: Compare against historically correct results (think vector search)
if similarity_to_known_correct_cases(primary_result) > 0.95:
Adversarial Testing: Apply small variations to the input and check whether the classification stays stable (a sketch of this idea follows the snippet below)
variations = generate_input_variations(customer_input)
if all(analyze_call_type(var) == primary_result['call_type'] for var in variations):
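To make the adversarial-testing idea concrete, here is a hedged sketch; generate_input_variations and its perturbations are hypothetical stand-ins, and analyze_call_type is whichever classifier you want to stress-test:

def generate_input_variations(customer_input: str) -> list:
    """Hypothetical perturbations: same meaning, slightly different surface form."""
    return [
        customer_input.lower(),
        customer_input.upper(),
        f"Hi, thanks for taking my call. {customer_input}",
        customer_input.replace(".", "!"),
    ]

def classification_is_stable(customer_input: str, analyze_call_type) -> bool:
    """Re-classify each variation and require the predicted call type to never change."""
    baseline = analyze_call_type(customer_input)
    return all(analyze_call_type(variation) == baseline
               for variation in generate_input_variations(customer_input))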
Generic Formula for Human Interventions in an LLM Extraction System
Full derivation available here.
- N = Total number of executions (10,000 in our example)
- p_1 = Primary parser accuracy (0.8 in our example)
- p_2 = Backup parser accuracy (0.8 in our example)
- v = Schema validator effectiveness (0.7 in our example)
- n = Negative checker effectiveness (0.6 in our example)
- H = Number of human interventions required
- E_final = Final undetected errors
- m = Number of independent validators (here the schema validator and the negative checker, so m = 2)

When both parsers fail on the same input, the expected number of parser-level failures is

N × (1 - p_1) × (1 - p_2)

Of those failures, the ones caught by at least one validator are escalated to a human, and the rest slip through undetected:

H = N × (1 - p_1)(1 - p_2) × [1 - (1 - v)(1 - n)]

E_final = N × (1 - p_1)(1 - p_2) × (1 - v)(1 - n)

With m independent validators in general, the (1 - v)(1 - n) term becomes the product of (1 - v_i) across all validators.
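Plugging the example numbers into these formulas, a minimal sketch (the function name is mine, not from the repository):

def expected_interventions_and_errors(N, p1, p2, v, n):
    """Expected human interventions (H) and undetected errors (E_final) under the model above."""
    both_parsers_fail = N * (1 - p1) * (1 - p2)      # cases neither parser gets right
    slip_probability = (1 - v) * (1 - n)             # probability both checks miss a failure
    H = both_parsers_fail * (1 - slip_probability)   # caught failures, escalated to a human
    E_final = both_parsers_fail * slip_probability   # failures nobody catches
    return H, E_final

H, E_final = expected_interventions_and_errors(N=10_000, p1=0.8, p2=0.8, v=0.7, n=0.6)
print(H, E_final, H / 10_000)   # 352.0 human interventions, 48.0 undetected errors, 0.0352 H_rate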
Optimized System Design
The formula reveals key insights:
- Adding parsers has diminishing returns but always improves accuracy
- The system accuracy is bounded by 1 - (1 - p_1)(1 - p_2)(1 - v)(1 - n), which for our example is 1 - 0.0048 = 0.9952
- Human interventions scale linearly with the total number of executions N
For our example:

H = 10,000 × 0.2 × 0.2 × [1 - 0.3 × 0.4] = 10,000 × 0.04 × 0.88 = 352, so H_rate = H / N = 3.52%, and E_final = 10,000 × 0.04 × 0.12 = 48.
We can use this calculated H_rate to track the efficacy of our solution in real time. If our human intervention rate starts creeping above 3.5%, we know that the system is breaking down. If our human intervention rate is steadily decreasing below 3.5%, we know our improvements are working as expected.
Cost Function
We can also establish a cost function to help us tune the system:

C_total = c_p × m + c_h × H + c_e × E_final
where:
- c_p = Cost per parser run ($0.10 in our example)
- m = Number of parser executions (2 × N in our example)
- H = Number of cases requiring human intervention (352 from our example)
- c_h = Cost per human intervention ($200, for example: 4 hours at $50/hour)
- c_e = Cost per undetected error ($1,000, for example)

For our example, C_total = 0.10 × 20,000 + 200 × 352 + 1,000 × 48 = $2,000 + $70,400 + $48,000 = $120,400.
By breaking cost down into the cost of human intervention and the cost of undetected errors, we can tune the system overall. In this example, if the cost of human intervention ($70,400) is too high, we can focus on increasing high-confidence results. If the cost of undetected errors ($48,000) is too high, we can introduce more parsers to lower undetected error rates.
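Here is a minimal sketch of that cost function with the example values plugged in (the function name is mine):

def total_cost(N, parser_runs_per_call, H, E_final, c_p=0.10, c_h=200.0, c_e=1000.0):
    """C_total = parser cost + human intervention cost + undetected error cost."""
    m = parser_runs_per_call * N          # parser executions (2 × N in our example)
    return c_p * m + c_h * H + c_e * E_final

# 352 human interventions and 48 undetected errors from the formulas above
print(total_cost(N=10_000, parser_runs_per_call=2, H=352, E_final=48))
# 2,000 (parsers) + 70,400 (humans) + 48,000 (errors) = 120400.0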
Of course, cost functions are most useful as ways to explore how to optimize the situations they describe.
From our scenario above, to decrease the number of undetected errors, E_final, by 50%, where
- p_1 and p_2 = 0.8,
- v = 0.7, and
- n = 0.6,
we have three options (a quick sketch comparing them follows the list):
- Add a new parser with 50% accuracy and include it as a tertiary analyzer. Note this comes with a trade-off: your cost to run more parsers increases, along with an increase in human intervention cost.
- Improve the two existing parsers by 10% each. That may or may not be possible given the difficulty of the task these parsers are performing.
- Improve the validator process by 15%. Again, this increases cost via human intervention.
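A quick sketch checking two of these options against the E_final formula; option 2's exact effect depends on how the 10% improvement is measured, so it is only noted in a comment:

def undetected_errors(N, parser_accuracies, validator_effectiveness):
    """E_final = N × product of parser failure rates × product of validator miss rates."""
    errors = N
    for p in parser_accuracies:
        errors *= (1 - p)
    for v in validator_effectiveness:
        errors *= (1 - v)
    return errors

baseline = undetected_errors(10_000, [0.8, 0.8], [0.7, 0.6])        # 48.0
option_1 = undetected_errors(10_000, [0.8, 0.8, 0.5], [0.7, 0.6])   # 24.0: add a 50%-accurate tertiary parser
option_3 = undetected_errors(10_000, [0.8, 0.8], [0.85, 0.6])       # 24.0: raise the validator from 0.7 to 0.85
# Option 2 (improving both parsers) shrinks the (1 - p1)(1 - p2) term quadratically,
# so even modest per-parser gains cut E_final substantially.
print(baseline, option_1, option_3)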
The Future of AI Reliability: Building Trust Through Precision
As AI systems become increasingly integrated into critical aspects of business and society, the pursuit of perfect accuracy will become a requirement, especially in sensitive applications. By adopting these circuit-inspired approaches to AI decision-making, we can build systems that not only scale efficiently but also earn the deep trust that comes only from consistent, reliable performance. The future belongs not to the most powerful single models, but to thoughtfully designed systems that combine multiple perspectives with strategic human oversight.
Just as digital electronics evolved from unreliable components into computers we trust with our most important data, AI systems are now on a similar journey. The frameworks described in this article represent early blueprints for what will ultimately become the standard architecture for mission-critical AI: systems that don't just promise reliability, but mathematically guarantee it. The question is no longer whether we can build AI systems with near-perfect accuracy, but how quickly we can implement these principles across our most important applications.
