How I Cope with Hallucinations at an AI Startup

I work as an AI Engineer in a specific niche: document automation and data extraction. In my industry, using Large Language Models has presented numerous challenges when it comes to hallucinations. Imagine an AI misreading an invoice amount as $100,000 instead of $1,000, leading to a 100x overpayment. When faced with such risks, preventing hallucinations becomes a critical aspect of building robust AI solutions. These are some of the key principles I focus on when designing solutions that are prone to hallucinations.

There are many ways to incorporate human oversight in AI systems. Sometimes, extracted information is always presented to a human for review. For example, a parsed resume might be shown to a user before submission to an Applicant Tracking System (ATS). More often, the extracted information is automatically added to a system and only flagged for human review if potential issues arise.

A critical part of any AI platform is deciding when to include human oversight. This often involves several types of validation rules:

1. Simple rules, such as ensuring line-item totals match the invoice total (a minimal sketch of this check follows the figure below).

2. Lookups and integrations, like validating the total amount against a purchase order in an accounting system, or verifying payment details against a supplier's previous records.

[Image: validation popup for an invoice. The text "30,000" is highlighted, with an overlay reading: Payment Amount Total | Expected line item totals to equal document total | Confirm anyway? | Remove?]
An example validation error where there needs to be a human in the loop. Source: Affinda
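
As an illustration of rule (1), the check can be as simple as comparing sums. The sketch below assumes the amounts have already been extracted and parsed, and ignores currency, rounding, and tax:

```python
# A minimal sketch of validation rule (1), assuming amounts are already extracted
# and parsed into Decimals. Real rules would also handle currency, rounding and tax.
from decimal import Decimal

def needs_human_review(line_items: list[Decimal], document_total: Decimal) -> bool:
    """Flag the document for review when line-item totals don't match the stated total."""
    return sum(line_items, Decimal("0")) != document_total

# A document claiming a $30,000 total with line items summing to $1,000 gets flagged.
print(needs_human_review([Decimal("400"), Decimal("600")], Decimal("30000")))  # True
```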

These processes are a good thing. But we also don't want an AI that constantly triggers safeguards and forces manual human intervention. Hallucinations can defeat the purpose of using AI if it's constantly triggering these safeguards.

One approach to preventing hallucinations is to use Small Language Models (SLMs) which are "extractive". This means the model labels parts of the document, and we collect these labels into structured outputs. I recommend trying to use an SLM where possible rather than defaulting to LLMs for every problem. For example, in resume parsing for job boards, waiting 30+ seconds for an LLM to process a resume is often unacceptable. For this use case we've found an SLM can provide results in 2-3 seconds with higher accuracy than larger models like GPT-4o.
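
As a rough sketch of what "extractive" means in practice, a small token-classification model can label spans of the original text, and those spans become the structured output. The model and label set below are stand-ins for illustration, not what we run in production:

```python
# A minimal sketch of an "extractive" SLM: a small token-classification model labels
# spans of the source text, and those spans become the structured output.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",    # small, generic NER model used as a stand-in
    aggregation_strategy="simple",  # merge sub-word tokens into whole spans
)

text = "Jane Doe worked as a Data Engineer at Acme Corp in Sydney from 2019 to 2023."
spans = ner(text)

# Every extracted value is a span of the source text, so the model cannot invent content.
structured = [{"label": s["entity_group"], "text": text[s["start"]:s["end"]]} for s in spans]
print(structured)  # e.g. [{'label': 'PER', 'text': 'Jane Doe'}, {'label': 'ORG', 'text': 'Acme Corp'}, ...]
```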

An example from our pipeline

In our startup, a document can be processed by up to 7 different models, only 2 of which might be an LLM. That's because an LLM isn't always the best tool for the job. Some steps, such as Retrieval Augmented Generation, rely on a small multimodal model to create useful embeddings for retrieval. The first step, detecting whether something is even a document, uses a small and super-fast model that achieves 99.9% accuracy. It's important to break a problem down into small chunks and then figure out which parts LLMs are best suited for. This way, you reduce the chances of hallucinations occurring.
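
A rough sketch of this idea (not our actual pipeline): most stages are small, fast, specialised models, and a generative model is only invoked for the steps it's genuinely suited to:

```python
# Illustrative sketch only (not our actual pipeline): most stages are small,
# specialised models; a generative LLM is reserved for the steps that need it.
# Function bodies are placeholders standing in for real models.

def looks_like_a_document(file_bytes: bytes) -> bool:
    """Stage 1: tiny, fast classifier run on every upload."""
    return bool(file_bytes)  # placeholder for a small, high-accuracy model

def extract_fields(text: str) -> dict:
    """Stage 2: extractive SLM that labels spans of the input text."""
    return {"total": "1,000.00"}  # placeholder for a token-classification model

def summarise(text: str) -> str:
    """Stage 3: a step where a generative LLM genuinely is the best tool."""
    raise NotImplementedError("call an LLM provider here")

def process(file_bytes: bytes, text: str, want_summary: bool = False) -> dict:
    if not looks_like_a_document(file_bytes):
        return {"status": "rejected"}        # cheap exit; the LLM is never called
    result = {"status": "ok", "fields": extract_fields(text)}
    if want_summary:
        result["summary"] = summarise(text)  # LLM only invoked when needed
    return result
```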

Distinguishing Hallucinations from Mistakes

I make a point of distinguishing between hallucinations (the model inventing information) and mistakes (the model misinterpreting existing information). For example, selecting the wrong dollar amount as a receipt total is a mistake, while generating a non-existent amount is a hallucination. Extractive models can only make mistakes, while generative models can make both mistakes and hallucinations.

When using generative models, we need a way of eliminating hallucinations.

Grounding refers to any technique that forces a generative AI model to justify its outputs with reference to some authoritative information. How grounding is managed is a matter of risk tolerance for each project.

For example, a company with a general-purpose inbox might want to identify action items. Often, emails requiring action are sent directly to account managers. A general inbox that's full of invoices, spam, and simple replies ("thanks", "OK", etc.) has far too many messages for humans to check. What happens when actions are mistakenly sent to this general inbox? Actions often get missed. If a model makes mistakes but is generally accurate, it's already doing better than nothing. In this case, the tolerance for mistakes/hallucinations can be high.

Other situations might warrant a particularly low risk tolerance; think financial documents and "straight-through processing". This is where extracted information is automatically added to a system without review by a human. For example, a company might not allow invoices to be automatically added to an accounting system unless (1) the payment amount exactly matches the amount in the purchase order, and (2) the payment method matches the supplier's previous payment method.

Even when risks are low, I still err on the side of caution. Whenever I'm focused on information extraction, I follow a simple rule:

If text is extracted from a document, then it must exactly match text present in the document.

This is tricky when the information is structured (e.g. a table), especially because PDFs don't carry any information about the order of words on a page. For example, the description of a line-item might be split across multiple lines, so the goal is to draw a coherent box around the extracted text regardless of the left-to-right order of the words (or right-to-left in some languages).
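
As a rough illustration of that rule (assuming we have access to the document's raw text), a check like the following normalises whitespace, since PDFs don't preserve layout, and requires the extracted value to appear verbatim:

```python
import re

# A rough sketch of the exact-match rule. Whitespace is normalised because PDFs
# don't preserve word order or layout; everything else must match verbatim.
def normalise(s: str) -> str:
    return re.sub(r"\s+", " ", s).strip().lower()

def is_grounded(extracted: str, document_text: str) -> bool:
    """The extracted value must appear (verbatim, up to whitespace) in the source document."""
    return normalise(extracted) in normalise(document_text)

doc = "Payment Amount Total\n30,000.00\nDue: 30 June"
print(is_grounded("30,000.00", doc))   # True: present in the document
print(is_grounded("100,000.00", doc))  # False: not in the document, reject it
```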

Forcing the model to point to exact text in a document is "strong grounding". Strong grounding isn't limited to information extraction. For example, customer support chatbots might be required to quote (verbatim) from standardised responses in an internal knowledge base. This isn't always ideal, given that standardised responses might not actually be able to answer a customer's question.

Another tricky situation is when information must be inferred from context. For example, a medical assistant AI might infer the presence of a condition based on its symptoms without the medical condition being explicitly stated. Identifying where those symptoms were mentioned would be a form of "weak grounding". The justification for a response must exist in the context, but the exact output can only be synthesised from the supplied information. A further grounding step could be to force the model to look up the medical condition and justify that those symptoms are relevant. This may still need weak grounding because symptoms can often be expressed in many ways.
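
A minimal sketch of what weak grounding could look like in practice: the conclusion is allowed to be synthesised, but every cited evidence span must appear in the supplied context. The output shape and field names below are assumptions for illustration:

```python
# A minimal sketch of "weak grounding": the conclusion may be synthesised, but every
# cited evidence span must appear in the supplied context.
def weakly_grounded(output: dict, context: str) -> bool:
    """Accept a synthesised conclusion only if all its evidence spans exist in the context."""
    return all(span in context for span in output.get("evidence_spans", []))

context = "Patient reports frequent thirst, blurred vision, and unexplained weight loss."
output = {
    "inferred_condition": "possible diabetes",   # synthesised, not verbatim in context
    "evidence_spans": ["frequent thirst", "blurred vision", "unexplained weight loss"],
}
print(weakly_grounded(output, context))  # True: evidence is grounded, conclusion is inferred
```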

Using AI to solve increasingly complex problems can make it difficult to use grounding. For example, how do you ground outputs if a model is required to perform "reasoning" or to infer information from context? Here are some considerations for adding grounding to complex problems:

  1. Identify complex decisions that can be broken down into a set of rules. Rather than having the model generate an answer to the final decision, have it generate the components of that decision. Then use rules to derive the result. (Caveat: this can sometimes make hallucinations worse. Asking the model multiple questions gives it multiple opportunities to hallucinate. Asking it one question could be better. But we've found current models are generally worse at complex multi-step reasoning.)
  2. If something can be expressed in many ways (e.g. descriptions of symptoms), a first step could be to get the model to tag text and standardise it (usually known as "coding"). This can open opportunities for stronger grounding.
  3. Set up "tools" for the model to call which constrain the output to a very specific structure. We don't want to execute arbitrary code generated by an LLM. We want to create tools that the model can call and put restrictions on what's in those tools (see the sketch after this list).
  4. Wherever possible, include grounding in tool use — e.g. by validating responses against the context before sending them to a downstream system.
  5. Is there a way to validate the final output? If handcrafted rules are out of the question, could we craft a prompt for verification? (And follow the above rules for the verification model as well.)
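
To make points (3) and (4) concrete, here's a minimal sketch of a constrained "tool" whose arguments are validated both structurally and against the source document before anything is sent downstream. The schema, field names, and use of pydantic are assumptions for illustration, not a prescription:

```python
# A minimal sketch of points (3) and (4): a "tool" whose output is constrained to a
# specific structure, plus a grounding check before anything is sent downstream.
from pydantic import BaseModel, field_validator

class InvoiceTotal(BaseModel):
    """The structured arguments the model must produce when calling this tool."""
    amount_text: str  # the exact text span the model claims to have extracted
    currency: str

    @field_validator("currency")
    @classmethod
    def currency_is_three_letter_code(cls, v: str) -> str:
        if len(v) != 3 or not v.isalpha():
            raise ValueError("currency must be a 3-letter code")
        return v.upper()

def grounded_tool_call(raw_tool_args: dict, document_text: str) -> InvoiceTotal:
    """Validate the structure, then reject any value not present verbatim in the document."""
    result = InvoiceTotal(**raw_tool_args)        # structural constraint
    if result.amount_text not in document_text:   # strong grounding constraint
        raise ValueError("extracted amount is not grounded in the document")
    return result

doc = "Total due: $1,000.00 (AUD)"
print(grounded_tool_call({"amount_text": "$1,000.00", "currency": "aud"}, doc))
```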

In summary:

  • When it comes to information extraction, we don't tolerate outputs that aren't present in the original context.
  • We follow this up with verification steps that catch mistakes as well as hallucinations.
  • Anything we do beyond that is about risk assessment and risk minimisation.
  • Break complex problems down into smaller steps and identify whether an LLM is even needed.
  • For complex problems, use a systematic approach to identify verifiable tasks:

— Strong grounding forces LLMs to quote verbatim from trusted sources. It's always preferable to use strong grounding.

— Weak grounding forces LLMs to reference trusted sources but allows synthesis and reasoning.

— Where a problem can be broken down into smaller tasks, use strong grounding on those tasks wherever possible.
