Generating Structured Outputs from LLMs


The most familiar interface for interacting with LLMs is the classic chat UI found in ChatGPT, Gemini, or DeepSeek. The interface is simple: the user types a body of text and the model responds with another, which may or may not follow a particular structure. Since humans can understand unstructured natural language, this interface is suitable and quite effective for the audience it was designed for.

However, the potential user base of LLMs is far larger than the 8 billion humans living on Earth. It extends to the tens of millions of software programs that could harness the power of these large generative models. Unlike humans, software programs cannot understand unstructured data, which prevents them from exploiting the information generated by these neural networks.

To address this issue, various techniques have been developed to make LLMs generate outputs that follow a predefined schema. This article reviews three of the most popular approaches for producing structured outputs from LLMs. It is written for engineers interested in integrating LLMs into their software applications.

Structured Output Generation

Structured output generation from LLMs involves using these models to produce data that adheres to a predefined schema, rather than generating unstructured free-form text. The schema can be defined in various formats, with JSON being among the most common. For example, when using JSON, the schema specifies the expected keys and the data types (such as int, string, float, etc.) for each value. The LLM then outputs a JSON object that includes only the defined keys, with appropriately typed values.

There are many situations where structured output is required from LLMs. Turning unstructured bodies of text into structured records is one large application area of this technology. You can use a model to extract specific information from long documents or even images (using vision language models, VLMs). For example, you can use a general-purpose VLM to extract the purchase date, total price, and store name from receipts.
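For instance, a hypothetical schema for the receipt example could look like the Pydantic class below; the model would then be expected to reply with a matching JSON object.

from pydantic import BaseModel

# A hypothetical schema for the receipt example above.
class Receipt(BaseModel):
    store_name: str
    purchase_date: str  # e.g. "2024-05-17"
    total_price: float

# A conforming model output would then look like:
# {"store_name": "Corner Grocery", "purchase_date": "2024-05-17", "total_price": 23.10}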

There are several techniques for generating structured outputs from LLMs. This article discusses three:

  1. Relying on API Providers
  2. Prompting and Reprompting Strategies
  3. Constrained Decoding

Relying on API Providers

Several LLM API providers, including OpenAI and Google’s Gemini, allow users to define a schema for the model’s output. The schema can be defined as a Pydantic class and provided to the API endpoint. If you are using LangChain, you can follow this tutorial to integrate structured outputs into your application.
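As an illustration, here is a minimal sketch of what this can look like with the OpenAI Python SDK. The exact method names can differ between SDK versions, so treat it as a sketch rather than a definitive reference.

from pydantic import BaseModel
from openai import OpenAI

class Receipt(BaseModel):
    store_name: str
    purchase_date: str
    total_price: float

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Corner Grocery, 2024-05-17, total $23.10"}],
    response_format=Receipt,  # the Pydantic class serves as the schema
)
print(completion.choices[0].message.parsed)  # a Receipt instance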

Simplicity is the greatest strength of this approach. You define the required schema in a way familiar to you, pass it to the API provider, and sit back and relax as the provider does all of the work for you.

Using this method, however, limits you to API providers that offer this feature. That restricts the growth and flexibility of your projects, as it shuts the door on using other models, particularly open-source ones. If a provider suddenly decides to raise the price of the service, you will be forced either to accept the extra costs or to look for another provider.

Furthermore, it helps to understand what the service provider actually does behind the scenes. The provider follows a certain approach to generate the structured output for you. Knowledge of the underlying technology makes development easier and speeds up debugging and error analysis. For these reasons, grasping the underlying science is well worth the effort.

Prompting and Reprompting-Based Techniques

If you have chatted with an LLM before, this technique has probably already crossed your mind. If you want the model to follow a certain structure, just tell it to do so! In the system prompt, instruct the model to follow the desired structure, provide a few examples, and ask it not to add any additional text or explanation.

After the model responds to the user request and the system receives the output, you use a parser to transform the sequence of bytes into a suitable in-memory representation. If parsing succeeds, congratulate yourself and thank the power of prompt engineering. If parsing fails, your system will have to recover from the error.

Prompting is Not Enough

The issue with prompting is unreliability. On its own, prompting is not enough to trust a model to follow a required structure. The model might add extra explanation, omit certain fields, or use an incorrect data type. Prompting can and should be coupled with error recovery techniques that handle the case where the model defies the schema, which is detected by a parsing failure.

Some people might think of a parser as a boolean function: it takes a string as input, checks its adherence to predefined grammar rules, and returns a simple yes-or-no answer. In reality, parsers are more sophisticated than that and provide much richer information than ‘follows the structure’ or ‘does not follow the structure’.

Parsers can detect mistakes and pinpoint the offending tokens in the input text according to the grammar rules (Aho et al. 2007, 192–96). This gives us useful details about where and how the input string deviates from the schema. For example, it is the parser that reports a missing-semicolon error when you compile Java code.
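In our setting, the parser is whatever validates the model’s JSON reply. A small illustration with Pydantic shows how much richer that feedback is than a plain yes or no:

from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

try:
    # The model replied with the wrong data type for "age".
    Person.model_validate_json('{"name": "John", "age": "unknown"}')
except ValidationError as err:
    print(err)
    # 1 validation error for Person
    # age
    #   Input should be a valid integer, unable to parse string as an integer ...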

Figure 1 depicts the flow used in prompting-based techniques.

Figure 1: General flow of prompting and reprompting techniques. Generated using Mermaid by the author.

Prompting Tools

One of the most popular libraries for prompt-based structured output generation from LLMs is Instructor, a Python library with over 11k stars on GitHub. It supports data definition with Pydantic, integrates with over 15 providers, and performs automatic retries on parsing failure. In addition to Python, the package is also available in TypeScript, Go, Ruby, and Rust.

The beauty of Instructor lies in its simplicity. All you need to do is define a Pydantic class, initialize a client with the model name and API key (if required), and send your request. The sample code below, from the docs, shows how simple Instructor is to use.

import instructor
from pydantic import BaseModel
from openai import OpenAI


class Person(BaseModel):
    name: str
    age: int
    occupation: str


client = instructor.from_openai(OpenAI())
person = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Person,
    messages=[
        {
          "role": "user",
          "content": "Extract: John is a 30-year-old software engineer"
        }
    ],
)
print(person)  # Person(name='John', age=30, occupation='software engineer')

The Cost of Reprompting

As convenient as the reprompting technique can be, it comes at a hefty cost. LLM usage cost, whether provider API fees or GPU time, scales roughly linearly with the number of input tokens and the number of generated tokens.

As mentioned earlier, prompting-based techniques may require reprompting, and each reprompt costs roughly the same as the original prompt. Hence, the total cost scales linearly with the number of reprompts.

If you are going to use this technique, keep the cost problem in mind. Nobody wants to be surprised by a large bill from an API provider. One way to avoid surprise costs is to build an emergency brake into the system: a hard-coded limit on the number of allowed reprompts. This puts an upper bound on the cost of a single prompt-and-reprompt cycle, as in the sketch below.
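A minimal sketch of such a brake follows; call_llm is a hypothetical helper that sends the message list to your model of choice and returns its raw text reply.

import json

MAX_REPROMPTS = 3  # emergency brake: hard-coded upper bound on retries

def get_structured_output(call_llm, prompt):
    """Prompt, parse, and reprompt with the parser's error message."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(1 + MAX_REPROMPTS):
        reply = call_llm(messages)
        try:
            return json.loads(reply)  # parsing succeeded
        except json.JSONDecodeError as err:
            # Feed the parser's feedback back to the model and try again.
            messages.append({"role": "assistant", "content": reply})
            messages.append({
                "role": "user",
                "content": f"Your reply was not valid JSON ({err}). "
                           "Answer again with JSON only.",
            })
    raise RuntimeError("No valid JSON within the reprompt budget")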

Constrained Decoding

Unlike prompting, constrained decoding does not need retries to generate a valid, structure-following output. It combines computational linguistics techniques with knowledge of the token generation process in LLMs to produce outputs that are guaranteed to follow the required schema.

How It Works

LLMs are autoregressive models: they generate one token at a time, and the tokens generated so far are fed back as input to the model for the next step.

The last layer of an LLM is essentially a logistic regression head that computes, for every token in the model’s vocabulary, the probability of that token following the input sequence. The model calculates a logit value for each token; these values are then scaled and transformed into probabilities using the softmax function.
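As a quick illustration of that last step, here is a tiny sketch that turns a made-up logit vector into next-token probabilities with the softmax function:

import numpy as np

logits = np.array([2.0, 1.0, 0.1])       # one logit per token in a toy vocabulary
probs = np.exp(logits - logits.max())    # subtract the max for numerical stability
probs /= probs.sum()
print(probs)                             # ≈ [0.66 0.24 0.10], sums to 1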

Constrained decoding produces structured outputs by limiting the set of available tokens at each generation step. The allowed tokens are chosen so that the final output obeys the required structure. To see how this set of valid next tokens can be determined, we need to turn to regular expressions.

Regular expressions (regex) are used to define specific patterns of text and to check whether a piece of text matches an expected structure or schema. In other words, regex is a language that can be used to define the structures we expect from LLMs. Thanks to its popularity, a wide range of tools and libraries can transform other forms of structure definition, such as Pydantic classes and JSON Schema, into regex. Because of this flexibility and the wide availability of conversion tools, we can now reframe our goal and focus on using LLMs to generate outputs that follow a regex pattern.
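As a toy example, the hypothetical pattern below checks whether a string follows a YYYY-MM-DD date structure:

import re

date_pattern = r"\d{4}-\d{2}-\d{2}"  # a toy schema: dates like 2024-05-17

print(re.fullmatch(date_pattern, "2024-05-17") is not None)    # True
print(re.fullmatch(date_pattern, "May 17, 2024") is not None)  # False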

Deterministic Finite Automata (DFA)

One way a regex pattern can be compiled and tested against a body of text is by transforming the pattern into a deterministic finite automaton (DFA). A DFA is simply a state machine used to check whether a string follows a certain structure or pattern.

A DFA consists of five components:

  1. A set of tokens (called the alphabet of the DFA)
  2. A set of states
  3. A set of transitions. Each transition connects two states (possibly connecting a state with itself) and is annotated with a token from the alphabet
  4. A start state (marked with an input arrow)
  5. One or more final states (marked as double circles)

A string is a sequence of tokens. To test a string against the pattern defined by a DFA, you start at the start state and loop over the string’s tokens, taking the transition that corresponds to the current token at each step. If at any point there is no transition from the current state for the current token, matching fails and the string defies the schema. If matching ends at one of the final states, the string matches the pattern; otherwise it fails as well. The sketch below illustrates this procedure.
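A minimal sketch of this matching procedure, with the DFA represented as a plain transition table, might look as follows:

def matches(transitions, start, finals, tokens):
    """Check whether a token sequence is accepted by a DFA.

    transitions maps (state, token) pairs to the next state,
    start is the start state, and finals is the set of final states.
    """
    state = start
    for token in tokens:
        if (state, token) not in transitions:
            return False          # no transition for this token: matching fails
        state = transitions[(state, token)]
    return state in finals        # accepted only if we stop in a final state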

Figure 2: Example of a DFA with the alphabet {a, b}, states {q0, q1, q2}, and a single final state, q2. Generated using Graphviz by the author.

For example, the string abab matches the pattern in Figure 2 because starting at q0 and following the transitions marked a, b, a, and b, in this order, lands us at q2, which is a final state.

On the other hand, the string abba does not match the pattern because its path ends at q0, which is not a final state.

A great property of regex is that it can be compiled into a DFA; after all, they are just two different ways of specifying patterns. A discussion of this transformation is out of scope for this article; the interested reader can check Aho et al. (2007, 152–66) for two techniques to perform it.

DFA for Valid Next Tokens Set

Figure 3: Example of a DFA generated from the regex a(b|c)*d. Generated using Graphviz by the author.

Let’s recap where we have arrived so far. We wanted a way to determine the set of valid next tokens so that the output follows a certain schema. We defined the schema using a regex and transformed it into a DFA. Now we will show that a DFA tells us the set of possible next tokens at any point during parsing, which is exactly what we need.

After constructing the DFA, we can easily determine, at any state, the set of valid next tokens: it is the set of tokens annotating the transitions that leave the current state.

Consider the DFA in Figure 3, for example. The following table shows the set of valid next tokens for each state; a small code sketch of this lookup follows the table.

State Valid Next Tokens
q0 {a}
q1 {b, c, d}
q2 {}
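
Assuming the Figure 3 DFA has the transitions implied by this table, a minimal sketch of the lookup could be:

# Transitions reconstructed from the table above (pattern a(b|c)*d).
transitions = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q1",
    ("q1", "c"): "q1",
    ("q1", "d"): "q2",
}

def valid_next_tokens(transitions, state):
    """Tokens labelling the transitions that leave `state`."""
    return {token for (src, token) in transitions if src == state}

print(valid_next_tokens(transitions, "q0"))  # {'a'}
print(valid_next_tokens(transitions, "q1"))  # {'b', 'c', 'd'}
print(valid_next_tokens(transitions, "q2"))  # set()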

Applying the DFA to LLMs

Coming back to our structured output problem, we can transform our schema into a regex and then into a DFA. The alphabet of this DFA is set to the LLM’s vocabulary (the set of all tokens the model can generate). As the model generates tokens, we move through the DFA, starting at the start state. At each step, we can determine the set of valid next tokens.

The trick happens at the softmax scaling stage. By masking the logits of all tokens that are not in the valid set (effectively giving them zero probability), we compute probabilities only over valid tokens, forcing the model to generate a sequence of tokens that follows the schema. That way, we can generate structured outputs with no retries and no additional generation cost!
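A minimal sketch of that masking step, using made-up logits over a four-token vocabulary, could look like this:

import numpy as np

def constrained_probs(logits, vocab, valid_tokens):
    """Mask invalid tokens before the softmax so only schema-valid
    tokens can receive probability mass."""
    masked = np.where(np.isin(vocab, list(valid_tokens)), logits, -np.inf)
    probs = np.exp(masked - masked.max())
    return probs / probs.sum()

vocab = np.array(["a", "b", "c", "d"])
logits = np.array([0.2, 1.5, 0.3, 2.0])
print(constrained_probs(logits, vocab, {"b", "c", "d"}))
# ≈ [0.   0.34 0.1  0.56]  ("a" is impossible, the rest sum to 1)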

Constrained Decoding Tools

One of the most popular Python libraries for constrained decoding is Outlines (Willard and Louf 2023). It is very easy to use and integrates with many LLM providers and runtimes such as OpenAI, Anthropic, Ollama, and vLLM.

You can define the schema using a Pydantic class, for which the library handles the transformation, or directly as a regex pattern.

from pydantic import BaseModel
from typing import Literal
import outlines
import openai

class Customer(BaseModel):
    name: str
    urgency: Literal["high", "medium", "low"]
    issue: str

client = openai.OpenAI()
model = outlines.from_openai(client, "gpt-4o")

customer = model(
    "Alice needs help with login issues ASAP",
    Customer
)
# ✓ Always returns a valid Customer object
# ✓ No parsing, no errors, no retries

The code snippet above, taken from the docs, shows how simple Outlines is to use. For more information on the library, you can check the docs and the dottxt blog.

Conclusion

Structured output generation from LLMs is a powerful tool that expands the possible use cases of LLMs beyond simple human chat. This article discussed three approaches: relying on API providers, prompting and reprompting strategies, and constrained decoding. For most scenarios, constrained decoding is the favoured method because of its flexibility and low cost. Moreover, popular libraries like Outlines make it easy to introduce constrained decoding into software projects.

If you want to learn more about constrained decoding, I highly recommend this course from deeplearning.ai and dottxt, the creators of the Outlines library. Through videos and code examples, it gives you hands-on experience generating structured outputs from LLMs using the techniques discussed in this post.

References

[1] Aho, Alfred V., Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools, 2nd ed. (2007), Pearson/Addison Wesley

[2] Willard, Brandon T., and Rémi Louf, Efficient Guided Generation for Large Language Models (2023), https://arxiv.org/abs/2307.09702.
