In the previous article, we were introduced to structured outputs using OpenAI. Since their general availability release in the ChatCompletions API (v1.40.0), structured outputs have been applied across dozens of use cases, and have spawned quite a few threads on the OpenAI forums.
In this article, our aim is to give you a deeper understanding, dispel some misconceptions, and give you recommendations on how to apply structured outputs in the most optimal manner possible, across different scenarios.
Structured outputs are a way of forcing the output of an LLM to follow a pre-defined schema — usually a JSON schema. This works by transforming the schema into a context-free grammar (CFG), which is used during the token sampling step, together with the previously generated tokens, to determine which subsequent tokens are valid. It is helpful to think of it as a mask applied to the vocabulary during token generation.
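To build intuition for this masking, here is a toy sketch — not OpenAI's actual implementation — of how invalid next tokens can be excluded at sampling time by sending their logits to negative infinity (probability zero after softmax):

```python
import math

def mask_logits(logits: dict[str, float], valid_tokens: set[str]) -> dict[str, float]:
    # Tokens not permitted by the grammar get a logit of -inf,
    # i.e. zero probability after softmax
    return {tok: (score if tok in valid_tokens else -math.inf)
            for tok, score in logits.items()}

logits = {"{": 2.0, "hello": 1.5, "[": 0.7}
# At the very start of a JSON-object schema, only "{" is a valid next token:
masked = mask_logits(logits, valid_tokens={"{"})
```

In a real implementation the set of valid tokens is derived from the CFG state after each generated token; here it is hard-coded purely for illustration.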
The OpenAI API implementation in fact supports only a limited subset of JSON schema features. With more general structured-output solutions, such as Outlines, it is possible to use a somewhat larger subset of the JSON schema, and even to define completely custom non-JSON schemas — as long as one has access to an open-weight model. For the purposes of this article, we will assume the OpenAI API implementation.
According to the JSON Schema Core Specification, "JSON Schema asserts what a JSON document must look like, ways to extract information from it, and how to interact with it". JSON schema defines six primitive types — null, boolean, object, array, number and string. It also defines certain keywords, annotations, and specific behaviours. For example, we can specify in our schema that we expect an array, and add the annotation that `minItems` shall be `5`.
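As a concrete illustration, such a schema (shown here as a plain Python dict, mirroring the raw JSON) would look like this:

```python
import json

# A JSON schema stating we expect an array of strings with at least 5 items
schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 5,
}
print(json.dumps(schema, indent=2))
```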
Pydantic is a Python library that implements the JSON schema specification. We use Pydantic to build robust and maintainable software in Python. Since Python is a dynamically typed language, data scientists do not necessarily think in terms of variable types — these are often implied in their code. For example, a fruit might be specified as:
fruit = dict(
    name="apple",
    color="red",
    weight=4.2
)
…while a function declaration that returns "fruit" from some piece of data would often be specified as:
def extract_fruit(s):
    ...
    return fruit
Pydantic, on the other hand, allows us to generate a JSON-schema-compliant class, with properly annotated variables and type hints, making our code more readable/maintainable and in general more robust, i.e.
class Fruit(BaseModel):
    name: str
    color: Literal['red', 'green']
    weight: Annotated[float, Gt(0)]

def extract_fruit(s: str) -> Fruit:
    ...
    return fruit
OpenAI actually strongly recommends using Pydantic for specifying schemas, as opposed to specifying the "raw" JSON schema directly. There are several reasons for this. Firstly, Pydantic is guaranteed to adhere to the JSON schema specification, so it saves you additional pre-validation steps. Secondly, for larger schemas, it is less verbose, allowing you to write cleaner code, faster. Finally, the `openai` Python package actually does some housekeeping, like setting `additionalProperties` to `False` for you, whereas when defining your schema "by hand" using JSON, you would need to set these manually, for every object in your schema (failing to do so results in a rather annoying API error).
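To see what Pydantic generates on your behalf, you can inspect the produced JSON schema directly via `model_json_schema()` — here re-declaring the `Fruit` model from above, with the imports it needs:

```python
from typing import Annotated, Literal

from annotated_types import Gt  # ships as a dependency of pydantic v2
from pydantic import BaseModel

class Fruit(BaseModel):
    name: str
    color: Literal["red", "green"]
    weight: Annotated[float, Gt(0)]

# The generated schema marks all three fields as required,
# maps Literal to an enum, and Gt(0) to exclusiveMinimum
schema = Fruit.model_json_schema()
```

Note that `additionalProperties: False` is injected by the `openai` package when you pass the model to the API; it does not appear in the raw `model_json_schema()` output.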
As we alluded to previously, the ChatCompletions API provides a limited subset of the full JSON schema specification. There are many keywords that are not yet supported, such as `minimum` and `maximum` for numbers, and `minItems` and `maxItems` for arrays — annotations that would otherwise be very useful in reducing hallucinations, or constraining the output size.
Certain formatting features are also unavailable. For instance, the following Pydantic schema would result in an API error when passed to `response_format` in ChatCompletions:
class NewsArticle(BaseModel):
    headline: str
    subheading: str
    authors: List[str]
    date_published: datetime = Field(None, description="Date when article was published. Use ISO 8601 date format.")
It would fail because the `openai` package has no format handling for `datetime`, so instead you would need to set `date_published` as a `str` and perform format validation (e.g. ISO 8601 compliance) post-hoc.
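A minimal sketch of that post-hoc validation, using only the standard library (function name is illustrative):

```python
from datetime import datetime

def validate_date_published(value: str) -> datetime:
    # Raises ValueError if the string is not a valid ISO 8601 date/time,
    # so invalid model output fails loudly instead of propagating silently
    return datetime.fromisoformat(value)

dt = validate_date_published("2024-08-06T12:30:00")
```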
Other key limitations include the following:
- Hallucinations are still possible — for instance, when extracting product IDs, you would define in your response schema the following: `product_ids: List[str]`; while the output is guaranteed to produce a list of strings (product IDs), the strings themselves may be hallucinated, so in this use case you may want to validate the output against some pre-defined set of product IDs.
- The output is capped at 4096 tokens, or the lesser number you set in the `max_tokens` parameter — so even though the schema will be followed precisely, if the output is too large, it will be truncated and produce an invalid JSON — especially annoying on very large Batch API jobs!
- Deeply nested schemas with many object properties may yield API errors — there is a limitation on the depth and breadth of your schema, but in general it is best to stick to flat and simple structures — not only to avoid API errors, but also to squeeze out as much performance from the LLMs as possible (LLMs in general have trouble attending to deeply nested structures).
- Highly dynamic or arbitrary schemas are not possible — although recursion is supported, it is not possible to create a highly dynamic schema of, say, a list of arbitrary key-value objects, i.e. `[{"key1": "val1"}, {"key2": "val2"}, ..., {"keyN": "valN"}]`, since the "keys" in this case must be pre-defined; in such a scenario, the best option is not to use structured outputs at all, but instead to opt for standard JSON mode, and provide the instructions on the output structure within the system prompt.
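For the first limitation above — hallucinated product IDs — the post-hoc validation might be sketched like this (all names and IDs are illustrative):

```python
def filter_valid_ids(extracted_ids: list[str], known_ids: set[str]) -> list[str]:
    # Keep only IDs that actually exist in the pre-defined catalogue,
    # preserving the order the model returned them in
    return [pid for pid in extracted_ids if pid in known_ids]

catalogue = {"A1B2", "C3D4", "E5F6"}
valid = filter_valid_ids(["A1B2", "HALLUCINATED", "E5F6"], catalogue)
```

Depending on the use case, you may instead want to raise an error or re-prompt the model when an unknown ID appears, rather than silently dropping it.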
With all this in mind, we can now go through a couple of use cases with hints and tips on how to enhance performance when using structured outputs.
Creating flexibility using optional parameters
Let's say we are building a web scraping application, where our goal is to collect specific components from web pages. For each web page, we supply the raw HTML in the user prompt, give specific scraping instructions in the system prompt, and define the following Pydantic model:
class Webpage(BaseModel):
    title: str
    paragraphs: Optional[List[str]] = Field(None, description="Text contents enclosed within <p></p> tags.")
    links: Optional[List[str]] = Field(None, description="URLs specified by `href` field within <a></a> tags.")
    images: Optional[List[str]] = Field(None, description="URLs specified by the `src` field within the <img> tags.")
We might then call the API as follows…
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": "You are to parse HTML and return the parsed page components."
        },
        {
            "role": "user",
            "content": """
            <html>
            <title>Structured Outputs Demo</title>
            <body>
            <img src="test.gif">
            <p>Hello world!</p>
            </body>
            </html>
            """
        }
    ],
    response_format=Webpage
)
…with the following response:
{
    'images': ['test.gif'],
    'links': None,
    'paragraphs': ['Hello world!'],
    'title': 'Structured Outputs Demo'
}
A response schema supplied to the API using structured outputs must have every field marked as required. However, we can "emulate" optional fields and add more flexibility using the `Optional` type annotation. We could actually also use `Union[List[str], None]` — they are syntactically exactly the same. In both cases, we get a conversion to the `anyOf` keyword as per the JSON schema spec. In the example above, since there are no `<a>` tags present on the web page, the API still returns the `links` field, but it is set to `None`.
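You can verify this `anyOf` conversion yourself by inspecting the generated schema — here with a trimmed-down version of the `Webpage` model containing only the `links` field:

```python
from typing import List, Optional

from pydantic import BaseModel

class Webpage(BaseModel):
    title: str
    links: Optional[List[str]] = None

schema = Webpage.model_json_schema()
# Optional[List[str]] becomes anyOf: [{"type": "array", ...}, {"type": "null"}]
anyof = schema["properties"]["links"]["anyOf"]
```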
Reducing hallucinations using enums
We mentioned previously that even if the LLM is guaranteed to follow the provided response schema, it may still hallucinate the actual values. Adding to this, a recent paper found that enforcing a fixed schema on the outputs actually causes the LLM to hallucinate, or degrade in terms of its reasoning capabilities.
One way to overcome these limitations is to utilise enums as much as possible. Enums constrain the output to a very specific set of tokens, placing a probability of zero on everything else. For example, let's assume you are trying to re-rank product similarity results between a target product that contains a `description` and a unique `product_id`, and the top-5 products that were obtained using some vector similarity search (e.g. using a cosine distance metric). Each one of those top-5 products also contains the corresponding textual description and a unique ID. In your response, you simply wish to obtain the re-ranking 1–5 as a list (e.g. `[1, 4, 3, 5, 2]`), instead of getting a list of re-ranked product ID strings, which may be hallucinated or invalid. We set up our Pydantic models as follows…
class Rank(IntEnum):
    RANK_1 = 1
    RANK_2 = 2
    RANK_3 = 3
    RANK_4 = 4
    RANK_5 = 5

class RerankingResult(BaseModel):
    ordered_ranking: List[Rank] = Field(description="Provides ordered ranking 1-5.")
…and run the API like so:
response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """
            You are to rank the similarity of the candidate products against the target product.
            Ranking should be orderly, from the most similar, to the least similar.
            """
        },
        {
            "role": "user",
            "content": """
            ## Target Product
            Product ID: X56HHGHH
            Product Description: 80" Samsung LED TV

            ## Candidate Products
            Product ID: 125GHJJJGH
            Product Description: NVIDIA RTX 4060 GPU

            Product ID: 76876876GHJ
            Product Description: Sony Walkman

            Product ID: 433FGHHGG
            Product Description: Sony LED TV 56"

            Product ID: 777888887888
            Product Description: Blueray Sony Player

            Product ID: JGHHJGJ56
            Product Description: BenQ PC Monitor 37" 4K UHD
            """
        }
    ],
    response_format=RerankingResult
)
The response is simply:
{'ordered_ranking': [3, 5, 1, 4, 2]}
So the LLM ranked the Sony LED TV (i.e. item number "3" in the list) and the BenQ PC Monitor (i.e. item number "5") as the two most similar product candidates, i.e. the first two elements of the `ordered_ranking` list!
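Mapping the returned ordinal ranking back to the actual product IDs is then a simple, deterministic lookup on your side — no ID strings ever leave the model, so none can be hallucinated (variable names are illustrative; IDs taken from the prompt above):

```python
# Candidate product IDs, in the order they appeared in the prompt
candidate_ids = [
    "125GHJJJGH",    # NVIDIA RTX 4060 GPU
    "76876876GHJ",   # Sony Walkman
    "433FGHHGG",     # Sony LED TV 56"
    "777888887888",  # Blueray Sony Player
    "JGHHJGJ56",     # BenQ PC Monitor 37" 4K UHD
]
ordered_ranking = [3, 5, 1, 4, 2]  # as returned in ordered_ranking

# Ranks are 1-based positions into the candidate list, most similar first
ranked_ids = [candidate_ids[rank - 1] for rank in ordered_ranking]
```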
In this article we took a thorough deep-dive into structured outputs. We introduced the JSON schema and Pydantic models, and connected these to OpenAI's ChatCompletions API. We walked through a number of examples and showcased some optimal ways of resolving those using structured outputs. To summarise some key takeaways:
- Structured outputs as supported by the OpenAI API, and other third-party frameworks, implement only a subset of the JSON schema specification — being well informed about its features and limitations will help you make the right design decisions.
- Using Pydantic, or similar frameworks that track the JSON schema specification faithfully, is highly recommended, as it allows you to create valid and cleaner code.
- Whilst hallucinations are still expected, there are different ways of mitigating them simply through the choice of response schema design; for example, by utilising enums where appropriate.
About the Author
Armin Catovic is Secretary of the Board at Stockholm AI, and a Vice President and Senior ML/AI Engineer at the EQT Group, with 18 years of engineering experience across Australia, South-East Asia, Europe and the US, and a number of patents and top-tier peer-reviewed AI publications.