OpenAI recently announced support for Structured Outputs in its latest gpt-4o-2024-08-06 models. Structured outputs in relation to large language models (LLMs) are nothing new: developers have either used various prompt engineering techniques, or third-party tools.
In this article we will explain what structured outputs are, how they work, and how you can apply them in your own LLM-based applications. Although OpenAI's announcement makes them quite easy to implement using their APIs (as we will demonstrate here), you may want to instead opt for the open-source Outlines package (maintained by the lovely folks over at dottxt), since it can be applied both to self-hosted, open-weight models (e.g. Mistral and LLaMA), as well as to proprietary APIs. (Disclaimer: due to this issue, Outlines does not as of this writing support structured JSON generation via the OpenAI APIs; but that may change soon!)
If the RedPajama dataset is any indication, the overwhelming majority of pre-training data is human text. Therefore, "natural language" is the native domain of LLMs, both in the input and in the output. When we build applications, however, we would like to use machine-readable formal structures or schemas to encapsulate our data input/output. This way we build robustness and determinism into our applications.
Structured Outputs is a mechanism by which we enforce a pre-defined schema on the LLM output. This typically means that we enforce a JSON schema, but it is not limited to JSON only: we could in principle enforce XML, Markdown, or a completely custom schema. The benefits of Structured Outputs are two-fold:
- Simpler prompt design: we need not be overly verbose when specifying what the output should look like
- Deterministic names and types: we can guarantee to obtain, for example, an attribute `age` with a `Number` JSON type in the LLM response
For this example, we will use the first sentence from Sam Altman's Wikipedia entry…
Samuel Harris Altman (born April 22, 1985) is an American entrepreneur and investor best known as the CEO of OpenAI since 2019 (he was briefly fired and reinstated in November 2023).
…and we are going to use the latest GPT-4o checkpoint as a named-entity recognition (NER) system. We will enforce the following JSON schema:
```python
json_schema = {
    "name": "NamedEntities",
    "schema": {
        "type": "object",
        "properties": {
            "entities": {
                "type": "array",
                "description": "List of entity names and their corresponding types",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {
                            "type": "string",
                            "description": "The actual name as specified in the text, e.g. a person's name, or the name of the country"
                        },
                        "type": {
                            "type": "string",
                            "description": "The entity type, such as 'Person' or 'Organization'",
                            "enum": ["Person", "Organization", "Location", "DateTime"]
                        }
                    },
                    "required": ["name", "type"],
                    "additionalProperties": False
                }
            }
        },
        "required": ["entities"],
        "additionalProperties": False
    },
    "strict": True
}
```
In essence, our LLM response should contain a `NamedEntities` object, which consists of an array of `entities`, each containing a `name` and a `type`. There are a few things to note here. We can, for example, enforce an Enum type, which is very useful in NER since we can constrain the output to a fixed set of entity types. We must specify all the fields in the `required` array; however, we can also emulate "optional" fields by setting the type to e.g. `["string", null]`.
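As a small illustration of the "optional" field trick (the `nickname` property and the `type_ok` helper below are our own sketch, not from OpenAI's documentation), widening the JSON type to `["string", "null"]` lets the model return `null` rather than omit the key:

```python
# Illustrative fragment: emulating an "optional" field by widening its
# JSON type so that null is also an acceptable value.
nickname_property = {
    "nickname": {
        "type": ["string", "null"],
        "description": "Optional nickname; the model may return null",
    }
}

def type_ok(value, json_types):
    """Minimal check that a Python value matches one of the allowed JSON types."""
    mapping = {"string": str, "number": (int, float), "null": type(None)}
    return any(isinstance(value, mapping[t]) for t in json_types)

print(type_ok("Sam", ["string", "null"]))  # True
print(type_ok(None, ["string", "null"]))   # True: null is allowed too
```

Both a string and a `null` value pass the check, while any other type would be rejected by the schema.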
We can now pass our schema, together with the data and the instructions, to the API. We need to populate the `response_format` argument with a dict where we set `type` to `"json_schema"` and then supply the corresponding schema.
```python
from openai import OpenAI

client = OpenAI()

# The first sentence from Sam Altman's Wikipedia entry
s = ("Samuel Harris Altman (born April 22, 1985) is an American entrepreneur "
     "and investor best known as the CEO of OpenAI since 2019 (he was briefly "
     "fired and reinstated in November 2023).")

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {
            "role": "system",
            "content": """You are a Named Entity Recognition (NER) assistant.
Your job is to identify and return all entity names and their
types for a given piece of text. You are to strictly conform
only to the following entity types: Person, Location, Organization
and DateTime. If uncertain about entity type, please ignore it.
Be careful of certain acronyms, such as role titles "CEO", "CTO",
"VP", etc - these are to be ignored.""",
        },
        {
            "role": "user",
            "content": s,
        },
    ],
    response_format={
        "type": "json_schema",
        "json_schema": json_schema,
    },
)
```
The output should look something like this:
```python
{'entities': [{'name': 'Samuel Harris Altman', 'type': 'Person'},
              {'name': 'April 22, 1985', 'type': 'DateTime'},
              {'name': 'American', 'type': 'Location'},
              {'name': 'OpenAI', 'type': 'Organization'},
              {'name': '2019', 'type': 'DateTime'},
              {'name': 'November 2023', 'type': 'DateTime'}]}
```
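Because the schema is enforced, the message content is guaranteed to parse as JSON conforming to it, so downstream code can consume it without defensive checks. A minimal sketch (the hard-coded `raw` string stands in for `completion.choices[0].message.content`):

```python
import json

# With Structured Outputs, the response content is valid JSON matching
# the schema, so a plain json.loads() is all that is needed.
raw = '{"entities": [{"name": "OpenAI", "type": "Organization"}]}'
data = json.loads(raw)

# Every item is guaranteed to carry "name" and "type" keys.
orgs = [e["name"] for e in data["entities"] if e["type"] == "Organization"]
print(orgs)  # ['OpenAI']
```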
The full source code used in this article is available here.
The magic is in the combination of constrained sampling and context-free grammar (CFG). We mentioned previously that the overwhelming majority of pre-training data is "natural language". Statistically this means that for every decoding/sampling step, there is a non-negligible probability of sampling some arbitrary token from the learned vocabulary (and in modern LLMs, vocabularies typically stretch across 40,000+ tokens). However, when dealing with formal schemas, we would want to rapidly eliminate all improbable tokens.
In the previous example, if we have already generated…
```python
{'entities': [{'name': 'Samuel Harris Altman',
```
…then ideally we would like to place a very high logit bias on the `'typ` token in the next decoding step, and a very low probability on all the other tokens in the vocabulary.
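A toy sketch of this idea (our own illustration, not OpenAI's actual implementation): constrained decoding masks out every token that cannot legally follow the text generated so far, then renormalizes the probabilities over the survivors.

```python
import math

# Tiny made-up vocabulary and raw model scores for the next decoding step.
vocab = ["'type'", "'name'", "42", "hello", "{"]
logits = [1.2, 0.8, 0.5, 2.0, -0.3]

# Suppose the grammar says only "'type'" may come next: every other
# token's logit is pushed to -inf before the softmax.
allowed = {"'type'"}
masked = [l if t in allowed else -math.inf for t, l in zip(vocab, logits)]

exps = [math.exp(l) for l in masked]   # exp(-inf) == 0.0
probs = [e / sum(exps) for e in exps]
print(probs)  # all probability mass lands on "'type'"
```

Note that `hello` had the highest raw score, yet after masking it receives zero probability: the schema, not the model's preference, decides what is sampleable.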
This is in essence what happens. When we supply the schema, it gets converted into a formal grammar, or CFG, which serves to guide the logit bias values during the decoding step. CFG is one of those old-school computer science and natural language processing (NLP) mechanisms that is making a comeback. A very nice introduction to CFG was actually presented in this StackOverflow answer, but essentially it is a way of describing transformation rules for a collection of symbols.
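To make this concrete, here is a toy CFG for a tiny JSON-like subset (the grammar and helper are our own illustration): each rule maps a symbol to the sequences of symbols it may expand to, and from the rules we can compute which terminals are allowed to appear next.

```python
# A toy context-free grammar: non-terminals map to their productions;
# anything not appearing as a key is treated as a terminal symbol.
grammar = {
    "value": [["object"], ["string"], ["number"]],
    "object": [["{", "pair", "}"]],
    "pair": [["string", ":", "value"]],
}

def first_terminals(symbol, grammar):
    """Terminals that can begin an expansion of `symbol`."""
    if symbol not in grammar:  # terminal: it begins itself
        return {symbol}
    out = set()
    for production in grammar[symbol]:
        out |= first_terminals(production[0], grammar)
    return out

# If we are about to emit a "value", only these symbols may start it:
print(first_terminals("value", grammar))
```

It is exactly this kind of "what may come next" computation that lets the decoder zero out every vocabulary token that would violate the schema.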
Structured Outputs are nothing new, but are definitely becoming top-of-mind with proprietary APIs and LLM services. They provide a bridge between the erratic and unpredictable "natural language" domain of LLMs, and the deterministic and structured domain of software engineering. Structured Outputs are essentially a must for anyone designing complex LLM applications where LLM outputs need to be shared or "presented" in various components. While API-native support has finally arrived, builders should also consider using libraries such as Outlines, as they provide an LLM/API-agnostic way of dealing with structured output.