If you have ever shipped an LLM feature to production, you know the quiet terror of json.loads(). The model returns almost valid JSON — a stray comma here, a chatty "Sure, here's your data!" preamble there, a field renamed on a whim — and your parser explodes. For years the fix was a grab-bag of regex, retries, and prayer. Structured Outputs ends that era. It guarantees the model's response conforms to a JSON Schema you define, not approximately, but exactly.
This guide is a practical playbook for using Structured Outputs well: when to reach for it, how the two API modes differ, the schema rules that trip everyone up, and how the same idea works outside OpenAI's ecosystem.
What "guaranteed" actually means
OpenAI introduced Structured Outputs in the API on August 6, 2024. The promise is unusually strong for a probabilistic system: when you supply a schema with strict: true, model outputs will match that schema. Not "usually." Not "after a retry." Exactly.
The numbers behind the claim are worth internalizing. On OpenAI's eval of complex JSON schema following, gpt-4o-2024-08-06 with Structured Outputs scores a perfect 100%. The older gpt-4-0613, prompted to produce the same JSON, scores less than 40%. Even the new model on its own — trained specifically to understand schemas — only reached 93%. The last seven points came not from better training but from a hard engineering guarantee. We'll get to how that works.
The key distinction: this is not the old JSON mode. JSON mode (type: "json_object") only guarantees the output is syntactically valid JSON. It says nothing about which fields appear or what types they hold. Structured Outputs guarantees the actual shape. Treat JSON mode as legacy and reach for schema-based Structured Outputs by default.
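To make the difference concrete, here is a minimal sketch in plain Python. The reply string and the field names are made up for illustration. It shows that a reply can pass json.loads(), which is all JSON mode promises, and still miss the shape you asked for.

```python
import json

# Hypothetical model reply: syntactically valid JSON, but not the requested shape.
reply = '{"task": "Email the vendor", "deadline": "Friday"}'

# The flat shape we actually wanted (illustrative field names).
EXPECTED_FIELDS = {"description": str, "owner": str}


def shape_errors(data: dict, expected: dict) -> list[str]:
    """List the ways `data` deviates from the expected flat shape."""
    errors = []
    for name, typ in expected.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], typ):
            errors.append(f"wrong type for {name}")
    for name in data:
        if name not in expected:
            errors.append(f"unexpected field: {name}")
    return errors


parsed = json.loads(reply)  # succeeds: the text is valid JSON
problems = shape_errors(parsed, EXPECTED_FIELDS)
print(problems)
```

Every entry in problems is a failure that JSON mode would let through and that a strict schema rules out.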
The two modes, and when to use each
Structured Outputs ships in two forms. Picking the right one matters.
1. Function calling (tools). Set strict: true inside a function definition. Use this when the model is deciding whether and how to call a tool — querying a database, hitting an API, taking an action in an agentic workflow. This mode works with every model that supports tools, going back to gpt-4-0613 and gpt-3.5-turbo-0613.
{
  "type": "function",
  "function": {
    "name": "query_orders",
    "strict": true,
    "parameters": {
      "type": "object",
      "properties": {
        "status": { "type": "string", "enum": ["fulfilled", "shipped", "canceled"] },
        "limit": { "type": "integer" }
      },
      "required": ["status", "limit"],
      "additionalProperties": false
    }
  }
}
2. Response format (response_format with json_schema). Use this when the model is answering the user directly in a structured shape — extracting fields from a document, returning a UI tree, separating reasoning from a final answer. This mode requires a newer model: gpt-4o-2024-08-06 or gpt-4o-mini-2024-07-18 (and fine-tunes built on them).
{
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "action_items",
      "strict": true,
      "schema": { "...": "your schema here" }
    }
  }
}
Rule of thumb: if the model is acting, use function calling. If the model is reporting, use response_format.
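Because both modes wrap the same kind of schema, one way to keep them in sync is to build each wrapper from a single schema object. The helper names as_tool and as_response_format below are hypothetical; the dictionaries they return mirror the two JSON shapes shown above.

```python
def as_tool(name: str, schema: dict) -> dict:
    """Wrap a schema as a strict function tool (the model is acting)."""
    return {
        "type": "function",
        "function": {"name": name, "strict": True, "parameters": schema},
    }


def as_response_format(name: str, schema: dict) -> dict:
    """Wrap a schema as a strict response format (the model is reporting)."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "strict": True, "schema": schema},
    }


# One schema, reused for both modes.
order_query = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["fulfilled", "shipped", "canceled"]},
        "limit": {"type": "integer"},
    },
    "required": ["status", "limit"],
    "additionalProperties": False,
}

tool = as_tool("query_orders", order_query)
response_format = as_response_format("order_query", order_query)
```

In a Chat Completions request, the first value would go in the tools list and the second in the response_format parameter.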
The schema rules everyone gets wrong
Structured Outputs supports a subset of JSON Schema, and the constraints are strict enough that copy-pasting an arbitrary schema usually fails. Three rules cause the vast majority of errors.
Every object must set additionalProperties: false. This is non-negotiable in strict mode, and it must appear on every nested object, not just the root. It's what stops the model from inventing extra keys.
Every field must be listed in required. Strict mode does not support "optional" in the usual JSON Schema sense. If a field can be absent, you don't mark it optional — you make it nullable by giving it a union type:
"due_date": {
"type": ["string", "null"],
"description": "Due date, or null if not specified."
}
This is the single most common gotcha. You want optionality; the schema gives you nullability. Design around it: the field is always present, but its value may be null.
Use the supported type set. You get string, number, integer, boolean, object, array, and null, plus anyOf for unions and enum for closed value sets. Lean on enum aggressively — it's the cleanest way to keep the model inside a known vocabulary.
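The first two rules are mechanical enough to apply automatically. The sketch below is one possible converter, not an official tool: to_strict and make_nullable are hypothetical helpers, and they only handle schemas built from plain type, properties, items, and enum keywords.

```python
import copy


def make_nullable(sub: dict) -> dict:
    """Widen a field so that null is an accepted value."""
    t = sub.get("type")
    if isinstance(t, str) and t != "null":
        sub["type"] = [t, "null"]
    elif isinstance(t, list) and "null" not in t:
        sub["type"] = t + ["null"]
    if "enum" in sub and None not in sub["enum"]:
        sub["enum"] = sub["enum"] + [None]
    return sub


def to_strict(schema: dict) -> dict:
    """Return a copy of `schema` rewritten to follow the strict-mode rules."""
    schema = copy.deepcopy(schema)
    if schema.get("type") == "object":
        props = schema.get("properties", {})
        was_required = set(schema.get("required", []))
        for name in props:
            sub = to_strict(props[name])
            # Fields that used to be optional become required but nullable.
            props[name] = sub if name in was_required else make_nullable(sub)
        schema["required"] = list(props)
        schema["additionalProperties"] = False
    elif schema.get("type") == "array" and isinstance(schema.get("items"), dict):
        schema["items"] = to_strict(schema["items"])
    return schema


loose = {
    "type": "object",
    "properties": {
        "description": {"type": "string"},
        "due_date": {"type": "string"},
        "priority": {"type": "string", "enum": ["low", "high"]},
    },
    "required": ["description"],
}
strict = to_strict(loose)
print(strict["required"])
print(strict["properties"]["due_date"])
```

The input schema is left untouched, so the loose version can still be used wherever ordinary JSON Schema semantics are wanted.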
A clean, strict-compliant extraction schema looks like this:
{
  "type": "object",
  "properties": {
    "action_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "due_date": { "type": ["string", "null"] },
          "owner": { "type": ["string", "null"] }
        },
        "required": ["description", "due_date", "owner"],
        "additionalProperties": false
      }
    }
  },
  "required": ["action_items"],
  "additionalProperties": false
}
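Since the rules apply to every nested object, it can help to lint a schema locally before sending it. The function below is a rough sketch of such a check. The name strict_violations is hypothetical, and it covers only the additionalProperties and required rules, not the full supported subset.

```python
def strict_violations(schema: dict, path: str = "$") -> list[str]:
    """Walk a schema and report objects that break the two structural rules."""
    problems = []
    types = schema.get("type")
    is_object = (
        types == "object"
        or (isinstance(types, list) and "object" in types)
        or "properties" in schema
    )
    if is_object:
        props = schema.get("properties", {})
        if schema.get("additionalProperties") is not False:
            problems.append(f"{path}: additionalProperties must be false")
        missing = [k for k in props if k not in schema.get("required", [])]
        if missing:
            problems.append(f"{path}: not in required: {missing}")
        for name, sub in props.items():
            problems += strict_violations(sub, f"{path}.{name}")
    if isinstance(schema.get("items"), dict):
        problems += strict_violations(schema["items"], f"{path}[]")
    for i, option in enumerate(schema.get("anyOf", [])):
        problems += strict_violations(option, f"{path}.anyOf[{i}]")
    return problems


# A schema whose root is fine but whose nested object forgets both rules.
broken = {
    "type": "object",
    "properties": {
        "action_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"owner": {"type": "string"}}},
        }
    },
    "required": ["action_items"],
    "additionalProperties": False,
}
for line in strict_violations(broken):
    print(line)
```

Run against the extraction schema above, the same function returns an empty list.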
Let the SDK write the schema for you
Hand-writing JSON Schema is tedious and error-prone. Don't. The official Python and Node SDKs accept a Pydantic model or a Zod schema directly, convert it to a compliant JSON Schema, and deserialize the response back into your typed object automatically.
from pydantic import BaseModel
from openai import OpenAI


class Step(BaseModel):
    explanation: str
    output: str


class MathResponse(BaseModel):
    steps: list[Step]
    final_answer: str


client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are a helpful math tutor."},
        {"role": "user", "content": "solve 8x + 31 = 2"},
    ],
    response_format=MathResponse,
)

msg = completion.choices[0].message
if msg.parsed:
    # The SDK has already deserialized the reply into a MathResponse.
    print(msg.parsed.final_answer)
else:
    # A refusal arrives as a plain string instead of schema-shaped output.
    print(msg.refusal)
You get type safety end to end: your IDE knows the shape, the schema is generated from the same source of truth, and there's no second copy of the structure to drift out of sync.
How it works under the hood
Understanding the mechanism explains both the guarantee and its limits. OpenAI took a two-part approach.
First, they trained gpt-4o-2024-08-06 to understand complicated schemas. That alone got to 93% — good, not good enough. So they added a deterministic layer: constrained decoding.
Normally a model can sample any token from its vocabulary at each step, which is exactly why it's free to emit an invalid curly brace or a trailing comma. Constrained decoding converts your JSON Schema into a context-free grammar (CFG), and after each generated token the inference engine masks every token that the grammar says is invalid next. Invalid tokens effectively drop to zero probability. The model cannot emit a token the grammar forbids, so any response that runs to completion matches the schema.
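The masking step is easier to see on a toy example. The sketch below is not OpenAI's implementation. It assumes an eight-token vocabulary, a stand-in model that returns fixed scores, greedy selection instead of sampling, and a grammar so small it fits in a lookup table.

```python
import math

VOCAB = ["{", "}", '"ok"', ":", "true", "false", ",", "Sure!"]

# Toy grammar: the only valid outputs are {"ok":true} and {"ok":false}.
# Index = number of tokens emitted so far, value = legal next tokens.
LEGAL_NEXT = [{"{"}, {'"ok"'}, {":"}, {"true", "false"}, {"}"}]


def mask_logits(logits: list[float], position: int) -> list[float]:
    """Push every token the grammar forbids at this position to -infinity."""
    legal = LEGAL_NEXT[position]
    return [x if tok in legal else -math.inf for tok, x in zip(VOCAB, logits)]


def decode(model) -> str:
    """Greedy decoding with the grammar mask applied at every step."""
    out = []
    for position in range(len(LEGAL_NEXT)):
        logits = mask_logits(model(out), position)
        best = max(range(len(VOCAB)), key=lambda i: logits[i])
        out.append(VOCAB[best])
    return "".join(out)


def chatty_model(prefix: list[str]) -> list[float]:
    """Stand-in model that always prefers the chatty preamble token."""
    return [0.1, 0.1, 0.1, 0.1, 2.0, 1.0, 0.5, 5.0]


print(decode(chatty_model))
```

Without the mask, this model would pick "Sure!" at every step. With it, the preamble token is never eligible, and the model's preferences only matter where the grammar leaves a real choice, here between true and false.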
Why a CFG and not a regex or finite-state machine? Because CFGs express a broader class of languages. FSMs can't generally handle recursion, so they struggle with deeply nested or self-referential structures (think a UI tree where every node can contain more nodes). The CFG approach handles recursive $ref schemas that FSM-based tools choke on.
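A recursive schema shows why nesting needs more than a fixed set of states. In the sketch below, UI_NODE is an illustrative schema that points back at its own root with "$ref": "#", and is_ui_node is a hand-written checker for that one schema. The checker has to call itself once per level of nesting, which is the unbounded depth a finite-state machine cannot track in general.

```python
# A recursive schema: every node may contain more nodes of the same shape.
UI_NODE = {
    "type": "object",
    "properties": {
        "tag": {"type": "string"},
        "children": {"type": "array", "items": {"$ref": "#"}},
    },
    "required": ["tag", "children"],
    "additionalProperties": False,
}


def is_ui_node(value) -> bool:
    """Check a value against UI_NODE, following the "$ref" back to the root."""
    if not isinstance(value, dict) or set(value) != {"tag", "children"}:
        return False
    if not isinstance(value["tag"], str) or not isinstance(value["children"], list):
        return False
    # The recursive step: each child must itself be a full UI node.
    return all(is_ui_node(child) for child in value["children"])


tree = {
    "tag": "form",
    "children": [
        {"tag": "div", "children": [{"tag": "input", "children": []}]},
        {"tag": "button", "children": []},
    ],
}
print(is_ui_node(tree))
```

The call stack in is_ui_node plays the role that the grammar's stack plays during constrained decoding.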
There's one practical cost: the first request with a new schema pays a latency penalty while the grammar is compiled and cached. Typical schemas process in under 10 seconds; very complex ones can take up to a minute. Every subsequent request with that schema runs with no penalty.
Limits worth knowing before you ship
Structured Outputs is powerful, not magic. Keep these in mind:
- It guarantees structure, not correctness. The model can still put a wrong value in a correctly-shaped field — a botched arithmetic step, a misread date. Validate values, not just schema.
- It's incompatible with parallel function calls. If parallel tool calls fire, they may not match your schemas. Set parallel_tool_calls: false.
- A refusal breaks the shape on purpose. If the model refuses an unsafe request, you get a refusal string instead of schema-matching output. Always branch on it.
- Hitting max_tokens truncates the JSON. A stop condition mid-generation leaves you with invalid output. Check finish_reason.
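The refusal and truncation checks fit in one small guard around the parse step. The sketch below assumes the choice has been converted to a plain dict with the same field names as a Chat Completions choice. StructuredOutputError and extract_payload are hypothetical names.

```python
import json


class StructuredOutputError(Exception):
    """Raised when a reply cannot be trusted to match the schema."""


def extract_payload(choice: dict) -> dict:
    """Return the parsed JSON payload, or raise if the guarantee did not apply."""
    message = choice["message"]
    if message.get("refusal"):
        # Refusals replace the schema-shaped content entirely.
        raise StructuredOutputError(f"model refused: {message['refusal']}")
    if choice.get("finish_reason") != "stop":
        # For example "length" when max_tokens cut the JSON short.
        raise StructuredOutputError(
            f"generation ended early: {choice.get('finish_reason')}"
        )
    return json.loads(message["content"])


ok = {
    "finish_reason": "stop",
    "message": {"content": '{"action_items": []}', "refusal": None},
}
print(extract_payload(ok))
```

Value-level checks, such as confirming that a date parses or an amount is positive, belong after this guard, since a payload that passes it is well-shaped but not necessarily right.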
Beyond OpenAI
The pattern is now industry-wide, so you're not locked in. Anthropic achieves the same end through tool use: define a tool's input schema and the model fills it. xAI and Azure OpenAI expose compatible structured-output parameters. And the technique was pioneered in open source. Libraries like Outlines, Jsonformer, and Guidance brought constrained decoding to local and self-hosted models long before the big APIs shipped it, while Instructor popularized validating responses against Pydantic models. OpenAI explicitly credits them.
If you're running your own inference stack, a constrained-decoding library such as Outlines or Guidance can give you the same kind of guarantee against an open-weight model. Instructor gets there differently, by validating the output and retrying rather than masking tokens.
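Porting a schema to Anthropic's tool-use format is mostly a matter of renaming keys. The sketch below assumes the OpenAI-style function tool shown earlier. The helper name openai_tool_to_anthropic and the description text are illustrative.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Map an OpenAI-style function tool onto Anthropic's tool definition."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        "input_schema": fn["parameters"],  # same JSON Schema, different key
    }


openai_tool = {
    "type": "function",
    "function": {
        "name": "query_orders",
        "description": "Look up orders by status.",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {
                "status": {"type": "string", "enum": ["fulfilled", "shipped", "canceled"]},
                "limit": {"type": "integer"},
            },
            "required": ["status", "limit"],
            "additionalProperties": False,
        },
    },
}

anthropic_tool = openai_tool_to_anthropic(openai_tool)
# Forcing this specific tool makes the model answer through the schema.
tool_choice = {"type": "tool", "name": anthropic_tool["name"]}
print(anthropic_tool["name"], sorted(anthropic_tool))
```

The strict flag has no counterpart in this mapping and is dropped, so it is worth validating the returned tool input against the schema yourself.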
The bottom line
Structured Outputs turns the flakiest part of LLM engineering — getting reliable, parseable data out — into a solved problem. Reach for response_format when the model reports and function calling when it acts. Remember the two rules that cause most failures: set additionalProperties: false on every object, and express optionality as nullable fields in required. Let Pydantic or Zod generate the schema so you keep one source of truth. Then validate the values, because a perfectly-shaped object can still be wrong. Do that, and you can finally delete the regex.


