Structured generation is one of the most important steps in moving AI agents from demos to production systems. In real applications, an agent is not just writing text for a user; it is passing decisions, tool arguments, routing outputs, validation results, and workflow states to other parts of a software pipeline. In this article, we will look at how vLLM helps enforce this structure during generation.
Content
1. Motivation
2. Production Reality
3. Structured Generation in vLLM
3.1 JSON Schema-Constrained Generation
3.2 Pydantic Model → JSON Schema Conversion
1. Motivation
Why Structured Generation?
Imagine you built an AI customer support agent. A user sends: "I want to return my order #4821." Your agent needs to call an internal API to look up the order. That API expects a clean JSON payload:
{ "order_id": "4821", "action": "return", "reason": null }
But your LLM, without any constraints, might output:
Sure! I can help with that. Here is the return request:
{ "order_id": 4821, "action": "return", "reason": "not specified" }
Let me know if you need anything else!
Three problems in that one response:
- Extra text is wrapped around the JSON.
- order_id is a number instead of a string.
- reason is "not specified" instead of null.
Either your json.loads() call will crash, or your API will reject the payload.
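A quick sketch of how that response behaves in code, using only the standard library. The `try_parse` helper is a name made up for this illustration:

```python
import json

# The chatty response from the example above, as the model returned it.
raw_output = """Sure! I can help with that. Here is the return request:
{ "order_id": 4821, "action": "return", "reason": "not specified" }
Let me know if you need anything else!"""


def try_parse(text):
    """Return the parsed object, or None if the text is not pure JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The prose around the JSON makes the whole string unparseable.
parsed_raw = try_parse(raw_output)

# Even if we cut out just the JSON line, the types are still wrong.
json_line = raw_output.splitlines()[1]
parsed_line = try_parse(json_line)
order_id_is_string = isinstance(parsed_line["order_id"], str)
reason_is_null = parsed_line["reason"] is None

print(parsed_raw, order_id_is_string, reason_is_null)  # None False False
```
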
In a demo, you'd just fix this with a try/except and more prompting.
In production, the same issue can happen thousands of times a day across multiple agents, tools, and workflows. At that scale, even a 2% formatting failure rate is no longer a small bug. It becomes broken automations, failed handoffs, and real customer impact.
The core problem:
LLMs are probabilistic text generators. They predict the most likely next token; they do not inherently "know" that your downstream system needs a strictly typed JSON object. Even when the prompt spells out the JSON requirements, the model might still fail to produce the exact required format.
The solution: Structured generation
Structured generation guides the model to produce outputs that follow a predefined format, such as JSON, a schema, or a set of allowed choices, so the response is easier for your code to validate and use reliably.
2. Production Reality
Production AI agents operate in pipelines. The LLM output is almost never the final product; it is fed into databases, APIs, other models, or UI components. Each handoff requires the output to conform to a format. Structured generation is how you enforce that format at the generation level.
Here is what structured generation unlocks:
- Cleaner backend integration because the LLM output can map directly to typed application models, validation logic, APIs, and databases.
- Cleaner agent pipelines and more reliable agent handoffs because each step can pass structured data to the next step without relying on messy text interpretation.
- Fewer production failures because the model is constrained to return valid, expected outputs instead of unpredictable free text.
- Lower retry and repair cost because the system spends less time fixing bad outputs and more time executing the actual workflow.
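As a rough illustration of the first two points, here is a minimal sketch of a handoff in which schema-conforming output maps directly onto a typed object and then drives the next step. The `ReturnRequest` class and the `next_step` handler are hypothetical names invented for this example, not part of any library:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReturnRequest:
    """Typed application model mirroring the API payload from section 1."""
    order_id: str
    action: str
    reason: Optional[str]


def to_return_request(llm_output: str) -> ReturnRequest:
    """Parse schema-conforming LLM output straight into a typed object."""
    data = json.loads(llm_output)
    return ReturnRequest(**data)


def next_step(request: ReturnRequest) -> str:
    """Hypothetical downstream handler that picks the next workflow step."""
    if request.action == "return":
        return f"lookup_order:{request.order_id}"
    return "escalate_to_human"


# Output in the exact shape the API expects: no regexes, no text cleanup.
llm_output = '{ "order_id": "4821", "action": "return", "reason": null }'
request = to_return_request(llm_output)
step = next_step(request)
print(step)  # lookup_order:4821
```

Because the keys and types are known in advance, the handoff is a plain constructor call rather than a text-interpretation problem.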
3. Structured Generation in vLLM
Introduction
vLLM is mainly known as a high-throughput inference and serving engine for LLMs, but it also provides built-in support for constraining model outputs into specific formats.
In vLLM, structured generation can be used in two common ways: through structured_outputs and StructuredOutputsParams for offline inference, or through response_format / extra_body={"structured_outputs": ...} when using the OpenAI-compatible API.
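The two paths can be sketched side by side as follows. This is a sketch under assumptions, not a definitive implementation: it assumes a vLLM release that exposes `StructuredOutputsParams` in `vllm.sampling_params` and accepts a `structured_outputs` key in `extra_body`, as described above. The function names, the prompt wording, and the `URGENCY_LABELS` list are made up for illustration, and the offline function needs a GPU, so it is defined here but not called.

```python
def build_choice_body(labels):
    """Build the extra_body payload for the OpenAI-compatible API."""
    return {"structured_outputs": {"choice": list(labels)}}


def classify_online(client, model, text, labels):
    """Online path: `client` is an openai.OpenAI instance pointed at a vLLM server."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Classify the urgency: {text}"}],
        extra_body=build_choice_body(labels),
    )
    return response.choices[0].message.content


def classify_offline(model, text, labels):
    """Offline path: runs the model in-process, so it needs vLLM and a GPU."""
    # Imported lazily so this file can be loaded on machines without vLLM.
    from vllm import LLM, SamplingParams
    from vllm.sampling_params import StructuredOutputsParams

    params = SamplingParams(
        structured_outputs=StructuredOutputsParams(choice=list(labels)),
        max_tokens=8,
    )
    llm = LLM(model=model)
    outputs = llm.generate(f"Classify the urgency: {text}", sampling_params=params)
    return outputs[0].outputs[0].text


URGENCY_LABELS = ["low", "medium", "high", "critical"]
body = build_choice_body(URGENCY_LABELS)
print(body)
```

The rest of this article uses the online path, since it keeps the model loaded in a separate server process.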
Setup
pip install -U openai vllm
If you get a NumPy error mentioning Inf, pin NumPy below 2:
pip install "numpy<2"
Run the following command in the terminal to locally host the model with vLLM.
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--dtype auto \
--gpu-memory-utilization 0.85 \
--max-model-len 4096 \
--enable-auto-tool-choice \
--tool-call-parser hermes
vllm serve starts vLLM as a model server. Instead of loading the model inside every notebook run, the model is loaded once in a terminal and kept running. Your notebook code then sends requests to this local server, just like it would send requests to the OpenAI API.
vLLM exposes an OpenAI-compatible API server, so the normal openai Python client can call it by changing only the base_url to http://localhost:8000/v1
Qwen/Qwen2.5-1.5B-Instruct
This is the Hugging Face model that vLLM will download/load and serve.
Run the following code. If this prints the model name, vLLM is running correctly.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
)

print(client.models.list().data[0].id)
3.1 JSON Schema-Constrained Generation
The most common use case:
You define a JSON Schema, and vLLM constrains decoding so the generated text follows that schema.
Let's understand this with an example:
Production Use Case: Support Ticket Triage System
from openai import OpenAI
import json
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY"
)
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
triage_schema = {
    "type": "object",
    "properties": {
        "urgency": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "category": {"type": "string", "enum": ["billing", "technical", "returns", "general"]},
        "customer_id": {"type": ["string", "null"]},
        "summary": {"type": "string", "maxLength": 200}
    },
    "required": ["urgency", "category", "summary"],
    "additionalProperties": False
}
email_text = """
Customer #C-4821 says: My payment was charged twice yesterday
and I still haven't received any confirmation. This is urgent!
"""
prompt = f"""Analyze this support email and return only JSON.
Email:
{email_text}
"""
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "support_triage",
            "schema": triage_schema
        }
    }
)
text = response.choices[0].message.content
print(f"text: {text}")
triage = json.loads(text)
print(f"triage: {triage}")
print(f"Urgency: {triage['urgency']}")
print(f"Category: {triage['category']}")
print(f"Summary: {triage['summary']}")
Expected output (the exact summary wording may vary between runs):
text: {"urgency":"high","category":"billing","customer_id":"C-4821","summary":"Customer was charged twice yesterday and has not received confirmation."}
triage: {'urgency': 'high', 'category': 'billing', 'customer_id': 'C-4821', 'summary': 'Customer was charged twice yesterday and has not received confirmation.'}
Urgency: high
Category: billing
Summary: Customer was charged twice yesterday and has not received confirmation.
The important part and what it does:
response_format={
    "type": "json_schema",
    "json_schema": {
        "name": "support_triage",
        "schema": triage_schema
    }
}
vLLM uses a structured-output backend such as xgrammar or guidance to constrain decoding. At each generation step, invalid next tokens are masked from the model's logits before sampling.
This makes the model generate text that follows the required structure, such as a JSON schema, regex, choice list, or grammar.
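To make the masking idea concrete, here is a toy sketch in plain Python. It is not how xgrammar or guidance are implemented: real backends track grammar state over the tokenizer's full vocabulary. The six-word vocabulary and the logit values are invented for illustration.

```python
import math


def masked_softmax(logits, allowed):
    """Softmax over `logits` after setting disallowed token ids to -inf."""
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary and made-up logits for the position of the "urgency" value.
vocab = ["Sure", "low", "medium", "high", "critical", "urgent"]
logits = [2.5, 0.1, 0.4, 1.8, 1.2, 2.0]

# The schema's enum decides which tokens are legal at this position.
enum_values = {"low", "medium", "high", "critical"}
allowed = {i for i, tok in enumerate(vocab) if tok in enum_values}

unconstrained_best = vocab[logits.index(max(logits))]
probs = masked_softmax(logits, allowed)
constrained_best = vocab[probs.index(max(probs))]

print(unconstrained_best, constrained_best)  # Sure high
```

Tokens outside the allowed set end up with probability exactly zero, so they cannot be sampled at any temperature.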
Important nuance:
Structured generation only guarantees structural validity, not that the extracted values are semantically correct.
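One way to catch this kind of error is a cheap semantic check after parsing. The sketch below verifies that an extracted customer_id actually appears in the email. The function name and the `C-<digits>` ID pattern are assumptions based on the example email, not a general rule:

```python
import re

email_text = """
Customer #C-4821 says: My payment was charged twice yesterday
and I still haven't received any confirmation. This is urgent!
"""


def customer_id_is_grounded(triage: dict, email: str) -> bool:
    """Return True if the extracted customer_id is supported by the email."""
    customer_id = triage.get("customer_id")
    if customer_id is None:
        # Null is only acceptable when the email contains no ID-like token.
        return re.search(r"C-\d+", email) is None
    return customer_id in email


good = {"urgency": "high", "category": "billing", "customer_id": "C-4821", "summary": "..."}
hallucinated = {"urgency": "high", "category": "billing", "customer_id": "C-9999", "summary": "..."}
missed = {"urgency": "high", "category": "billing", "customer_id": None, "summary": "..."}

results = [customer_id_is_grounded(t, email_text) for t in (good, hallucinated, missed)]
print(results)  # [True, False, False]
```

All three objects are valid against the schema; only the first is correct for this email.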
3.2 Pydantic Model → JSON Schema Conversion
Writing raw JSON Schema objects by hand is tedious and error-prone. It is often easier to define the structure as a Pydantic model and generate the schema from it.
from openai import OpenAI
import json
from pydantic import BaseModel, Field, ConfigDict
from typing import Literal, Optional
client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY"
)
MODEL = "Qwen/Qwen2.5-1.5B-Instruct"
# 1. Define output structure using Pydantic
class TriageOutput(BaseModel):
    model_config = ConfigDict(extra="forbid")

    urgency: Literal["low", "medium", "high", "critical"]
    category: Literal["billing", "technical", "returns", "general"]
    customer_id: Optional[str] = None
    summary: str = Field(max_length=200)

# 2. Convert Pydantic model to JSON Schema
triage_schema = TriageOutput.model_json_schema()
email_text = """
Customer #C-4821 says: My payment was charged twice yesterday
and I still haven't received any confirmation. This is urgent!
"""
prompt = f"""Analyze this support email and return only JSON.
Email:
{email_text}
"""
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "support_triage",
            "schema": triage_schema
        }
    }
)
text = response.choices[0].message.content
print(f"text: {text}")
triage = json.loads(text)
print(f"triage: {triage}")
print(f"Urgency: {triage['urgency']}")
print(f"Category: {triage['category']}")
print(f"Summary: {triage['summary']}")
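Since the schema comes from a Pydantic model, the same model can also parse the response back into a typed object instead of a plain dict. The sketch below assumes Pydantic v2 and uses a hard-coded `sample_text` in place of a live server response, so it runs without vLLM:

```python
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class TriageOutput(BaseModel):
    model_config = ConfigDict(extra="forbid")

    urgency: Literal["low", "medium", "high", "critical"]
    category: Literal["billing", "technical", "returns", "general"]
    customer_id: Optional[str] = None
    summary: str = Field(max_length=200)


# The generated schema carries the same constraints as the hand-written one.
schema = TriageOutput.model_json_schema()
print(schema["required"], schema["additionalProperties"])

# Stand-in for response.choices[0].message.content from the server.
sample_text = (
    '{"urgency":"high","category":"billing","customer_id":"C-4821",'
    '"summary":"Customer was charged twice yesterday and has not received confirmation."}'
)
triage = TriageOutput.model_validate_json(sample_text)
print(triage.urgency, triage.customer_id)

# A value outside the Literal is rejected, which is useful as a safety net
# for output that did not come from constrained decoding.
try:
    TriageOutput.model_validate_json('{"urgency":"urgent","category":"billing","summary":"x"}')
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

With attribute access such as `triage.urgency`, a typo in a field name fails loudly instead of raising a KeyError deep inside the pipeline.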
Conclusion
vLLM also provides more structured generation options, such as Regex-Constrained Generation, Grammar-Constrained Generation, and Custom Logits Processors.
A more extended version of this article, which covers these topics along with examples of tool-calling and routing agents, is available in my Medium article published in Towards AI.
Link to the article: AI Agents in Production: Why Structured Generation Matters More Than Prompt Engineering
I hope this article was useful in showing why structured generation is not just a formatting trick, but a practical requirement for production AI agents. When agents are part of real software pipelines, their outputs must be predictable, valid, and easy for downstream systems to use.