
Shakti Wadekar

AI Agents in Production: Structured Generation for AI Workflows

Structured generation is one of the most important steps in moving AI agents from demos to production systems. In real applications, an agent is not just writing text for a user; it is passing decisions, tool arguments, routing outputs, validation results, and workflow states to other parts of a software pipeline. In this article, we will look at how vLLM helps enforce this structure during generation.


📚 Content

🚀 1. Motivation

🏭 2. Production Reality

⚙️ 3. Structured Generation in vLLM

🧩 3.1 JSON Schema-Constrained Generation
🏗️ 3.2 Pydantic Model → JSON Schema Conversion

🚀 1. Motivation

Why Structured Generation?

Imagine you built an AI customer support agent. A user sends: “I want to return my order #4821.” Your agent needs to call an internal API to look up the order. That API expects a clean JSON payload:

{ "order_id": "4821", "action": "return", "reason": null }

But your LLM, without any constraints, might output:

Sure! I can help with that. Here is the return request:
{ "order_id": 4821, "action": "return", "reason": "not specified" }
Let me know if you need anything else!

Three problems in that one response:

  1. Extra text is wrapped around the JSON.
  2. order_id is a number instead of a string.
  3. reason is "not specified" instead of null.

As a result, json.loads() will crash on the raw reply, and even if you extract the JSON yourself, your API will reject the payload.
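The problems with that reply can be reproduced with the standard library alone. The sketch below feeds the unconstrained reply from the example into json.loads() and then inspects the hand-extracted payload; the variable names are illustrative only.

```python
import json

# The unconstrained reply from the example above, verbatim.
raw_reply = """Sure! I can help with that. Here is the return request:
{ "order_id": 4821, "action": "return", "reason": "not specified" }
Let me know if you need anything else!"""

# Problem 1: the surrounding prose makes the reply invalid JSON.
try:
    json.loads(raw_reply)
    parsed_directly = True
except json.JSONDecodeError:
    parsed_directly = False

# Even after cutting out the braces by hand, the payload is still wrong.
start, end = raw_reply.index("{"), raw_reply.rindex("}") + 1
payload = json.loads(raw_reply[start:end])

# Problem 2: order_id came back as an int, but the API expects a string.
order_id_is_string = isinstance(payload["order_id"], str)
# Problem 3: reason is a filler string instead of null (None in Python).
reason_is_null = payload["reason"] is None

print(parsed_directly, order_id_is_string, reason_is_null)  # prints: False False False
```

Slicing between the first and last brace is exactly the kind of brittle repair code that structured generation is meant to make unnecessary.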

In a demo, you’d just patch this with a try/except and more prompting.

In production, the same issue can happen thousands of times a day across multiple agents, tools, and workflows. At that scale, even a 2% formatting failure rate is no longer a small bug. It becomes broken automations, failed handoffs, and real customer impact.

The core problem:

LLMs are probabilistic text generators. They predict the most likely next token; they do not inherently “know” that your downstream system needs a strictly typed JSON object. Even when the prompt spells out the JSON requirements, the model may still fail to produce the exact required format.

The solution: Structured generation

Structured generation guides the model to produce outputs that follow a predefined format, such as JSON, a schema, or a set of allowed choices, so the response is easier for your code to validate and use reliably.


🏭 2. Production Reality

Production AI agents operate in pipelines. The LLM output is almost never the final product: it is fed into databases, APIs, other models, or UI components. Each handoff requires the output to conform to a format, and structured generation is how you enforce that format at the generation level.

Here is what structured generation unlocks:

  1. Cleaner backend integration because the LLM output can map directly to typed application models, validation logic, APIs, and databases.
  2. Cleaner agent pipelines and more reliable agent handoffs because each step can pass structured data to the next step without relying on messy text interpretation.
  3. Fewer production failures because the model is constrained to return valid, expected outputs instead of unpredictable free text.
  4. Lower retry and repair cost because the system spends less time fixing bad outputs and more time executing the actual workflow.

⚙️ 3. Structured Generation in vLLM

Introduction

vLLM is mainly known as a high-throughput inference and serving engine for LLMs, but it also provides built-in support for constraining model outputs into specific formats.

In vLLM, structured generation can be used in two common ways: through structured_outputs and StructuredOutputsParams for offline inference, or through response_format / extra_body={"structured_outputs": ...} when using the OpenAI-compatible API.
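As a rough illustration of the extra_body route, the sketch below only builds the keyword arguments for a chat completion request that restricts the reply to a fixed list of labels. The helper name build_choice_request and the prompt wording are hypothetical, and the "choice" key under "structured_outputs" is assumed to match the vLLM version you have installed.

```python
from typing import Any


def build_choice_request(model: str, email_text: str) -> dict[str, Any]:
    """Build kwargs for client.chat.completions.create() that restrict the
    reply to one of a fixed set of labels."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": f"Classify the urgency of this email:\n{email_text}",
            }
        ],
        # The openai client forwards extra_body unchanged; vLLM reads the
        # "structured_outputs" key from it (assumed field name, see lead-in).
        "extra_body": {
            "structured_outputs": {"choice": ["low", "medium", "high", "critical"]}
        },
    }


request = build_choice_request(
    "Qwen/Qwen2.5-1.5B-Instruct", "My payment was charged twice!"
)
print(request["extra_body"])
```

With a vLLM server running, passing these arguments as client.chat.completions.create(**request) should yield one of the four labels as the message content.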

Setup

pip install -U openai vllm

If you get the NumPy Inf error, run:

pip install "numpy<2"

Run the following command in the terminal to locally host the model with vLLM.

vllm serve Qwen/Qwen2.5-1.5B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --max-model-len 4096 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

vllm serve starts vLLM as a model server. Instead of loading the model inside every notebook run, the model is loaded once in a terminal and kept running. Your notebook code then sends requests to this local server, just like it would send requests to the OpenAI API.

vLLM exposes an OpenAI-compatible API server, so the normal openai Python client can call it by changing only the base_url to http://localhost:8000/v1.

Qwen/Qwen2.5-1.5B-Instruct is the Hugging Face model that vLLM will download, load, and serve.

Run the following code. If this prints the model name, vLLM is running correctly.

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="unused",
)
print(client.models.list().data[0].id)
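Loading the model can take a while, so a script that starts right after vllm serve may hit a connection error. One possible workaround, sketched here with only the standard library, is to poll the same /v1/models endpoint until it responds. The helper name wait_for_server and its default timings are illustrative assumptions.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 120.0, poll_s: float = 2.0) -> bool:
    """Poll GET {base_url}/models until it answers 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/models", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(poll_s)
    return False
```

Calling wait_for_server("http://localhost:8000/v1") before creating the client returns True once the server is reachable and False if the timeout expires first.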

🧩 3.1 JSON Schema-Constrained Generation

The most common use case:

You define a JSON Schema, and vLLM constrains decoding so the generated text follows that schema.

Let’s understand this with an example:

Production Use Case: Support Ticket Triage System

from openai import OpenAI
import json

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY"
)

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

triage_schema = {
    "type": "object",
    "properties": {
        "urgency": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
        "category": {"type": "string", "enum": ["billing", "technical", "returns", "general"]},
        "customer_id": {"type": ["string", "null"]},
        "summary": {"type": "string", "maxLength": 200}
    },
    "required": ["urgency", "category", "summary"],
    "additionalProperties": False
}

email_text = """
Customer #C-4821 says: My payment was charged twice yesterday
and I still haven't received any confirmation. This is urgent!
"""

prompt = f"""Analyze this support email and return only JSON.

Email:
{email_text}
"""

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "support_triage",
            "schema": triage_schema
        }
    }
)

text = response.choices[0].message.content
print(f"text: {text}")

triage = json.loads(text)
print(f"triage: {triage}")

print(f"Urgency: {triage['urgency']}")
print(f"Category: {triage['category']}")
print(f"Summary: {triage['summary']}")
Expected Output:

text: {"urgency":"high","category":"billing","customer_id":"C-4821","summary":"Customer was charged twice yesterday and has not received confirmation."}

triage: {'urgency': 'high', 'category': 'billing', 'customer_id': 'C-4821', 'summary': 'Customer was charged twice yesterday and has not received confirmation.'}

Urgency: high
Category: billing
Summary: Customer was charged twice yesterday and has not received confirmation.

The important part, and what it does:

response_format={
    "type": "json_schema",
    "json_schema": {
        "name": "support_triage",
        "schema": triage_schema
    }
}

vLLM uses a structured-output backend such as xgrammar or guidance to constrain decoding.

At each generation step, tokens that would violate the structure are masked out of the model’s logits before sampling, so they can never be chosen.

This makes the model generate text that follows the required structure, such as a JSON schema, regex, choice list, or grammar.
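To make the masking idea concrete, here is a toy sketch in plain Python. It is not vLLM's implementation: real backends work on subword token IDs and compile the schema into a grammar, while this example uses a made-up five-token vocabulary and a hand-written set of allowed tokens.

```python
import math


def mask_logits(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Set the logit of every disallowed token to -inf so it can never be sampled."""
    return {tok: (score if tok in allowed else -math.inf) for tok, score in logits.items()}


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Turn logits into probabilities; -inf logits become probability 0."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Toy next-token scores right after the model has emitted: {"urgency":
logits = {"Sure": 3.1, '"high"': 2.4, '"urgent"': 2.0, '"low"': 0.3, "\n": 1.2}
# Hypothetical grammar state: only the schema's enum values are legal here.
allowed = {'"low"', '"medium"', '"high"', '"critical"'}

probs = softmax(mask_logits(logits, allowed))
best = max(probs, key=probs.get)
print(best)  # prints: "high"
```

Without the mask, the highest-scoring token is the chatty "Sure"; with it, only enum values keep non-zero probability.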

Important nuance:

Structured generation only guarantees structural validity, not that the extracted values are semantically correct.
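A small sketch of what that means in practice: the checks below run on output that already passes the triage schema and look for problems the schema cannot express. The function name semantic_issues and both rules are hypothetical examples, not part of vLLM.

```python
import json


def semantic_issues(triage: dict, email_text: str) -> list[str]:
    """Return human-readable problems that a JSON Schema cannot catch."""
    issues = []
    customer_id = triage.get("customer_id")
    # A schema-valid ID can still be hallucinated; require it to appear in the email.
    if customer_id is not None and customer_id not in email_text:
        issues.append(f"customer_id {customer_id!r} not found in email")
    # Hypothetical business rule: an email that says "urgent" should not be triaged as low.
    if "urgent" in email_text.lower() and triage["urgency"] == "low":
        issues.append("email mentions 'urgent' but urgency is 'low'")
    return issues


email_text = "Customer #C-4821 says: My payment was charged twice yesterday. This is urgent!"
# Both payloads satisfy the schema; only the second one matches the email.
bad = json.loads('{"urgency": "low", "category": "billing", "customer_id": "C-9999", "summary": "Double charge."}')
good = json.loads('{"urgency": "high", "category": "billing", "customer_id": "C-4821", "summary": "Double charge."}')

print(semantic_issues(bad, email_text))
print(semantic_issues(good, email_text))  # prints: []
```

Checks like these belong after parsing, alongside whatever business validation the downstream system already performs.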


🏗️ 3.2 Pydantic Model → JSON Schema Conversion

Writing raw JSON Schema objects by hand is tedious and error-prone. Instead, define the output as a Pydantic model and let Pydantic generate the schema.

from openai import OpenAI
import json

from pydantic import BaseModel, Field, ConfigDict
from typing import Literal, Optional

client = OpenAI(
    base_url="http://127.0.0.1:8000/v1",
    api_key="EMPTY"
)

MODEL = "Qwen/Qwen2.5-1.5B-Instruct"

# 1. Define output structure using Pydantic
class TriageOutput(BaseModel):
    model_config = ConfigDict(extra="forbid")

    urgency: Literal["low", "medium", "high", "critical"]
    category: Literal["billing", "technical", "returns", "general"]
    customer_id: Optional[str] = None
    summary: str = Field(max_length=200)

# 2. Convert Pydantic model to JSON Schema
triage_schema = TriageOutput.model_json_schema()

email_text = """
Customer #C-4821 says: My payment was charged twice yesterday
and I still haven't received any confirmation. This is urgent!
"""

prompt = f"""Analyze this support email and return only JSON.

Email:
{email_text}
"""

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.1,
    max_tokens=256,
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "support_triage",
            "schema": triage_schema
        }
    }
)

text = response.choices[0].message.content
print(f"text: {text}")

triage = json.loads(text)
print(f"triage: {triage}")

print(f"Urgency: {triage['urgency']}")
print(f"Category: {triage['category']}")
print(f"Summary: {triage['summary']}")
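Since the model class already exists, the reply can also be parsed into a typed object rather than a plain dict. The sketch below repeats the TriageOutput definition so it runs on its own, uses a hard-coded string as a stand-in for the server reply, and introduces a hypothetical helper named parse_triage.

```python
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class TriageOutput(BaseModel):
    model_config = ConfigDict(extra="forbid")

    urgency: Literal["low", "medium", "high", "critical"]
    category: Literal["billing", "technical", "returns", "general"]
    customer_id: Optional[str] = None
    summary: str = Field(max_length=200)


def parse_triage(text: str) -> Optional[TriageOutput]:
    """Return a typed TriageOutput, or None if the text does not match the model."""
    try:
        return TriageOutput.model_validate_json(text)
    except ValidationError:
        return None


# Stand-in for response.choices[0].message.content from the request above.
text = '{"urgency":"high","category":"billing","customer_id":"C-4821","summary":"Customer was charged twice."}'
triage = parse_triage(text)
print(triage.urgency, triage.customer_id)  # prints: high C-4821

# A reply cut off mid-object (for example by max_tokens) is rejected cleanly.
print(parse_triage('{"urgency":"high","category":"bil'))  # prints: None
```

Validating on the client side is still worthwhile with constrained decoding, because a reply that hits max_tokens can stop before the JSON object is closed.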

Conclusion

vLLM also provides other structured generation options, such as regex-constrained generation, grammar-constrained generation, and custom logits processors.

An extended version of this article, published on Medium in Towards AI, covers these topics along with examples of tool-calling and routing agents.

Link to the article: AI Agents in Production: Why Structured Generation Matters More Than Prompt Engineering

I hope this article was useful in showing why structured generation is not just a formatting trick, but a practical requirement for production AI agents. When agents are part of real software pipelines, their outputs must be predictable, valid, and easy for downstream systems to use.
