DEV Community

zhongqiyue

How I Stopped Fighting LLMs for Structured Output (and What I Learned)

The Hallucination That Broke My CI Pipeline

Two months ago, I was building an internal documentation assistant. The idea was simple: feed it a function’s docstring and some context, and it would spit out a clean JSON object with name, description, parameters, and example. The LLM I was using (GPT-4 via API) could do it—sometimes. But about 30% of the time it would:

  • Sprinkle extra text before/after the JSON
  • Use single quotes instead of double quotes (invalid JSON)
  • Include a completely fake parameter I never asked for
  • Forget to close a bracket

That third category is what broke my pipeline. The fake parameters would get parsed into our API schema, generate invalid OpenAPI specs, and fail the CI build. We were spending more time validating than writing.

What I Tried First (and Why It Failed)

1. Begging with better prompts

I tried the classic "only respond with valid JSON, no markdown, no explanations". I even appended "Return ONLY a JSON object." in bold. Still, about 10% of responses included a polite "Here is the JSON:" before the output.

2. Regex and manual parsing

I wrote a messy function that stripped everything before { and after }. That worked for simple cases, but any nested objects or escaped braces broke the regex. My teammates started calling it "the scramble function".
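For comparison, here is a minimal sketch of a sturdier extraction that leans on the standard library instead of a regex. It is not the function described above; `extract_first_json_object` is a hypothetical helper name. It tries each `{` in turn and lets `json.JSONDecoder.raw_decode` decide where the object ends, so nested objects and braces inside strings are handled by a real parser.

```python
import json


def extract_first_json_object(text: str) -> dict:
    """Return the first decodable JSON object found inside `text`."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and
            # ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        # This brace did not open a valid object; try the next one.
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")


reply = 'Here is the JSON: {"name": "f", "meta": {"note": "uses {braces}"}} Hope that helps!'
doc = extract_first_json_object(reply)
```

This only removes the surrounding chatter. Single quotes or an unclosed bracket still raise an error, which is why extraction alone does not solve the problem.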

3. Post-processing with json.loads() + retry

I wrapped json.loads() in a try/except. If parsing failed, I sent the same prompt again. That roughly doubled cost and latency for every failed request, and the second attempt was just as likely to fail.

None of these addressed the root cause: the model doesn't inherently understand syntax. It approximates patterns.

What Actually Worked: Structured Generation with Output Parsers

Instead of fighting the model's natural text output, I flipped the approach: I constrained the model's output format from the start using a schema-backed prompt and a parser that forces the output into a validated structure. The core technique is:

  1. Define the expected JSON schema (using Pydantic, for example).
  2. Serialize the schema into a clear, machine-readable description in the system prompt.
  3. Ask the model to output a JSON object that matches that schema.
  4. Use a library like instructor or lmql to extract the JSON reliably.
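Steps 1 to 3 can be sketched with nothing but the standard library. The schema below is written by hand for illustration (a Pydantic v2 model's `model_json_schema()` would generate a comparable dictionary), and `FUNCTION_DOC_SCHEMA` and `build_messages` are hypothetical names, not part of any library.

```python
import json

# Hand-written JSON Schema for the four fields described earlier.
FUNCTION_DOC_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "description": {"type": "string"},
        "parameters": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "type": {"type": "string"},
                    "description": {"type": "string"},
                },
                "required": ["name", "type", "description"],
                "additionalProperties": False,
            },
        },
        "example": {"type": "string"},
    },
    "required": ["name", "description", "parameters", "example"],
    # Disallow keys that are not listed above.
    "additionalProperties": False,
}


def build_messages(source_code: str) -> list[dict]:
    """Build chat messages whose system prompt embeds the schema verbatim."""
    system_prompt = (
        "You are a technical writer. Respond with a single JSON object that "
        "conforms to this JSON Schema:\n"
        + json.dumps(FUNCTION_DOC_SCHEMA, indent=2)
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": source_code},
    ]


messages = build_messages("def fetch_user(user_id: int) -> User: ...")
```

Putting the schema in the prompt only tells the model what you want. Step 4, the validating parser, is what actually enforces it.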

Here’s a minimal Python example using instructor (a thin wrapper around the OpenAI client) that validates the response on the fly:

```python
from typing import List

import instructor
import openai
from pydantic import BaseModel, Field


class Parameter(BaseModel):
    name: str = Field(description="Name of the parameter")
    type: str = Field(description="Data type, e.g. string, int")
    description: str = Field(description="Purpose of the parameter")


class FunctionDoc(BaseModel):
    name: str
    description: str
    parameters: List[Parameter]
    example: str


# Wrap the OpenAI client so that create() accepts response_model.
# (A bare patch() call does not affect a client created with openai.OpenAI().)
client = instructor.from_openai(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    response_model=FunctionDoc,  # <-- this is the magic: the reply is validated against this model
    messages=[
        {
            "role": "system",
            "content": "You are a technical writer. Extract the function signature and docs.",
        },
        {
            "role": "user",
            "content": "def fetch_user(user_id: int) -> User: ...",
        },
    ],
)

print(response.name)  # response is a validated FunctionDoc instance
```

Under the hood, instructor passes the schema to OpenAI’s function (tool) calling interface and validates the returned arguments against the Pydantic model. If validation fails and retries are enabled (the max_retries argument), it re-sends the request with the validation error included so the model can correct itself. No regex, no guessing.
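The retry-with-error idea is easy to picture without the library. The sketch below is not instructor's actual implementation: `validate_function_doc`, `generate_validated`, and `fake_model` are hypothetical names, the validation only checks top-level keys, and a stub stands in for the real API call.

```python
import json
from typing import Callable

REQUIRED_KEYS = {"name", "description", "parameters", "example"}


def validate_function_doc(raw: str) -> dict:
    """Parse `raw` and check the top-level keys; raise ValueError if invalid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    extra = set(data) - REQUIRED_KEYS
    if missing or extra:
        raise ValueError(
            f"missing keys: {sorted(missing)}; unexpected keys: {sorted(extra)}"
        )
    return data


def generate_validated(
    ask_model: Callable[[list], str], messages: list, max_attempts: int = 3
) -> dict:
    """Call the model; on failure, re-ask with the validation error attached."""
    history = list(messages)
    for _ in range(max_attempts):
        raw = ask_model(history)
        try:
            return validate_function_doc(raw)
        except ValueError as exc:
            # Unlike a blind retry, the next attempt sees what went wrong.
            history.append({"role": "assistant", "content": raw})
            history.append(
                {
                    "role": "user",
                    "content": f"Validation failed: {exc}. Return corrected JSON only.",
                }
            )
    raise RuntimeError(f"no valid output after {max_attempts} attempts")


# Stub model: the first reply has stray text, the second is valid.
replies = iter(
    [
        'Here is the JSON: {"name": "fetch_user"}',
        '{"name": "fetch_user", "description": "Fetch a user.", '
        '"parameters": [], "example": "fetch_user(1)"}',
    ]
)
calls = []


def fake_model(history: list) -> str:
    calls.append(len(history))  # record how much context each attempt saw
    return next(replies)


result = generate_validated(
    fake_model,
    [{"role": "user", "content": "def fetch_user(user_id: int) -> User: ..."}],
)
```

The difference from my earlier try/except loop is the feedback: the second attempt includes the failed reply and the error, rather than repeating the same prompt.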

The results

  • Success rate went from 70% to ~99.5%
  • The remaining 0.5% are cases where the model misunderstands the content (e.g., wrong parameter type) but still produces valid JSON that I can easily fix.
  • Latency increased by ~100ms per request, which I attribute to function-calling overhead, but we eliminated our manual retry loop, so overall it’s faster.
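Valid JSON can still be wrong about the function it describes. One way to catch invented or missing parameter names is to compare the model's output against the real signature. This is a sketch under assumptions: `signature_param_names` and `find_name_mismatches` are hypothetical helpers, the input is Python source, and only names are compared, not types.

```python
import ast


def signature_param_names(source: str) -> list[str]:
    """Return the parameter names of the first function found in `source`.

    *args and **kwargs are ignored to keep the sketch short.
    """
    tree = ast.parse(source)
    func = next(
        node
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    )
    args = func.args.posonlyargs + func.args.args + func.args.kwonlyargs
    return [arg.arg for arg in args]


def find_name_mismatches(source: str, documented: list[dict]) -> list[str]:
    """List parameters the model invented or left out, compared to the code."""
    actual = set(signature_param_names(source))
    claimed = {param["name"] for param in documented}
    problems = [f"invented parameter: {name}" for name in sorted(claimed - actual)]
    problems += [f"undocumented parameter: {name}" for name in sorted(actual - claimed)]
    return problems


problems = find_name_mismatches(
    "def fetch_user(user_id: int) -> User: ...",
    [{"name": "user_id"}, {"name": "timeout"}],
)
```

Here `timeout` is flagged because it does not appear in the signature.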

Lessons Learned and Trade-offs

  • This only works with models that support function/tool calling. If you’re using a local model or a provider that doesn’t expose this, you’ll need a different approach (like grammar-based sampling).
  • You pay for extra tokens and for retries. The schema is sent with every request as a tool definition, and each validation retry is another API call. For high-volume, low-cost apps, this might not be ideal.
  • Schema design matters. If your schema is too complex (deep nesting, huge enums), the model can choke or hallucinate. Keep it flat and explicit.
  • The technique is model-agnostic in theory, but not in practice. I tested with GPT-4o and Claude 3.5—both work well. Smaller models (e.g., GPT-3.5) still struggle to adhere.

When NOT to Use This

If you’re building a chatbot where the output is purely natural language (like a friendly email reply), structured generation is overkill. Let the model write freely and only parse what you need. Also, if you can tolerate occasional retries and your request volume is low, a plain try/except with a retry is easier to implement and debug.

But if you’re building an automation pipeline that consumes LLM outputs as typed data—and you’ve felt the pain of a CI failure because of a missing comma—this pattern is a lifesaver.

What I’d Do Differently Next Time

  1. Start with schema-driven validation from day one. I wasted a week on prompt engineering when I should have reached for a structured output library immediately.
  2. Build a small test set of edge cases (nested objects, empty lists, duplicate keys) before rolling out to production.
  3. Monitor for schema drift. If the model changes or you update your schema, old prompts might break. Add a simple test in CI that runs a few sample inputs and checks the output structure.
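A CI check along those lines might look like the sketch below. The names `reject_duplicate_keys`, `check_structure`, and `SAMPLES` are hypothetical, and the sample replies are made up for illustration. In a real pipeline they would come from live or recorded model calls.

```python
import json


def reject_duplicate_keys(pairs: list[tuple]) -> dict:
    """object_pairs_hook that fails on repeated keys instead of keeping the last one."""
    keys = [key for key, _ in pairs]
    duplicates = {key for key in keys if keys.count(key) > 1}
    if duplicates:
        raise ValueError(f"duplicate keys: {sorted(duplicates)}")
    return dict(pairs)


def check_structure(raw: str) -> dict:
    """Parse one model reply and assert the shape downstream code relies on."""
    doc = json.loads(raw, object_pairs_hook=reject_duplicate_keys)
    assert set(doc) == {"name", "description", "parameters", "example"}
    assert isinstance(doc["parameters"], list)  # an empty list is fine
    for param in doc["parameters"]:
        assert set(param) == {"name", "type", "description"}
    return doc


# Made-up replies covering an empty list and a nested parameter object.
SAMPLES = [
    '{"name": "ping", "description": "No-arg call.", "parameters": [], "example": "ping()"}',
    '{"name": "fetch_user", "description": "Fetch a user.", '
    '"parameters": [{"name": "user_id", "type": "int", "description": "User ID"}], '
    '"example": "fetch_user(1)"}',
]

checked = [check_structure(sample) for sample in SAMPLES]
```

By default, `json.loads` silently keeps the last value when a key repeats, so the duplicate-key case needs the explicit hook to be caught at all.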

I’m still experimenting with alternative tools—one service I recently tested (ai.interwestinfo.com) provides a similar structured generation API but abstracts the function-calling layer entirely. It’s interesting, but the technique itself is what matters.

Have you run into similar reliability issues with LLM outputs? What’s your setup for keeping them in line? I’d love to hear what’s worked for you.
