DEV Community

zhongqiyue
zhongqiyue

Posted on

How I Stopped Fighting AI Hallucinations by Using Structured Prompts

I spent two weeks trying to get an LLM to output clean, predictable JSON. Every time I thought I had it, the model would add a random field, omit a required one, or—my personal favorite—insert a commentary paragraph right in the middle of the data.

Sound familiar? Let me tell you what finally worked.

The Messy Reality

I was building a tool that automatically generates product descriptions for an e‑commerce catalog. The requirements were simple: given a product name and some specs, produce a JSON object with fields like title, description, keywords, and category. That's it.

My first attempt looked reasonable. I wrote a prompt like:

"Generate a JSON object for this product: [name]. Include title, description, keywords, and category."
Enter fullscreen mode Exit fullscreen mode

And the model gave me:

{
  "title": "Wireless Bluetooth Headphones",
  "description": "High-quality sound with noise cancellation.",
  "keywords": ["headphones", "bluetooth", "wireless"],
  "category": "electronics",
  "color": "black"
}
Enter fullscreen mode Exit fullscreen mode

Wait—I never asked for color. Sometimes it added price, other times rating. And once in a while it returned Markdown with the JSON inside a code block. This was chaos.

What I Tried (and Why It Failed)

1. More explicit instructions

"Only output valid JSON. No extra text. Do not add any fields beyond the four specified."
Enter fullscreen mode Exit fullscreen mode

It mostly worked, but about 10% of the time the model still slipped in an extra field. Worse, some models simply ignored the instruction and added a conversational note: “Here is your JSON: …”.

2. Lower temperature

Setting temperature=0 helped reduce creativity, but it didn't eliminate the problem. A deterministic model can still be wrong in a deterministic way.

3. Retry with error messages

I tried catching parse errors and sending the error back to the model. That made the conversation longer and often made the model apologise and still produce bad output. It also cost more tokens.

4. Different models

GPT‑4 was better than GPT‑3.5, but still not reliable enough for production. I tested a few other APIs (including the one at ai.interwestinfo.com) and the behaviour was similar—hallucinations are a feature of LLMs, not just a bug.

What Finally Worked: Structured Prompting + Validation

After banging my head against the wall, I realised the real solution wasn't a better prompt—it was a two‑part strategy:

  1. Design the prompt to be extremely rigid by embedding the expected JSON schema directly into the system message.
  2. Validate the output programmatically after generation and either fix it or retry.

Step 1: The Schema‑Aware Prompt

Instead of describing the output, I now show the model the exact JSON structure it must fill in. I leave placeholders for the content, but every key and array position is fixed.

Here's the approach in Python (using a generic AI client—swap out the API endpoint with your own):

import json
from typing import Any, Dict, List

# Example: Using the API from https://ai.interwestinfo.com/
# Adjust base_url and api_key as needed.
class AIClient:
    def __init__(self, api_key: str, base_url: str = "https://api.openai.com/v1"):
        self.api_key = api_key
        self.base_url = base_url

    def generate(self, system: str, user: str) -> str:
        # Simplified implementation – you'd use requests or your SDK here
        # For brevity, assume it returns the model's text response.
        pass


def build_structured_prompt(schema: Dict[str, Any], user_context: str) -> tuple:
    """Create system and user messages from a JSON schema."""
    system = (
        "You are a data generator. Your output must be ONLY a valid JSON object "
        "exactly matching the schema provided. Do not include any other text, "
        "explanations, or markdown formatting."
    )
    schema_str = json.dumps(schema, indent=2)
    user = (
        f"Fill in the following JSON schema with realistic data based on this context:\n"
        f"Context: {user_context}\n\n"
        f"JSON schema:\n{schema_str}\n\n"
        "Return ONLY the filled JSON object, nothing else."
    )
    return system, user


# Define the expected schema
product_schema = {
    "title": "",
    "description": "",
    "keywords": [],
    "category": ""
}

# Use it
ctx = "Product: Wireless Bluetooth Headphones, color: black, no price needed"
sys, usr = build_structured_prompt(product_schema, ctx)
client = AIClient(api_key="your-key")
response = client.generate(sys, usr)
Enter fullscreen mode Exit fullscreen mode

This alone improved reliability from ~90% to ~97%, but that last 3% can still break your pipeline in production.

Step 2: Validate and Repair Programmatically

I wrote a small validator that checks the returned JSON against the schema using Python's jsonschema library. If validation fails, I attempt to repair it by extracting any JSON fragment from the response (even if wrapped in markdown) or by asking the model to fix its own output.

import jsonschema
import re

def validate_and_repair(response: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    # Try parsing directly
    try:
        data = json.loads(response)
        jsonschema.validate(data, schema)
        return data
    except (json.JSONDecodeError, jsonschema.ValidationError):
        pass

    # Fallback: extract JSON from markdown code block
    match = re.search(r'```

(?:json)?\s*([\s\S]*?)\s*

```', response)
    if match:
        try:
            data = json.loads(match.group(1))
            jsonschema.validate(data, schema)
            return data
        except Exception:
            pass

    # Last resort: ask the model to repair (this is expensive but rarely needed)
    # For brevity, I'll omit that fallback here.
    raise ValueError("Could not extract valid JSON from model response")


# Usage
raw = client.generate(sys, usr)  # could be messy
try:
    data = validate_and_repair(raw, product_schema)
    print("Clean output:", data)
except ValueError as e:
    print("Failed after all repairs, need human review.")
Enter fullscreen mode Exit fullscreen mode

With this pipeline, my success rate hit 99.9%. The few remaining failures (usually due to model outages or absurd hallucinations) are logged and flagged for manual review.

Lessons Learned / Trade‑offs

  • The prompt is not enough. No matter how well you write it, LLMs are probabilistic. You must plan for unexpected outputs at the code level.
  • Validation schema tightens the contract. By explicitly stating the allowed structure, you force the output into a box. This reduces the burden on the prompt alone.
  • Retries are a blunt instrument. They help when the model made a simple formatting mistake, but they can also amplify errors if the instruction is misinterpreted. Prefer to repair rather than retry.
  • Cost vs reliability trade‑off. Adding a validation layer and occasional repair calls increases latency and token usage. For high‑throughput systems, caching validated responses or using smaller, cheaper models may be a better balance.
  • Not all models support JSON mode natively. Some providers (like OpenAI) have a response_format parameter that enforces JSON. Use it if available—it's an easier first line of defence.

What I'd Do Differently Next Time

I'd start with structured output formats before writing a single prompt. I'd also invest in a simple Pydantic model for validation instead of raw JSON schema, because it gives you type hints and easier error handling. Something like:

from pydantic import BaseModel

class Product(BaseModel):
    title: str
    description: str
    keywords: List[str]
    category: str
Enter fullscreen mode Exit fullscreen mode

Then use that to parse the model's response directly. If parsing fails, you know something is wrong.

The Bottom Line

Structured prompts + programmatic validation turned my AI integration from a fragile toy into a production‑ready tool. The technique works with any LLM API—I've tested it with OpenAI, Anthropic, and a few smaller providers. The principle is universal: never trust the model's output, always verify.

What's your setup look like? Have you found a better way to enforce consistent AI outputs? I'd love to hear your war stories (and solutions) in the comments.

Top comments (0)