LLMs are great at writing code, but ask them to generate strictly formatted Markdown? That's a different story. We spent weeks optimizing our prompts to fix technical hallucinations and structural chaos, but hit a wall. Eventually, we stopped trying to solve it with words alone and built a pipeline using a Judge-Write loop with experience replay.
The result was immediate: content generation accuracy jumped from 77% to 94%.
The Problem: System Failure Again
While building an automated technical documentation system, our Writer Agent kept producing content with SQL syntax errors and logic gaps. It couldn't guarantee strict Markdown compliance, causing frequent crashes in the rendering layer.
The core challenge was enforcing strict output structure without sacrificing speed (latency < 3s) or falling into infinite retry loops. Left unchecked, our production error rate would stay above 20%, triggering more than 40 alerts a week and eroding user trust.
Root Cause Analysis
1. Prompt Engineering Failed
Simply increasing prompt complexity (like Chain of Thought) didn't fix structural errors. LLMs still struggle with complex Markdown tables. Asking one model to be purely creative yet strictly rigorous is a losing battle.
2. No Immediate Feedback
The Writer Agent was a one-shot process. Any error it produced went straight to the output. There was no mechanism for self-correction or intermediate quality control. It was like taking an exam with no teacher to grade it.
3. Experience Wasn't Reusable
Every generation was independent. The system couldn't remember which patterns (like specific SQL syntax) were correct, leading to repeated errors. The agent kept falling into the same holes.
The Solution: Let AI Be the Judge
We decoupled generation from evaluation by introducing an independent Judge Agent for syntax validation and logic review. Since the Writer couldn't be trusted on its own, we gave it a strict quality control officer.
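To make "syntax validation" concrete, here is a minimal sketch of one deterministic check a Judge could run before any model-based review: verifying that every row of a Markdown pipe table has the same number of cells as its header. The `Feedback` dataclass, `split_row`, and `check_markdown_tables` are hypothetical names for illustration, not part of the system described here, and the check ignores escaped pipes inside cells.

```python
from dataclasses import dataclass, field


@dataclass
class Feedback:
    """Result of one Judge check (hypothetical shape, assumed for this sketch)."""
    is_valid: bool
    issues: list[str] = field(default_factory=list)


def split_row(line: str) -> list[str]:
    """Split a pipe-table row into trimmed cells, dropping one outer pipe per side."""
    inner = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in inner.split("|")]


def check_markdown_tables(markdown: str) -> Feedback:
    """Flag pipe-table rows whose cell count differs from the table's header."""
    issues = []
    expected = None  # column count of the table currently being scanned
    for number, line in enumerate(markdown.splitlines(), start=1):
        if not line.strip().startswith("|"):
            expected = None  # a non-table line ends the current table
            continue
        cells = split_row(line)
        if expected is None:
            expected = len(cells)  # first row of a table is treated as the header
        elif len(cells) != expected:
            issues.append(
                f"line {number}: expected {expected} cells, found {len(cells)}"
            )
    return Feedback(is_valid=not issues, issues=issues)
```

Cheap rule-based checks like this can reject obviously broken drafts without spending a model call, leaving the Judge model for the logic review.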
The Judge-Write Loop:
# Before: single Writer, output returned unchecked
def generate_once(prompt):
    return writer_agent.generate(prompt)

# After: closed loop with a Judge
class MaxRetriesExceededError(Exception):
    pass

def generate_with_judge(prompt, max_retries=3):
    for _ in range(max_retries):
        draft = writer_agent.generate(prompt)
        feedback = judge_agent.evaluate(draft)
        if feedback.is_valid:
            return draft
        # Fold the Judge's feedback into the prompt before the next attempt
        prompt = refine_prompt_with_feedback(prompt, feedback)
    raise MaxRetriesExceededError()
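For readers who want to run the loop end to end, the following is a self-contained sketch with stub agents in place of real LLM calls. Everything beyond the loop itself is an assumption for illustration: the `Feedback` shape, the body of `refine_prompt_with_feedback`, the stub Writer and Judge, and the `budget_s` parameter, which is one possible way to respect the 3-second latency target. The budget is only checked between attempts, so it cannot interrupt a slow call that is already in flight.

```python
import time
from dataclasses import dataclass


@dataclass
class Feedback:
    """Judge verdict (hypothetical shape, assumed for this sketch)."""
    is_valid: bool
    message: str = ""


class MaxRetriesExceededError(Exception):
    """Raised when no draft passes the Judge within the attempt or time limit."""


def refine_prompt_with_feedback(prompt: str, feedback: Feedback) -> str:
    # Hypothetical refinement: append the Judge's complaint to the prompt.
    return f"{prompt}\n\nFix this issue from the previous draft: {feedback.message}"


def run_judge_write_loop(writer, judge, prompt, max_retries=3, budget_s=3.0):
    """Retry until the Judge approves, the attempts run out, or time is up."""
    deadline = time.monotonic() + budget_s
    for _ in range(max_retries):
        if time.monotonic() >= deadline:
            break  # out of time: do not start another attempt
        draft = writer(prompt)
        feedback = judge(draft)
        if feedback.is_valid:
            return draft
        prompt = refine_prompt_with_feedback(prompt, feedback)
    raise MaxRetriesExceededError(f"no valid draft within {max_retries} attempts")


def stub_writer(prompt: str) -> str:
    # Stand-in for an LLM: only adds the heading once it has seen Judge feedback.
    body = "Intro text."
    return f"# Title\n{body}" if "Fix this issue" in prompt else body


def stub_judge(draft: str) -> Feedback:
    # Stand-in for the Judge: require the draft to open with a level-1 heading.
    ok = draft.startswith("# ")
    return Feedback(ok, "" if ok else "draft must start with a '# ' heading")


result = run_judge_write_loop(stub_writer, stub_judge, "Write the intro")
```

Here the first draft fails, the feedback is folded into the prompt, and the second draft passes. Bounding both attempts and elapsed time keeps a stubborn failure from turning into an unbounded retry loop.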
Pattern-Based Experience Storage:
Instead of guessing blindly every time, the Writer now references "top student" homework. We extract high-quality code blocks approved by the Judge and store them as patterns in a Vector DB.
# Before: Cold start every time
messages = [{'role': 'system', 'content': 'You are a writer...'}]
# After: Inject successful experience Memory
relevant_patterns = memory.search(query=current_topic)
system_prompt = f"You are a writer. Reference these successful patterns: {relevant_patterns}"
messages = [{'role': 'system', 'content': system_prompt}]
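As a rough illustration of what `memory.search` might do, here is a toy in-memory stand-in for the Vector DB. It is a sketch under loud assumptions: `PatternMemory`, `embed`, and `cosine` are hypothetical names, and the "embedding" is just a bag of word counts, where a production system would use a real embedding model and vector store.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class PatternMemory:
    """In-memory stand-in for a store of Judge-approved patterns."""

    def __init__(self):
        self._items = []  # list of (embedding, pattern text)

    def add(self, pattern: str) -> None:
        """Store a pattern; intended to be called only after Judge approval."""
        self._items.append((embed(pattern), pattern))

    def search(self, query: str, top_k: int = 2) -> list[str]:
        """Return up to top_k stored patterns with non-zero similarity to the query."""
        query_vec = embed(query)
        scored = [(cosine(query_vec, vec), text) for vec, text in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for score, text in scored[:top_k] if score > 0]


memory = PatternMemory()
memory.add("SELECT name FROM users WHERE id = 1;  -- sql select by id")
memory.add("| Column | Type |\n|---|---|\n| id | int |  -- markdown table schema")

relevant_patterns = memory.search(query="sql select query", top_k=1)
system_prompt = (
    "You are a writer. Reference these successful patterns:\n"
    + "\n".join(relevant_patterns)
)
```

Only writing to the store after the Judge approves a draft is what keeps bad examples from being replayed into future prompts.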
Architectural Decisions
| Decision | Alternative | Rationale |
|---|---|---|
| Independent Judge Agent | Self-Correction (Self-Refine) | The same model has "blind spots." An independent model offers a more objective view and allows us to fine-tune the Judge specifically for inspection tasks. |
| Pattern Storage | Pure Fine-tuning | Fine-tuning is costly and lags behind. Vector DB storage of high-frequency successful patterns enables "next-day" iteration, cutting costs by 90%. |
Production Takeaways
- Trust, but Verify: Even GPT-4o-level models require a post-validation layer for structured data. Without it, production incident rates are unacceptably high.
- Separation of Concerns: The Writer handles "creativity," the Judge handles "rigor." Clear role definitions reduce system complexity better than a single all-powerful Agent.
- Experience is Data: Feeding approved outputs back into the system creates a flywheel effect. Over time, the average retry count dropped from 2.1 to 0.8.
Next time your LLM output is full of hallucinations, stop tweaking the prompt. Try giving it a strict Judge instead.