I spent the last few months building a system where 8 AI agents collaborate to turn a topic into a publication-ready LinkedIn post. Not a single prompt — a pipeline where each agent has one job, and they pass work between each other with feedback loops.
Here's how it works and what I learned.
The Problem
I was writing LinkedIn posts manually. Research a topic, write the draft, edit it, find hashtags, create a visual, schedule it. Each post took 45-60 minutes. I wanted to automate the workflow but quickly hit a wall: a single prompt can't do all of this well. Asking one LLM to research, write, validate, and generate visuals produces mediocre results across the board.
The fix was splitting the work into specialized agents — each one focused on doing one thing well.
The Architecture
    Topic Research --> Writing --> Validation --> Visual Generation
                          ^            |
                          |            v
                          +--- Feedback Loop (if score below threshold)
The Orchestrator manages the full pipeline. If a post scores below the quality threshold, it loops back to the Writing Agent with specific feedback about what to fix. Best-of-N tracking means even if all rewrites fail, the system keeps the highest-scoring attempt, not the last one.
The Agents
Each agent inherits from a BaseAgent class that provides Claude API calls with retry logic, model fallback, usage tracking, and observability.
    class BaseAgent:
        def __init__(self, anthropic_client, config):
            self.client = anthropic_client
            self.model = config.get("model", "claude-sonnet-4-20250514")

        def _call_claude(self, system_prompt, user_prompt, max_tokens=4096, model=None):
            # A per-call model overrides the agent's configured default.
            use_model = model or self.model
            # _create_message_with_retry (not shown) wraps the API call with retry logic.
            response = self._create_message_with_retry(
                model=use_model,
                max_tokens=max_tokens,
                system=system_prompt,
                messages=[{"role": "user", "content": user_prompt}],
            )
            # Return the text of the first content block.
            return response.content[0].text
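The post does not show the retry and fallback behavior behind `_create_message_with_retry`. Here is a minimal sketch of how such a helper might work, assuming a generic `send` callable in place of the real client call. The function name, the `models` list, and the backoff values are all hypothetical, not the project's actual implementation.

```python
import time


def call_with_retry(send, models, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Try each model in order, retrying failures with exponential backoff.

    `send` is any callable that takes a model name and returns a response.
    It stands in for the real API call; every name here is illustrative.
    """
    last_error = None
    for model in models:  # primary model first, then fallbacks
        for attempt in range(max_retries):
            try:
                return send(model)
            except Exception as exc:  # a real helper would catch only retryable errors
                last_error = exc
                if attempt < max_retries - 1:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("all models failed") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.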
Here's what each agent does:
TopicResearchAgent
Scans codebases or topic descriptions and extracts post-worthy topics with virality scoring across multiple dimensions — how controversial is the take, how strong is the pain point, how actionable is the advice.
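One way to picture multi-dimension scoring is a weighted sum over per-dimension ratings. The sketch below assumes each dimension is rated 0-10. The weights and the `virality_score` and `rank_topics` names are hypothetical, since the post does not say how the dimensions are combined.

```python
# Hypothetical weights; the actual combination rule is not described in the post.
WEIGHTS = {"controversy": 0.3, "pain_point": 0.4, "actionability": 0.3}


def virality_score(ratings):
    """Combine per-dimension ratings (each 0-10) into one weighted score."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)


def rank_topics(topics):
    """Sort candidate topics from most to least post-worthy."""
    return sorted(topics, key=lambda t: virality_score(t["ratings"]), reverse=True)
```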
WritingAgent
Creates posts using multiple hook patterns and supports different audience types. The key design: voice controls tone and style, while audience controls substance and vocabulary. A developer post gets tool names and configs. A leadership post gets team sizes and org decisions. A vocabulary boundary prevents jargon from leaking across audiences.
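A vocabulary boundary could be as simple as a per-audience list of off-limits terms checked against the draft. This sketch assumes exact word matching; the term lists and the `leaked_terms` name are invented for illustration.

```python
import re

# Hypothetical term lists; real boundaries would be tuned per audience.
OFF_LIMITS = {
    "leadership": {"dockerfile", "yaml", "regex"},
    "developer": {"headcount", "synergy", "stakeholder"},
}


def leaked_terms(post, audience):
    """Return the off-limits terms that appear in a post for the given audience."""
    words = set(re.findall(r"[a-z0-9]+", post.lower()))
    return sorted(words & OFF_LIMITS[audience])
```

A non-empty result could be fed back to the Writing Agent as a targeted rewrite instruction.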
ValidationAgent
Scores posts 0-100 using multiple evaluators that each measure a different quality dimension — value delivery, hook effectiveness, structure, voice authenticity, and engagement potential. Some evaluators act as hard gates: a well-written post about the wrong topic gets its score capped regardless of how good the writing is.
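The difference between a hard gate and an ordinary evaluator is easiest to see in code. This sketch assumes a weighted average of evaluator scores and a single topic-fidelity gate. The weights, the cap value of 50, and the function name are assumptions, not the system's real numbers.

```python
def blended_score(scores, weights, topic_fidelity_ok, cap=50):
    """Weighted average of evaluator scores (0-100), capped when a hard gate fails."""
    total_weight = sum(weights.values())
    blended = sum(scores[name] * weights[name] for name in weights) / total_weight
    if not topic_fidelity_ok:
        # Cap rather than subtract: strong writing cannot buy the score back.
        blended = min(blended, cap)
    return round(blended)
```

With a fixed penalty instead of a cap, a post scoring 90 on every dimension could lose points and still clear the threshold.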
VisualAgent
Generates 1080x1080 infographics via Gemini, with content analysis, text validation, and image generation in a multi-step pipeline.
VisualValidationAgent
Validates visuals using GPT-4o Vision — not Claude, to avoid self-preference bias. Includes a calibration curve and quality gates, plus a deterministic color diversity check that catches broken renders AI judges miss.
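A deterministic color diversity check can be sketched without any model at all. The version below works on a plain iterable of RGB tuples, which in practice would come from an image library such as Pillow. The bucket size and the `min_colors` threshold are assumptions.

```python
def count_distinct_colors(pixels, bucket=32):
    """Count distinct colors after quantizing each RGB channel into coarse buckets."""
    return len({(r // bucket, g // bucket, b // bucket) for r, g, b in pixels})


def looks_broken(pixels, min_colors=4):
    """Flag renders that are almost entirely one flat color (blank or failed output)."""
    return count_distinct_colors(pixels) < min_colors
```

Quantizing first keeps compression noise from making a blank image look colorful.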
SchedulingAgent
Generates optimal posting schedules based on the user's timezone with minimum spacing between posts.
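Here is a rough sketch of slot generation with minimum spacing, assuming `start` is a timezone-aware datetime in the user's zone (for example one built with `zoneinfo.ZoneInfo`). The preferred hours, the 24-hour gap, and the `build_schedule` name are hypothetical defaults.

```python
from datetime import datetime, timedelta, timezone


def build_schedule(start, count, preferred_hours=(9, 12, 17), min_gap=timedelta(hours=24)):
    """Return `count` slots at preferred local hours, each at least `min_gap` apart."""
    if not preferred_hours:
        raise ValueError("preferred_hours must not be empty")
    slots = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while len(slots) < count:
        for hour in preferred_hours:
            candidate = day.replace(hour=hour)
            if candidate <= start:
                continue  # never schedule in the past
            if slots and candidate - slots[-1] < min_gap:
                continue  # too close to the previous post
            slots.append(candidate)
            if len(slots) == count:
                break
        day += timedelta(days=1)
    return slots
```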
CarouselAgent
Builds multi-slide LinkedIn carousels: Gemini decomposes the content, Playwright renders the slides, and the output is a compressed PDF.
SourceExtractionAgent
Extracts relevant code snippets from GitHub/GitLab repos to give the Writing Agent real code examples to reference.
The Feedback Loop
This is where it gets interesting. The Orchestrator doesn't just run agents sequentially — it runs a validation loop:
    # Simplified orchestrator loop
    draft = writing_agent.write_post(topic, hook_style)
    best_draft, best_score = draft, -1
    previous_feedback = None

    for attempt in range(max_rewrite_attempts + 1):
        validation = validation_agent.validate(draft)

        # Track the best attempt (not just the last one)
        if validation.score > best_score:
            best_draft, best_score = draft, validation.score

        if validation.score >= min_score or attempt == max_rewrite_attempts:
            break  # Good enough, or out of rewrites

        # Compile targeted feedback and rewrite
        feedback = compile_feedback(validation, previous_feedback)
        previous_feedback = feedback
        draft = writing_agent.rewrite_post(draft, feedback)

    final_draft = best_draft  # highest-scoring attempt, even if none passed
Two key design decisions:
Best-result tracking: If the rewrite loop produces scores of 72, 68, 71 — the system uses the 72-scoring draft, not the 71. Without this, rewrites can regress.
Feedback deduplication: The feedback compiler separates "STILL BROKEN" issues (repeated across attempts) from new issues. This prevents the Writing Agent from getting the same generic feedback three times.
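The deduplication idea can be sketched as a split between issues seen in earlier attempts and issues that are new. This version assumes issues are plain strings compared by exact match, which a real compiler would likely loosen. The `format_feedback` name and the "NEW" label are hypothetical.

```python
def format_feedback(current_issues, previous_issues):
    """Split issues into repeats ("STILL BROKEN") and new ones, preserving order."""
    seen = set(previous_issues)
    repeated = [issue for issue in current_issues if issue in seen]
    new = [issue for issue in current_issues if issue not in seen]
    lines = [f"STILL BROKEN: {issue}" for issue in repeated]
    lines += [f"NEW: {issue}" for issue in new]
    return "\n".join(lines)
```

Listing repeats first signals to the writer which fixes did not land on the previous attempt.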
LLM-as-Judge: What I Learned
Using Claude to judge Claude's output taught me a few things the hard way.
Few-shot calibration is essential. A prompt like "rate this post 0-100" produces compressed scores around 70-80. Every evaluator needs calibration examples showing what different score levels actually look like. This spread the scores from a narrow band to a genuine 20-95 range.
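As an illustration of what calibration examples look like in practice, this sketch assembles an evaluator prompt with scored reference posts ahead of the post being judged. The prompt wording, the `examples` structure, and the function name are assumptions, not the system's actual prompts.

```python
def build_judge_prompt(post, examples):
    """Assemble an evaluator prompt with scored reference posts before the target post."""
    parts = ["Score the final post from 0 to 100. Use the scored examples as anchors."]
    # Order anchors from lowest to highest so the scale reads naturally.
    for example in sorted(examples, key=lambda e: e["score"]):
        parts.append(f"Example (score {example['score']}):\n{example['text']}")
    parts.append(f"Post to score:\n{post}")
    return "\n\n".join(parts)
```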
Self-preference bias is real. Claude rating Claude's visual output inflated scores. Switching to GPT-4o for visual validation fixed it immediately. If you're building LLM-as-judge systems, use a different model family than the one generating the output.
Deterministic checks complement AI judges. A color diversity check using Pillow catches broken Gemini renders that both Claude and GPT-4o rate as "acceptable." Sometimes a simple pixel count beats two frontier models.
Hard gates prevent gaming. A post about the wrong topic can still score high on every other metric because the writing is good. Fidelity checks must cap the blended score, not just penalize it; otherwise great writing about the wrong topic still passes.
The Tech Stack
- Python 3.12 + FastAPI for the API
- Anthropic Claude for writing and evaluation
- Google Gemini for visual and carousel generation
- GPT-4o for visual validation (avoids self-preference bias)
- Celery + RabbitMQ for async task processing
- MongoDB for content storage
- Redis for caching and task results
- LangSmith for tracing, evaluation datasets, and prompt management
- React + Vite for the frontend
Cost efficiency was a priority from day one. Using lighter models for classification tasks (like CTA detection and structural checks) and reserving the more capable models for actual writing and evaluation keeps the cost well under a dollar per post — including visuals.
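The routing itself can be very small. This sketch assumes a fixed set of cheap task names and uses placeholder tier labels rather than real model IDs; all of the names are hypothetical.

```python
# Hypothetical task names and tier labels; real model IDs would live in config.
LIGHT_TASKS = {"cta_detection", "structure_check"}


def pick_model(task, light_model="light-model", capable_model="capable-model"):
    """Route cheap classification tasks to the lighter model, everything else to the capable one."""
    return light_model if task in LIGHT_TASKS else capable_model
```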
What I'd Do Differently
Start with evaluation before writing. I built the Writing Agent first, then realized I had no way to measure if changes improved output. Building the evaluators first would have shortened the feedback cycle significantly.
Use different models for judging. Self-preference bias wasted a week of debugging. If your system generates output with one model, validate it with another.
Design for the rewrite loop from day one. The feedback loop between validation and rewriting is where most of the quality comes from. A single-shot prompt gets you decent output. The rewrite loop is what makes it publication-ready.
Try It
The system is live at postsmith.io. Free tier gives you 6 posts per month. If you're interested in the multi-agent architecture or have questions about building AI evaluation systems, drop a comment — happy to go deeper on any part of this.
Built with Python, Claude, and too many late nights debugging why Gemini keeps putting text in the wrong place on infographics.