Patrick Lou

Posted on Apr 5

How I Built a Multi-Agent AI Pipeline with Python and Claude

#python #ai #anthropic #programming

I spent the last few months building a system where 8 AI agents collaborate to turn a topic into a publication-ready LinkedIn post. It's not a single prompt but a pipeline: each agent has one job, and the agents pass work to one another through feedback loops.

Here's how it works and what I learned.

The Problem

I was writing LinkedIn posts manually. Research a topic, write the draft, edit it, find hashtags, create a visual, schedule it. Each post took 45-60 minutes. I wanted to automate the workflow but quickly hit a wall: a single prompt can't do all of this well. Asking one LLM to research, write, validate, and generate visuals produces mediocre results across the board.

The fix was splitting the work into specialized agents — each one focused on doing one thing well.

The Architecture

Topic Research --> Writing --> Validation --> Visual Generation
                     ^             |
                     |             v
                     +--- Feedback Loop (if score below threshold)

The Orchestrator manages the full pipeline. If a post scores below the quality threshold, it loops back to the Writing Agent with specific feedback about what to fix. Best-of-N tracking means that even if every rewrite falls short of the threshold, the system keeps the highest-scoring attempt rather than the last one.

The Agents

Each agent inherits from a BaseAgent class that provides Claude API calls with retry logic, model fallback, usage tracking, and observability.

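class BaseAgent:
    """Shared base class that wraps Claude API calls for every agent."""

    def __init__(self, anthropic_client, config):
        self.client = anthropic_client
        # Default model; individual calls can override it
        self.model = config.get("model", "claude-sonnet-4-20250514")

    def _call_claude(self, system_prompt, user_prompt, max_tokens=4096, model=None):
        """Send one user message and return the text of the first content block."""
        use_model = model or self.model
        # _create_message_with_retry (not shown) wraps the API call with retry logic
        response = self._create_message_with_retry(
            model=use_model,
            max_tokens=max_tokens,
            system=system_prompt,
            messages=[{"role": "user", "content": user_prompt}],
        )
        return response.content[0].text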
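The snippet above calls a retry helper without showing it. Here is a minimal sketch of what retry with model fallback can look like, under my own assumptions: `send` is a hypothetical stand-in for the real API call, and `call_with_retry_and_fallback`, `models`, `max_retries`, and `base_delay` are illustrative names, not the ones used in the actual codebase.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_retry_and_fallback(
    send: Callable[[str], str],
    models: Sequence[str],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order, retrying each with exponential backoff.

    `send(model)` is a hypothetical stand-in for the real API call and is
    expected to raise an exception on failure.
    """
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(max_retries):
            try:
                return send(model)
            except Exception as exc:  # a real client would catch narrower errors
                last_error = exc
                # Back off before retrying the same model; skip the wait
                # after the final attempt and move on to the next model.
                if attempt < max_retries - 1:
                    sleep(base_delay * 2 ** attempt)
    raise RuntimeError("all models failed") from last_error
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can swap in a no-op instead of actually waiting.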

Here's what each agent does:

TopicResearchAgent

Scans codebases or topic descriptions and extracts post-worthy topics with virality scoring across multiple dimensions — how controversial is the take, how strong is the pain point, how actionable is the advice.
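To make the multi-dimension scoring concrete, here is a rough sketch of one way to combine the dimensions: a weighted average. The weights, the 0-10 scale, and the names `VIRALITY_WEIGHTS`, `virality_score`, and `rank_topics` are all assumptions for illustration; the real agent may combine dimensions differently.

```python
# Hypothetical weights; how the real agent weighs dimensions is not specified.
VIRALITY_WEIGHTS = {"controversy": 0.3, "pain_point": 0.4, "actionability": 0.3}


def virality_score(dimension_scores: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each assumed to be 0-10)."""
    total_weight = sum(VIRALITY_WEIGHTS.values())
    weighted = sum(
        VIRALITY_WEIGHTS[name] * dimension_scores.get(name, 0.0)
        for name in VIRALITY_WEIGHTS
    )
    return weighted / total_weight


def rank_topics(topics: dict[str, dict[str, float]]) -> list[str]:
    """Return topic names sorted from highest to lowest virality score."""
    return sorted(topics, key=lambda name: virality_score(topics[name]), reverse=True)
```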

WritingAgent

Creates posts using multiple hook patterns and supports different audience types. The key design: voice controls tone and style, while audience controls substance and vocabulary. A developer post gets tool names and configs. A leadership post gets team sizes and org decisions. A vocabulary boundary prevents jargon from leaking across audiences.
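The voice/audience split can be sketched as two separate inputs to the prompt, plus a simple check for terms that should not reach a given audience. Everything here is an assumption for illustration: the `Voice` and `Audience` classes, the prompt wording, and the naive substring match in `find_leaked_terms`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Voice:
    tone: str   # e.g. "direct"
    style: str  # e.g. "short sentences"


@dataclass(frozen=True)
class Audience:
    name: str
    substance: str  # what kind of detail this audience expects
    banned_terms: frozenset[str] = field(default_factory=frozenset)


def build_system_prompt(voice: Voice, audience: Audience) -> str:
    """Keep voice (how it sounds) and audience (what it says) as separate inputs."""
    return (
        f"Write in a {voice.tone} tone using {voice.style}.\n"
        f"The reader is a {audience.name} audience. Focus on {audience.substance}."
    )


def find_leaked_terms(draft: str, audience: Audience) -> list[str]:
    """Return banned terms that appear in the draft (case-insensitive substring match)."""
    lowered = draft.lower()
    return sorted(term for term in audience.banned_terms if term.lower() in lowered)
```

A substring match is crude (it would flag "api" inside "capital"), so a production check would likely match on word boundaries.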

ValidationAgent

Scores posts 0-100 using multiple evaluators that each measure a different quality dimension — value delivery, hook effectiveness, structure, voice authenticity, and engagement potential. Some evaluators act as hard gates: a well-written post about the wrong topic gets its score capped regardless of how good the writing is.
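Here is a minimal sketch of the hard-gate idea: blend the evaluator scores, then cap the result when a gate fails. The function name, the weighting scheme, and the cap of 50 are assumptions; only the cap-instead-of-penalize behavior comes from the design described above.

```python
def blended_score(
    evaluator_scores: dict[str, float],
    weights: dict[str, float],
    fidelity_passed: bool,
    gate_cap: float = 50.0,  # assumed cap, chosen only for illustration
) -> float:
    """Weighted average of evaluator scores, capped when the hard gate fails."""
    total_weight = sum(weights.values())
    blended = sum(weights[name] * evaluator_scores[name] for name in weights) / total_weight
    if not fidelity_passed:
        # A cap, not a penalty: strong writing cannot lift an off-topic
        # post above the cap.
        return min(blended, gate_cap)
    return blended
```

With a subtractive penalty, a post scoring 95 on everything else could still clear a threshold of 70 after losing 20 points. A cap removes that possibility as long as it sits below the threshold.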

VisualAgent

Generates 1080x1080 infographics via Gemini, with content analysis, text validation, and image generation in a multi-step pipeline.

VisualValidationAgent

Validates visuals using GPT-4o Vision — not Claude, to avoid self-preference bias. Includes a calibration curve and quality gates, plus a deterministic color diversity check that catches broken renders AI judges miss.
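A deterministic color diversity check can be very small. The sketch below takes a list of `(pixel_count, rgb)` pairs, which is the shape Pillow's `Image.getcolors()` returns (it returns `None` instead when the image has more colors than its `maxcolors` argument allows). The function name and both thresholds are assumptions, not the values the real check uses.

```python
def looks_like_broken_render(
    color_counts: list[tuple[int, tuple[int, int, int]]],
    min_distinct_colors: int = 8,
    max_dominant_share: float = 0.95,
) -> bool:
    """Flag images that are nearly a single flat color.

    `color_counts` is a list of (pixel_count, rgb) pairs. Both thresholds
    are illustrative assumptions.
    """
    total_pixels = sum(count for count, _ in color_counts)
    if total_pixels == 0:
        return True  # an empty image is treated as broken
    dominant_share = max(count for count, _ in color_counts) / total_pixels
    return len(color_counts) < min_distinct_colors or dominant_share > max_dominant_share
```

Because the check is pure arithmetic, it gives the same answer every time, which is what makes it a useful backstop behind model-based judges.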

SchedulingAgent

Generates optimal posting schedules based on the user's timezone with minimum spacing between posts.
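As a rough illustration of timezone-aware scheduling with minimum spacing, the sketch below places posts at a fixed local hour and skips whole days until the spacing is satisfied. The function name, the 9:00 default, and the day-granularity stepping are assumptions; the real agent's notion of "optimal" is not covered here.

```python
from datetime import datetime, timedelta, tzinfo


def build_schedule(
    start: datetime,  # must be timezone-aware
    post_count: int,
    user_tz: tzinfo,
    preferred_hour: int = 9,
    min_spacing: timedelta = timedelta(hours=24),
) -> list[datetime]:
    """Return `post_count` slots at `preferred_hour` in the user's timezone."""
    local_start = start.astimezone(user_tz)
    slot = local_start.replace(hour=preferred_hour, minute=0, second=0, microsecond=0)
    if slot < local_start:
        slot += timedelta(days=1)  # today's slot already passed

    slots: list[datetime] = []
    for _ in range(post_count):
        slots.append(slot)
        # Step forward in whole days until the minimum spacing is met,
        # so every slot stays at the preferred local hour.
        next_slot = slot + timedelta(days=1)
        while next_slot - slot < min_spacing:
            next_slot += timedelta(days=1)
        slot = next_slot
    return slots
```

Any `tzinfo` works for `user_tz`, including a `zoneinfo.ZoneInfo` built from the user's IANA timezone name.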

CarouselAgent

Builds multi-slide LinkedIn carousels: Gemini decomposes the content, Playwright renders the slides, and the output is a compressed PDF.

SourceExtractionAgent

Extracts relevant code snippets from GitHub/GitLab repos to give the Writing Agent real code examples to reference.

The Feedback Loop

This is where it gets interesting. The Orchestrator doesn't just run agents sequentially — it runs a validation loop:

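# Simplified orchestrator loop
best_draft, best_score = None, float("-inf")
previous_feedback = None
draft = writing_agent.write_post(topic, hook_style)  # first draft, written once

for attempt in range(max_rewrite_attempts + 1):
    validation = validation_agent.validate(draft)

    # Track the best attempt (not just the last one), including passing drafts
    if validation.score > best_score:
        best_draft, best_score = draft, validation.score

    if validation.score >= min_score or attempt == max_rewrite_attempts:
        break  # Good enough, or out of rewrites

    # Compile targeted feedback, then rewrite instead of starting over
    feedback = compile_feedback(validation, previous_feedback)
    draft = writing_agent.rewrite_post(draft, feedback)
    previous_feedback = feedback

# best_draft now holds the highest-scoring attempt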

Two key design decisions:

Best-result tracking: If the rewrite loop produces scores of 72, 68, and 71, the system uses the 72-scoring draft, not the 71. Without this, a rewrite that regresses would replace a better earlier draft.

Feedback deduplication: The feedback compiler separates "STILL BROKEN" issues (repeated across attempts) from new issues. This prevents the Writing Agent from getting the same generic feedback three times.
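The deduplication step can be sketched as a split between issues seen in an earlier attempt and issues that are new. This is a simplified illustration: the name `split_feedback`, the plain-string issue format, and exact-match comparison are assumptions, and the real feedback compiler works on validation results rather than bare strings.

```python
def split_feedback(current_issues: list[str], previous_issues: list[str]) -> str:
    """Separate repeated issues ("STILL BROKEN") from new ones and format both."""
    seen_before = set(previous_issues)
    still_broken = [issue for issue in current_issues if issue in seen_before]
    new_issues = [issue for issue in current_issues if issue not in seen_before]

    lines: list[str] = []
    if still_broken:
        lines.append("STILL BROKEN (flagged in an earlier attempt):")
        lines.extend(f"- {issue}" for issue in still_broken)
    if new_issues:
        lines.append("NEW ISSUES:")
        lines.extend(f"- {issue}" for issue in new_issues)
    return "\n".join(lines)
```

Exact string matching only works if the evaluators phrase a repeated issue identically each time; matching on a stable issue code would be more robust.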

LLM-as-Judge: What I Learned

Using Claude to judge Claude's output taught me a few things the hard way.

Few-shot calibration is essential. A prompt like "rate this post 0-100" produces compressed scores around 70-80. Every evaluator needs calibration examples showing what different score levels actually look like. Adding them spread the scores from that narrow band to a genuine 20-95 range.

Self-preference bias is real. Claude rating Claude's visual output inflated scores. Switching to GPT-4o for visual validation fixed it immediately. If you're building LLM-as-judge systems, use a different model family than the one generating the output.

Deterministic checks complement AI judges. A color diversity check using Pillow catches broken Gemini renders that both Claude and GPT-4o rate as "acceptable." Sometimes a simple pixel count beats two frontier models.

Hard gates prevent gaming. A post about the wrong topic can score high on every other metric because the writing is good. Fidelity checks must cap the blended score, not just penalize it. Otherwise great writing about the wrong topic still passes.
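To illustrate the few-shot calibration lesson above, here is a sketch of a judge prompt built from anchored examples. The `CalibrationExample` class, the prompt wording, and the function name are assumptions; the real evaluator prompts are not shown in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalibrationExample:
    post: str
    score: int
    rationale: str


def build_judge_prompt(
    dimension: str, examples: list[CalibrationExample], candidate: str
) -> str:
    """Assemble a scoring prompt with anchored examples ordered from low to high."""
    parts = [f"Score the post 0-100 on {dimension}. Use the examples as anchors."]
    for ex in sorted(examples, key=lambda e: e.score):
        parts.append(f"Example (score {ex.score}): {ex.post}\nWhy: {ex.rationale}")
    parts.append(f"Post to score: {candidate}\nReply with the integer score only.")
    return "\n\n".join(parts)
```

The examples should span the range you want the judge to use, since a set of anchors that all sit near 75 would likely reproduce the compressed scores the calibration is meant to fix.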

The Tech Stack

Python 3.12 + FastAPI for the API
Anthropic Claude for writing and evaluation
Google Gemini for visual and carousel generation
GPT-4o for visual validation (avoids self-preference bias)
Celery + RabbitMQ for async task processing
MongoDB for content storage
Redis for caching and task results
LangSmith for tracing, evaluation datasets, and prompt management
React + Vite for the frontend

Cost efficiency was a priority from day one. Using lighter models for classification tasks (like CTA detection and structural checks) and reserving the more capable models for actual writing and evaluation keeps the cost well under a dollar per post — including visuals.
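The routing idea can be sketched as a lookup from task type to model tier. The model names below are placeholders, the task names are illustrative, and `pick_model` and `estimate_cost` are hypothetical helpers; none of this reflects actual pricing.

```python
# Placeholder model names; substitute whichever light and capable models you use.
LIGHT_MODEL = "light-model"
CAPABLE_MODEL = "capable-model"

# Illustrative task names for the classification work mentioned above.
LIGHT_TASKS = frozenset({"cta_detection", "structure_check"})


def pick_model(task: str) -> str:
    """Route cheap classification tasks to the lighter model."""
    return LIGHT_MODEL if task in LIGHT_TASKS else CAPABLE_MODEL


def estimate_cost(
    calls: list[tuple[str, int]], price_per_1k_tokens: dict[str, float]
) -> float:
    """Sum the cost of (task, token_count) calls under caller-supplied prices."""
    return sum(
        tokens / 1000 * price_per_1k_tokens[pick_model(task)]
        for task, tokens in calls
    )
```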

What I'd Do Differently

Start with evaluation before writing. I built the Writing Agent first, then realized I had no way to measure if changes improved output. Building the evaluators first would have shortened the feedback cycle significantly.

Use different models for judging. Self-preference bias wasted a week of debugging. If your system generates output with one model, validate it with another.

Design for the rewrite loop from day one. The feedback loop between validation and rewriting is where most of the quality comes from. A single-shot prompt gets you decent output. The rewrite loop is what makes it publication-ready.

Try It

The system is live at postsmith.io. Free tier gives you 6 posts per month. If you're interested in the multi-agent architecture or have questions about building AI evaluation systems, drop a comment — happy to go deeper on any part of this.

Built with Python, Claude, and too many late nights debugging why Gemini keeps putting text in the wrong place on infographics.

DEV Community