<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Patrick Lou</title>
    <description>The latest articles on DEV Community by Patrick Lou (@patrick_lou_ecb75a1421ba6).</description>
    <link>https://dev.to/patrick_lou_ecb75a1421ba6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2176627%2F4ca9f3fe-644a-44dd-86a4-4e763a8915ee.jpg</url>
      <title>DEV Community: Patrick Lou</title>
      <link>https://dev.to/patrick_lou_ecb75a1421ba6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/patrick_lou_ecb75a1421ba6"/>
    <language>en</language>
    <item>
      <title>How I Built a Multi-Agent AI Pipeline with Python and Claude</title>
      <dc:creator>Patrick Lou</dc:creator>
      <pubDate>Sun, 05 Apr 2026 22:34:38 +0000</pubDate>
      <link>https://dev.to/patrick_lou_ecb75a1421ba6/how-i-built-a-multi-agent-ai-pipeline-with-python-and-claude-12a3</link>
      <guid>https://dev.to/patrick_lou_ecb75a1421ba6/how-i-built-a-multi-agent-ai-pipeline-with-python-and-claude-12a3</guid>
      <description>&lt;p&gt;I spent the last few months building a system where 8 AI agents collaborate to turn a topic into a publication-ready LinkedIn post. Not a single prompt — a pipeline where each agent has one job, and they pass work between each other with feedback loops.&lt;/p&gt;

&lt;p&gt;Here's how it works and what I learned.&lt;/p&gt;

&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;I was writing LinkedIn posts manually. Research a topic, write the draft, edit it, find hashtags, create a visual, schedule it. Each post took 45-60 minutes. I wanted to automate the workflow but quickly hit a wall: a single prompt can't do all of this well. Asking one LLM to research, write, validate, and generate visuals produces mediocre results across the board.&lt;/p&gt;

&lt;p&gt;The fix was splitting the work into specialized agents — each one focused on doing one thing well.&lt;/p&gt;

&lt;h2&gt;The Architecture&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Topic Research --&amp;gt; Writing --&amp;gt; Validation --&amp;gt; Visual Generation
                     ^             |
                     |             v
                     +--- Feedback Loop (if score below threshold)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;Orchestrator&lt;/code&gt; manages the full pipeline. If a post scores below the quality threshold, it loops back to the Writing Agent with specific feedback about what to fix. Best-of-N tracking means even if all rewrites fail, the system keeps the highest-scoring attempt, not the last one.&lt;/p&gt;

&lt;h2&gt;The Agents&lt;/h2&gt;

&lt;p&gt;Each agent inherits from a &lt;code&gt;BaseAgent&lt;/code&gt; class that provides Claude API calls with retry logic, model fallback, usage tracking, and observability.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;BaseAgent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;anthropic_client&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_call_claude&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;use_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_message_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;use_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;system&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what each agent does:&lt;/p&gt;

&lt;h3&gt;TopicResearchAgent&lt;/h3&gt;

&lt;p&gt;Scans codebases or topic descriptions and extracts post-worthy topics with virality scoring across multiple dimensions — how controversial is the take, how strong is the pain point, how actionable is the advice.&lt;/p&gt;
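&lt;p&gt;The dimension names, weights, and 0-10 scales below are my own illustrative guesses at how such a blended virality score might look — the article doesn't publish its actual schema:&lt;/p&gt;

```python
from dataclasses import dataclass

# Hypothetical dimensions; the article only names controversy, pain point,
# and actionability as examples of what gets scored.
@dataclass
class ViralitySignals:
    controversy: float    # 0-10: how contrarian is the take?
    pain_point: float     # 0-10: how acute is the problem it names?
    actionability: float  # 0-10: can a reader apply it today?

def virality_score(s):
    """Blend the 0-10 dimensions into a single 0-100 score."""
    weights = (0.3, 0.4, 0.3)  # invented weights
    raw = (weights[0] * s.controversy
           + weights[1] * s.pain_point
           + weights[2] * s.actionability)
    return round(raw * 10, 1)

print(virality_score(ViralitySignals(6, 8, 7)))  # 71.0
```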

&lt;h3&gt;WritingAgent&lt;/h3&gt;

&lt;p&gt;Creates posts using multiple hook patterns and supports different audience types. The key design: voice controls tone and style, while audience controls substance and vocabulary. A developer post gets tool names and configs. A leadership post gets team sizes and org decisions. A vocabulary boundary prevents jargon from leaking across audiences.&lt;/p&gt;
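&lt;p&gt;A minimal sketch of the vocabulary-boundary idea — the per-audience term lists here are invented for illustration, not the system's real lists:&lt;/p&gt;

```python
# Per-audience vocabularies; a term from one audience leaking into another
# audience's post gets flagged. Lists are invented for illustration.
AUDIENCE_VOCAB = {
    "developer": {"docker", "fastapi", "pytest", "ci/cd"},
    "leadership": {"headcount", "okr", "roadmap"},
}

def vocabulary_violations(post_text, audience):
    """Return terms belonging to *other* audiences found in the post."""
    text = post_text.lower()
    leaked = set()
    for other, vocab in AUDIENCE_VOCAB.items():
        if other != audience:
            leaked.update(term for term in vocab if term in text)
    return leaked

# A developer post mentioning leadership vocabulary gets flagged.
print(vocabulary_violations("We grew headcount and set OKRs.", "developer"))
```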

&lt;h3&gt;ValidationAgent&lt;/h3&gt;

&lt;p&gt;Scores posts 0-100 using multiple evaluators that each measure a different quality dimension — value delivery, hook effectiveness, structure, voice authenticity, and engagement potential. Some evaluators act as hard gates: a well-written post about the wrong topic gets its score capped regardless of how good the writing is.&lt;/p&gt;
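&lt;p&gt;A sketch of how blended scoring with a hard gate cap might work — the evaluator names, weights, and cap value are my assumptions, not the author's configuration:&lt;/p&gt;

```python
def blended_score(scores, weights, topic_fidelity_ok, gate_cap=40):
    """Weighted blend of evaluator scores, hard-capped when the topic
    fidelity gate fails, so great writing can't rescue the wrong topic."""
    blended = sum(scores[name] * w for name, w in weights.items())
    if not topic_fidelity_ok:
        return min(blended, gate_cap)
    return blended

# Invented weights and scores for a well-written draft.
weights = {"value": 0.3, "hook": 0.25, "structure": 0.2,
           "voice": 0.15, "engagement": 0.1}
scores = {"value": 90, "hook": 85, "structure": 88, "voice": 80, "engagement": 75}

print(blended_score(scores, weights, topic_fidelity_ok=True))
print(blended_score(scores, weights, topic_fidelity_ok=False))  # capped at 40
```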

&lt;h3&gt;VisualAgent&lt;/h3&gt;

&lt;p&gt;Generates 1080x1080 infographics via Gemini, with content analysis, text validation, and image generation in a multi-step pipeline.&lt;/p&gt;

&lt;h3&gt;VisualValidationAgent&lt;/h3&gt;

&lt;p&gt;Validates visuals using GPT-4o Vision — not Claude, to avoid self-preference bias. Includes a calibration curve and quality gates, plus a deterministic color diversity check that catches broken renders AI judges miss.&lt;/p&gt;

&lt;h3&gt;SchedulingAgent&lt;/h3&gt;

&lt;p&gt;Generates optimal posting schedules based on the user's timezone with minimum spacing between posts.&lt;/p&gt;
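&lt;p&gt;The spacing logic can be sketched like this — the 24-hour default is invented, and a production version would use timezone-aware datetimes (e.g. via &lt;code&gt;zoneinfo&lt;/code&gt;) for the user's locale:&lt;/p&gt;

```python
from datetime import datetime, timedelta

def build_schedule(first_slot, n_posts, min_spacing_hours=24):
    """Return n_posts slots, each at least min_spacing_hours apart."""
    return [first_slot + timedelta(hours=min_spacing_hours * i)
            for i in range(n_posts)]

slots = build_schedule(datetime(2026, 4, 6, 9, 0), 3)
for s in slots:
    print(s.isoformat())  # 09:00 on three consecutive days
```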

&lt;h3&gt;CarouselAgent&lt;/h3&gt;

&lt;p&gt;Multi-slide LinkedIn carousels — Gemini decomposes the content, Playwright renders slides, output is a compressed PDF.&lt;/p&gt;

&lt;h3&gt;SourceExtractionAgent&lt;/h3&gt;

&lt;p&gt;Extracts relevant code snippets from GitHub/GitLab repos to give the Writing Agent real code examples to reference.&lt;/p&gt;

&lt;h2&gt;The Feedback Loop&lt;/h2&gt;

&lt;p&gt;This is where it gets interesting. The Orchestrator doesn't just run agents sequentially — it runs a validation loop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified orchestrator loop
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_rewrite_attempts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writing_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hook_style&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;validation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validation_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;min_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# Good enough
&lt;/span&gt;
    &lt;span class="c1"&gt;# Track the best attempt (not just the last one)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best_score&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;best_draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draft&lt;/span&gt;
        &lt;span class="n"&gt;best_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;

    &lt;span class="c1"&gt;# Compile targeted feedback
&lt;/span&gt;    &lt;span class="n"&gt;feedback&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compile_feedback&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_feedback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;writing_agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rewrite_post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;feedback&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two key design decisions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best-result tracking&lt;/strong&gt;: If the rewrite loop produces scores of 72, 68, 71 — the system uses the 72-scoring draft, not the 71. Without this, rewrites can regress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feedback deduplication&lt;/strong&gt;: The feedback compiler separates "STILL BROKEN" issues (repeated across attempts) from new issues. This prevents the Writing Agent from getting the same generic feedback three times.&lt;/p&gt;
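&lt;p&gt;The dedup step can be sketched as follows — the function shape and labels are my assumptions based on the description above:&lt;/p&gt;

```python
def compile_feedback(current_issues, previous_issues):
    """Split repeated issues from new ones so the rewrite prompt
    escalates what's still broken instead of restating it as new."""
    seen = set(previous_issues)
    lines = [f"STILL BROKEN (fix this first): {i}"
             for i in current_issues if i in seen]
    lines += [f"New issue: {i}" for i in current_issues if i not in seen]
    return "\n".join(lines)

print(compile_feedback(
    ["hook is generic", "no concrete numbers"],  # this attempt's issues
    ["hook is generic"],                         # already flagged last time
))
```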

&lt;h2&gt;LLM-as-Judge: What I Learned&lt;/h2&gt;

&lt;p&gt;Using Claude to judge Claude's output taught me a few things the hard way.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Few-shot calibration is essential.&lt;/strong&gt; A prompt like "rate this post 0-100" produces compressed scores around 70-80. Every evaluator needs calibration examples showing what different score levels actually look like. Adding those examples spread the scores from a narrow band to a genuine 20-95 range.&lt;/p&gt;
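&lt;p&gt;For illustration, a judge prompt anchored with calibration examples might look like this — the example descriptions and scores are invented, not the system's real anchors:&lt;/p&gt;

```python
# Invented calibration anchors; each real evaluator would have its own.
CALIBRATION_EXAMPLES = [
    (25, "Generic listicle, no specifics, reads like ad copy."),
    (60, "Solid advice, but a buried hook and no concrete numbers."),
    (90, "Sharp contrarian hook, specific metrics, clear takeaway."),
]

def build_judge_prompt(post_text):
    anchors = "\n".join(f"Score {s}: {desc}" for s, desc in CALIBRATION_EXAMPLES)
    return ("Rate this LinkedIn post 0-100 for value delivery.\n"
            "Calibrate against these reference points:\n"
            f"{anchors}\n\nPost:\n{post_text}")

print(build_judge_prompt("Most teams measure the wrong thing..."))
```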

&lt;p&gt;&lt;strong&gt;Self-preference bias is real.&lt;/strong&gt; Claude rating Claude's visual output inflated scores. Switching to GPT-4o for visual validation fixed it immediately. If you're building LLM-as-judge systems, use a different model family than the one generating the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deterministic checks complement AI judges.&lt;/strong&gt; A color diversity check using Pillow catches broken Gemini renders that both Claude and GPT-4o rate as "acceptable." Sometimes a simple pixel count beats two frontier models.&lt;/p&gt;
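&lt;p&gt;The idea can be sketched over a plain pixel list — in production the pixels would come from Pillow (e.g. &lt;code&gt;Image.getdata()&lt;/code&gt;), and the thresholds here are invented:&lt;/p&gt;

```python
from collections import Counter

def passes_color_diversity(pixels, min_colors=8, max_dominance=0.9):
    """Deterministic gate: reject near-blank renders that have too few
    distinct colors, or one color covering almost the whole image."""
    counts = Counter(pixels)
    if len(counts) < min_colors:
        return False
    dominant = counts.most_common(1)[0][1] / len(pixels)
    return dominant <= max_dominance

# A "broken" render: 95% white with a handful of stray dark pixels.
broken = [(255, 255, 255)] * 95 + [(i, i, i) for i in (0, 10, 20, 30, 40)]
print(passes_color_diversity(broken))  # False: only 6 distinct colors
```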

&lt;p&gt;&lt;strong&gt;Hard gates prevent gaming.&lt;/strong&gt; A well-written post about the wrong topic can still score high on every other metric. Fidelity checks must cap the blended score, not just penalize it — otherwise great writing about the wrong topic still passes.&lt;/p&gt;

&lt;h2&gt;The Tech Stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Python 3.12 + FastAPI&lt;/strong&gt; for the API&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Anthropic Claude&lt;/strong&gt; for writing and evaluation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Google Gemini&lt;/strong&gt; for visual and carousel generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt; for visual validation (avoids self-preference bias)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Celery + RabbitMQ&lt;/strong&gt; for async task processing&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;MongoDB&lt;/strong&gt; for content storage&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Redis&lt;/strong&gt; for caching and task results&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; for tracing, evaluation datasets, and prompt management&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;React + Vite&lt;/strong&gt; for the frontend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost efficiency was a priority from day one. Using lighter models for classification tasks (like CTA detection and structural checks) and reserving the more capable models for actual writing and evaluation keeps the cost well under a dollar per post — including visuals.&lt;/p&gt;
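&lt;p&gt;A sketch of that cost-aware routing — only &lt;code&gt;claude-sonnet-4-20250514&lt;/code&gt; appears in this article, so the task names and the lighter-model ID are assumptions on my part:&lt;/p&gt;

```python
# Task categories the article describes as cheap classification work.
LIGHT_TASKS = {"cta_detection", "structure_check"}

def pick_model(task):
    """Route classification tasks to a lighter model; keep the capable
    model for actual writing and evaluation."""
    if task in LIGHT_TASKS:
        return "claude-3-5-haiku-latest"  # assumed lighter-model ID
    return "claude-sonnet-4-20250514"

print(pick_model("cta_detection"))
print(pick_model("write_post"))
```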

&lt;h2&gt;What I'd Do Differently&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start with evaluation before writing.&lt;/strong&gt; I built the Writing Agent first, then realized I had no way to measure if changes improved output. Building the evaluators first would have shortened the feedback cycle significantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use different models for judging.&lt;/strong&gt; Self-preference bias wasted a week of debugging. If your system generates output with one model, validate it with another.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design for the rewrite loop from day one.&lt;/strong&gt; The feedback loop between validation and rewriting is where most of the quality comes from. A single-shot prompt gets you decent output. The rewrite loop is what makes it publication-ready.&lt;/p&gt;

&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;The system is live at &lt;a href="https://postsmith.io" rel="noopener noreferrer"&gt;postsmith.io&lt;/a&gt;. Free tier gives you 6 posts per month. If you're interested in the multi-agent architecture or have questions about building AI evaluation systems, drop a comment — happy to go deeper on any part of this.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built with Python, Claude, and too many late nights debugging why Gemini keeps putting text in the wrong place on infographics.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>anthropic</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
