TechPulse Lab

How I'd build a 3-agent AI system in a weekend without losing my mind

Most of the "multi-agent" content I read on here right now is either (a) toy demos with three LLMs taking turns talking to each other, or (b) frameworks that ship a runtime, a DSL, a vector store, a planner, and a graph compiler before you've answered the only question that matters: what work is the system actually doing.

I've spent the last six months running a small fleet of agents in production — a content pipeline, a triage system, a research-and-summarise loop. None of them are flashy. All of them save more hours per week than they cost me to maintain. And every one of them follows the same three-role pattern: orchestrator + 2 specialists + 1 critic (the critic is optional but usually worth it).

This post is the build I'd do if you handed me a weekend and said "make a 3-agent system that actually works." No framework worship. No vendor lock-in. Just the questions, the file layout, and the failure modes that bit me the first three times.

Step 1: Pick the work, not the agents

The single biggest mistake I see in agent posts on dev.to is starting from the agent count. "I'm going to build a 3-agent system" is not a project, it's a costume. Start from the work.

Write one paragraph answering this:

Every <frequency>, I have to take <input>, do <sequence of decisions>, and produce <output>. The decisions involve <judgment / lookups / tooling>. It currently takes me <minutes/hours> and I do it <N> times per <period>.

If you can't fill that in, no agent is going to fix anything. If you can, the agent boundaries usually fall out of the sentence:

  • The input suggests an ingester.
  • The decisions suggest a worker (or two, if there are clearly distinct judgment types).
  • The output suggests a publisher / writer / critic.
  • The sequence suggests an orchestrator.

For the rest of this post I'll use a concrete example: a daily competitive intel briefing. Input = three RSS feeds + one Twitter list. Decision = which items matter, what's the angle, who already covered it. Output = a 200-word email by 8am.

Step 2: Carve the roles before you write code

Give each agent a one-sentence job description and a hard boundary. If you can't say what an agent won't do, it'll do everything badly.

For the briefing example:

  • Ingest agent. Pulls every item published since yesterday's run. Dedupes by URL. Does not judge importance. Output: a JSON list of {title, url, source, summary}.
  • Analyst agent. Scores each item for relevance to a fixed set of themes. Drops anything below a threshold. Does not write copy. Output: a ranked JSON list with a one-sentence "why it matters" per item.
  • Editor agent. Takes the top 5 items and writes the briefing. Does not re-evaluate the analyst's picks (this is a discipline thing — let the previous agent's decisions stand or you'll loop forever). Output: markdown email.
  • Orchestrator. Runs the pipeline on a cron. Handles retries, logs each handoff, and refuses to publish if any stage produced empty or malformed output.

Four roles total — orchestrator + 3 specialists. You can compress to 2 specialists for simpler work; I rarely go above 4 because the coordination overhead starts eating the wins.
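
Concretely, the handoff shapes in those role descriptions are small enough to write down before you touch a prompt. Here's a minimal sketch (the field names just mirror the bullets above; adjust them to your own work):

```python
# Sketch of the handoff shapes between stages. Field names mirror the
# role descriptions above; treat them as a starting point, not gospel.
from typing import TypedDict

class IngestItem(TypedDict):
    title: str
    url: str
    source: str
    summary: str

class RankedItem(IngestItem):
    score: float           # relevance to the themes file
    why_it_matters: str    # the analyst's one-sentence justification

# Ingest agent  -> list[IngestItem]
# Analyst agent -> list[RankedItem], sorted by score, threshold applied
# Editor agent  -> a markdown string (the briefing itself)
```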

Step 3: One file per agent, plain markdown for everything else

Resist the framework. For a weekend build, you want:

/agents
  ingest.md       # role description + tools allowed
  analyst.md
  editor.md
/memory
  themes.md       # hand-edited list of topics that matter
  yesterday.json  # last run's output, for dedup
  briefings/      # archive of past outputs
/run.py           # the orchestrator

Each *.md is the system prompt for that agent. Plain English. The themes file is hand-edited. The orchestrator is a 60-line Python script that calls each agent in sequence with a structured-output schema and writes the intermediate state to disk between every step.

I know this looks crude. That's the point. The framework people will tell you that you need a graph compiler and a typed message bus. You don't, not yet. You need every intermediate state on disk so that when something goes sideways at 6am Tuesday you can cat memory/run-2026-04-28/02-analyst.json and see exactly what the analyst saw.
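
For reference, the shape of that script is roughly this. It's a sketch, not the real thing (call_agent here is a stand-in for however you call your model with a structured-output schema), but the sequencing, the disk writes, and the refusal to continue on empty output are the whole idea:

```python
# Sketch of run.py. `call_agent` is a placeholder for your model call
# (load agents/<name>.md as the system prompt, request structured output).
import json
import sys
from datetime import date
from pathlib import Path

RUN_DIR = Path("memory") / f"run-{date.today().isoformat()}"
RUN_DIR.mkdir(parents=True, exist_ok=True)

def call_agent(prompt_file: str, payload: dict):
    """Stand-in: call your model with the agent's .md as the system prompt
    and a JSON schema for the output. Returns the parsed result."""
    raise NotImplementedError

def checkpoint(stage: str, data) -> None:
    # Every handoff lands on disk, so 6am debugging is just `cat`.
    (RUN_DIR / f"{stage}.json").write_text(json.dumps(data, indent=2))

def main() -> None:
    items = call_agent("agents/ingest.md", {"since": "yesterday"})
    checkpoint("01-ingest", items)
    if not items:
        sys.exit("ingest returned nothing - refusing to continue")

    ranked = call_agent("agents/analyst.md", {
        "items": items,
        "themes": Path("memory/themes.md").read_text(),
    })
    checkpoint("02-analyst", ranked)
    if not ranked:
        sys.exit("analyst dropped everything - refusing to continue")

    briefing = call_agent("agents/editor.md", {"items": ranked[:5]})
    if not briefing or not briefing.strip():
        sys.exit("editor produced an empty briefing - refusing to publish")
    archive = Path("memory/briefings") / f"{date.today().isoformat()}.md"
    archive.parent.mkdir(parents=True, exist_ok=True)
    archive.write_text(briefing)
    # send it: email, Slack webhook, whatever your delivery channel is

if __name__ == "__main__":
    main()
```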

Step 4: Structured output everywhere, even when it hurts

Most "my agents kept hallucinating" posts trace back to free-text handoffs. The ingest agent says "I found 12 items" in prose, the analyst tries to parse the prose, and the wheels come off.

Non-negotiable rule: every agent returns JSON that conforms to a schema you defined before you wrote the agent. If your model supports tool calls or structured outputs, use them. If not, ask for JSON inside fenced code blocks and validate with json.loads + a schema check. If validation fails, retry once with the error message appended, then fail loudly. Don't paper over it.
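
For the no-native-structured-outputs case, a minimal sketch of that retry loop looks like this. jsonschema is the third-party validator package; `complete` is a stand-in for whatever function actually hits your model and returns raw text:

```python
# Sketch: validate the agent's JSON, retry once with the error appended,
# then fail loudly. `complete(messages) -> str` is whatever hits your model.
import json
import jsonschema

FENCE = "`" * 3  # avoid hard-coding the fence characters in this snippet

def extract_json(text: str) -> str:
    """Pull the contents of the first fenced code block, if there is one."""
    if FENCE in text:
        text = text.split(FENCE)[1]
        text = text.removeprefix("json")
    return text.strip()

def call_with_schema(complete, messages: list[dict], schema: dict):
    for _ in range(2):  # first attempt, then exactly one retry
        raw = complete(messages)
        try:
            data = json.loads(extract_json(raw))
            jsonschema.validate(data, schema)
            return data
        except (json.JSONDecodeError, jsonschema.ValidationError) as err:
            messages = messages + [
                {"role": "assistant", "content": raw},
                {"role": "user", "content":
                    f"That output was invalid: {err}. "
                    "Return only valid JSON matching the schema."},
            ]
    raise RuntimeError("agent produced invalid output twice - failing loudly")
```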

This is the difference between a system that runs unattended for 60 days and one that wakes you up every Wednesday.

Step 5: Memory is just files. Stop overthinking it.

For a weekend build, you do not need a vector database. You need three things:

  1. A themes file the analyst reads every run. Hand-edited markdown. Updated when your priorities shift.
  2. A dedup file so you don't brief the same news twice. Just a JSON list of URLs from the last 7 days.
  3. An archive of past outputs. The orchestrator drops a copy of each briefing into memory/briefings/YYYY-MM-DD.md.

That's it. I've run agent systems for months on this layout. The day you genuinely need semantic retrieval — meaning you have so many past artifacts that grep no longer finds the right one — you'll know, and adding a vector store at that point is a one-day job. Don't add it on day one for an imaginary problem.
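
For what it's worth, the dedup piece really is this small. A sketch (the file name and shape are illustrative; in the layout above this role is played by yesterday.json):

```python
# Sketch of the URL dedup: a JSON map of {url: ISO date seen}, pruned to
# the last 7 days on each run. File name is illustrative.
import json
from datetime import date, timedelta
from pathlib import Path

SEEN_FILE = Path("memory/seen_urls.json")

def load_seen() -> dict:
    if not SEEN_FILE.exists():
        return {}
    cutoff = (date.today() - timedelta(days=7)).isoformat()
    seen = json.loads(SEEN_FILE.read_text())
    # ISO dates sort lexicographically, so string comparison is enough
    return {url: day for url, day in seen.items() if day >= cutoff}

def filter_new(items: list[dict]) -> list[dict]:
    seen = load_seen()
    fresh = [item for item in items if item["url"] not in seen]
    seen.update({item["url"]: date.today().isoformat() for item in fresh})
    SEEN_FILE.write_text(json.dumps(seen, indent=2))
    return fresh
```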

Step 6: The critic agent (optional, usually worth it)

If the output goes anywhere a human will read it (email, Slack, blog, customer), add a critic. The critic agent reads the editor's output and answers three questions:

  • Does this look like a draft I'd be happy to publish under my name?
  • Are there factual claims I can't verify from the analyst's input?
  • Is there anything in here that would embarrass me?

The critic doesn't rewrite. It returns a single field — pass or fail with a reason. On fail, the orchestrator kicks the editor once with the critic's note appended. On a second fail, it ships the draft with a flag for human review and an email to you.
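
In orchestrator terms that whole policy is about a dozen lines. A sketch, with run_editor and run_critic standing in for the two agent calls:

```python
# Sketch of the critic loop: one rewrite allowed, then ship with a flag.
# `run_editor` and `run_critic` are stand-ins for the agent calls.
def produce_briefing(run_editor, run_critic, ranked_items):
    draft = run_editor(ranked_items)
    verdict = run_critic(draft)     # expected shape: {"pass": bool, "reason": str}
    if verdict["pass"]:
        return draft, False         # False = no human review needed

    # One kick back to the editor with the critic's note appended.
    draft = run_editor(ranked_items, note=verdict["reason"])
    verdict = run_critic(draft)
    if verdict["pass"]:
        return draft, False

    return draft, True              # True = flag for human review (and email yourself)
```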

This is the cheapest quality jump in the entire system. It costs you maybe $0.02 per run and it catches the 1-in-30 output that would have been embarrassing.

Step 7: The failure modes that will bite you

In rough order of how badly each one hurt me:

  1. Silent partial failures. Stage 2 returns an empty list, stage 3 dutifully turns it into an empty briefing, you ship a blank email. Fix: every stage validates non-empty output before handoff. Empty = fail loudly.
  2. Cost drift. You wire it up, ship it, and three weeks later realise the analyst is summarising every item twice because the prompt includes "think step by step" in two places. Fix: log token counts per stage per run (there's a sketch after this list). Look at them weekly. The graph should be flat.
  3. Theme rot. The themes file you wrote on day one stops matching what you actually care about by week three. The agent gets technically correct, behaviourally useless. Fix: re-read themes.md every Sunday. Edit it. This is the only manual maintenance the system needs and it's worth doing.
  4. Looping critics. A critic that's allowed to reject indefinitely will reject indefinitely. Fix: hard cap at one rewrite, then ship-with-flag.
  5. The orchestrator becoming an agent. Tempting to give the orchestrator "intelligence" — let it decide whether to skip the editor today. Don't. Once the orchestrator has judgment, you have a four-agent system pretending to be three, and the debugging gets twice as hard. Keep the orchestrator deterministic. If you need a planning step, that's a planner agent, called explicitly from the orchestrator.
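
On the cost-drift point, the per-stage logging mentioned above doesn't need to be clever. A sketch, assuming your model client reports token usage (the major provider SDKs do):

```python
# Sketch of per-stage token logging: one CSV row per stage per run.
# Assumes your model client exposes prompt/completion token counts.
import csv
from datetime import datetime
from pathlib import Path

LOG_FILE = Path("memory/token_log.csv")

def log_tokens(stage: str, prompt_tokens: int, completion_tokens: int) -> None:
    write_header = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if write_header:
            writer.writerow(["timestamp", "stage", "prompt_tokens", "completion_tokens"])
        writer.writerow([datetime.now().isoformat(), stage,
                         prompt_tokens, completion_tokens])

# Skim the CSV weekly; per-stage numbers should be roughly flat run to run.
```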

What this looks like by Sunday night

If you start Saturday morning with the work-paragraph and pick something you actually need, by Sunday evening you should have:

  • 4 markdown files describing the roles (≈300 words each).
  • A run.py that reads them, calls the model with structured outputs, and writes intermediate state to disk.
  • A cron entry or a launchd/Task Scheduler job that runs it daily.
  • One real output you'd actually send to yourself.

That's the minimum viable agent system. It's also — in my experience — about 80% of what teams actually need. The remaining 20% is tool integrations and domain-specific judgment, and you only know what you need after the v1 has been running for two weeks.

When to stop building it yourself

The weekend build above is the right move when (a) the work is yours, (b) you'll personally maintain it, and (c) you enjoy the tinkering. It's the wrong move when the work belongs to a team, the system needs to integrate with five internal tools you don't own, and a missed run costs real money.

For that case — multi-agent setup, specialist agents wired into your actual stack, with the routing, memory, and stress-testing already done — there's a pre-packaged version of exactly this build at AI Armory's Business AI Command Center. Six pre-built packages (social media, customer ops, content factory, intelligence dashboard, dev operations, sales engine) plus custom builds, 2-hour workshop, 10 hours of hands-on build work, 45-day support, full documentation. Same architecture I described above, productionised.

But you don't need that to start. You need a paragraph describing the work, four markdown files, and a Saturday.

Go build the v1. Email me the briefing on Monday.
