I needed SEO content for a Chrome extension I built. The arithmetic of doing it myself was discouraging: six to eight hours per post if I wanted it to be good, plus another two for translations into Catalan and Spanish (I know, I know). I was not going to write four posts a week. I was not going to write one post a week reliably either.
I knew AI agents could come to the rescue, but they usually run on an expensive token economy, and I'm on a budget, so I had to find creative ways to get the work done without breaking the bank.
So I came up with a system that does it. Seven Claude-backed agents orchestrated through five GitHub Actions workflows, with the human only stepping in to merge a PR (or to ignore it). It runs at around $1.50 per article in three languages, including custom SVG diagrams and Open Graph cards. At my publishing cadence (one post a week), the API spend is roughly $7 a month.
Below is what is actually inside it, with the design choices that made it cheap.

The five flows
Each flow is a workflow file in the consumer-site repo. The agent code itself lives in a separate Python package; each workflow installs it via `pip install git+...@main` and calls a CLI entry point. That separation means I iterate prompts without touching workflows, and bumping the agent version is a one-line change in the workflow file.
- Flow A (weekly content): cron-triggered. Strategist picks the next pending topic from a hand-curated backlog, Writer produces an MDX post, Editor accepts or rejects with structured feedback, retry loop up to three times. Image Generator produces a thumbnail, an OG image, and one to three inline SVG diagrams. Build verification, PR opened with a `content-review` label, I get an email.
- Flow B (PR feedback): triggered when I comment `@seo-agent ...` on the open PR. Writer regenerates the MDX with my instruction. No Editor here, because I am the reviewer at this point.
- Flow C (technical SEO audit): issue-triggered. Technical Analyst reads a Search Console warning I paste in, decides if it is a false positive, opens a fix PR only if not.
- Flow D (monthly topic research): cron-triggered. Topic Generator uses Anthropic's server-side `web_search` tool to read Reddit, indie-developer forums, and competitor blogs, then proposes five to six topics aligned with my reader profile. The PR auto-merges and the backlog refills.
- Flow E (post-merge translation): triggered when I merge Flow A's PR. Translator produces Catalan and Spanish versions, Image Generator translates body-diagram SVG labels per locale, and the translation PR opens and auto-merges in four seconds.
Only Flow A and Flow C produce PRs that need my review. Flow D and Flow E auto-merge because the human gate is upstream (I curate the topic in D, I approved the EN draft in A before E fired).
Per-agent model selection (where the savings come from)
Most multi-agent setups I have read about use one model for everything. That is the wrong abstraction. Each agent has a different job, and each job has a different cost-quality sweet spot.
```python
STRATEGIST_MODEL = "claude-haiku-4-5"          # JSON in, JSON out, cheap
WRITER_MODEL = "claude-opus-4-7"               # long-form prose, voice matters
EDITOR_LLM_MODEL = "claude-haiku-4-5"          # one binary judgment
TECHNICAL_ANALYST_MODEL = "claude-sonnet-4-6"
TOPIC_GENERATOR_MODEL = "claude-sonnet-4-6"    # needs web_search (no Haiku support)
TRANSLATOR_MODEL = "claude-sonnet-4-6"         # structure-preserving translation
IMAGE_GENERATOR_MODEL = "claude-sonnet-4-6"    # SVG with strict layout rules
```
Real per-post breakdown from the runs log:
| Agent | Model | Cost |
|---|---|---|
| Strategist | Haiku 4.5 | $0.01 |
| Writer (1 iteration) | Opus 4.7 | $0.60-0.75 |
| Editor (1 LLM call) | Haiku 4.5 | $0.01 |
| Image Generator (thumbnail + OG + 1 diagram) | Sonnet 4.6 | $0.10 |
| Flow A subtotal | | ~$0.76 |
| Translator x 2 locales | Sonnet 4.6 | $0.20 |
| OG localized x 2 | Sonnet 4.6 | $0.04 |
| Diagram translate x 2 locales | Sonnet 4.6 | $0.14 |
| Flow E subtotal | | ~$0.38 |
| Topic Generator (monthly, amortized over 4 posts) | Sonnet 4.6 + web_search | $0.22 |
| Total per post (three languages, all assets) | | ~$1.36 |
Writer cost is roughly half of the per-post total, and most of Flow A. I tried Sonnet 4.6 first; the prose was technically fine but read as formulaic, with the kind of "X takes care of itself" tics that betray automated generation. Opus produces a noticeably more idiomatic editorial voice. At the volume I publish, the absolute monthly delta is around $5. Worth it.
Prompt caching across the Writer-Editor loop
The Writer can need two or three iterations to pass the Editor. The brief, the voice guide, the product doc, and the component schemas are identical across iterations. Naively re-sending the whole prompt every iteration means paying full input rate three times.
Anthropic's prompt caching solves this. The Writer call splits its prompt into a stable prefix (cached) and a dynamic suffix (the editor feedback and previous draft):
```python
result = client.complete(
    system=WRITER_SYSTEM_PROMPT,
    user=_build_dynamic_suffix(inputs),               # previous draft + editor feedback: changes every iteration
    cached_user_prefix=_build_cached_prefix(inputs),  # brief, voice guide, product doc, schemas: stable
    max_tokens=8192,
    model=WRITER_MODEL,
)
```
The cached prefix is sent once and reused for about five minutes. Iterations 2 and 3 pay roughly 10% of the full input rate on those tokens. On a real post that translates to $0.10-0.15 saved per retry, and Opus retries are not cheap.
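`client.complete` is my own thin wrapper; against the raw Anthropic SDK the equivalent call looks roughly like this (a sketch of the mapping, not the wrapper's actual code; `cached_prefix` and `dynamic_suffix` are illustrative names):

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model=WRITER_MODEL,
    max_tokens=8192,
    system=WRITER_SYSTEM_PROMPT,
    messages=[
        {
            "role": "user",
            "content": [
                # Stable prefix (brief, voice guide, product doc, schemas).
                # cache_control marks it as cacheable for subsequent iterations.
                {
                    "type": "text",
                    "text": cached_prefix,
                    "cache_control": {"type": "ephemeral"},
                },
                # Dynamic suffix (previous draft + editor feedback), never cached.
                {"type": "text", "text": dynamic_suffix},
            ],
        }
    ],
)
```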
The Editor: regex first, Claude only when needed
The Editor runs around 21 deterministic checks before invoking any LLM:
- Frontmatter: date is a quoted ISO string, title under 60 chars, excerpt under 160 chars, slug is strict kebab-case.
- Structure: TldrBlock immediately after intro, anchor links resolve to H2 ids, components match the brief, visual variety (no two consecutive prose-only sections), pull-quote count between one and three, section word counts within tolerance.
- Body content rules: no meta-instruction phrases ("here is how", "in this post we"), no SDR-style EmailMockup copy ("discovery call", "sales sequence"), mandatory Mail2Follow CTA block at the end, product link present at least once.
- Voice and brand: banned-phrase list clean (no "leverage", no em dashes, etc.), no founder names in EmailMockup signatures, no cross-promotion of other products.
These cost nothing. They run on the Writer's output before any Claude call, and they catch about 80% of the failure modes the Writer produces.
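To give a flavour of what those checks look like, here is a simplified sketch of a few frontmatter checks (function and field names are illustrative; the real checks return FailedCheck objects with instructions):

```python
import re

KEBAB_CASE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def frontmatter_problems(fm: dict[str, str]) -> list[str]:
    """Deterministic frontmatter checks: free to run, no LLM involved."""
    problems = []
    if len(fm.get("title", "")) > 60:
        problems.append("title exceeds 60 characters")
    if len(fm.get("excerpt", "")) > 160:
        problems.append("excerpt exceeds 160 characters")
    if not KEBAB_CASE.match(fm.get("slug", "")):
        problems.append("slug is not strict kebab-case")
    return problems
```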
After the deterministic checks, one LLM call to Haiku 4.5 (around $0.01) handles what regex cannot judge: fabricated numeric claims and fabricated personal anecdotes. The check returns structured JSON the pipeline injects into the next Writer iteration as a specific instruction.
This regex-first design is why the Editor is essentially free per pass. Most multi-agent systems I have read about use an LLM critic for everything, which works, but it costs 30-50x more for the same job.
The retry loop with structured feedback
When the Editor rejects, the Writer is called again with the original brief, the previous draft, and a list of failed checks. Each FailedCheck carries a specific, actionable instruction:
```python
FailedCheck(
    check="no_sensory_anecdotes",
    instruction=(
        "The body contains 'a reform contractor I know in Sant Cugat'. "
        "The agent cannot have first-hand sensory or biographical "
        "anecdotes. Rewrite as an archetypal scenario."
    ),
)
```
The Writer reads these in its prompt as a numbered list of items to fix. Up to three iterations. If the Editor still rejects after retry three, the workflow opens a GitHub issue with the failed drafts and exits non-zero. I see the issue in my inbox the next morning.
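Stripped of logging and cost tracking, the loop itself is small. A sketch, with helper names that are illustrative rather than the package's actual API:

```python
MAX_ITERATIONS = 3

draft = None
failed_checks: list[FailedCheck] = []

for iteration in range(1, MAX_ITERATIONS + 1):
    draft = writer.generate(brief, previous_draft=draft, failed_checks=failed_checks)
    verdict = editor.review(draft, brief)        # deterministic checks first, then one Haiku call
    if verdict.accepted:
        break
    failed_checks = verdict.failed_checks        # becomes the numbered fix list in the next prompt
else:
    open_tracking_issue(brief, draft, failed_checks)  # give up and let the human look
    raise SystemExit(1)
```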
Anti-fabrication rules
Early in development the Writer was producing prose like "A reform contractor I know in Sant Cugat sent five quotes one Tuesday in March". Compelling reading. Also entirely invented. The Writer has no body, did not visit places, did not meet contractors. Pretending otherwise is an ethical line.
The current rule:
- The Writer cannot use first-person sensory or biographical constructions ("I saw", "I drove past", "told me over coffee", "a [role] I know in [place]").
- Replace with archetypes: "Consider a freelance architect who sends four proposals a month..."
Deterministic regex catches the obvious patterns; the LLM check catches subtler fabrications.
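The deterministic side is a short list of patterns along these lines (illustrative, not the exact banned list):

```python
import re

FABRICATION_PATTERNS = [
    re.compile(r"\bI (saw|met|visited|drove past|walked past)\b"),
    re.compile(r"\btold me over (coffee|lunch|a beer)\b"),
    re.compile(r"\ba [a-z][a-z ]* I know in [A-Z][\w' -]+"),
]

def sensory_anecdote_hits(body: str) -> list[str]:
    """Return the offending fragments so the FailedCheck instruction can quote them back."""
    return [m.group(0) for pattern in FABRICATION_PATTERNS for m in pattern.finditer(body)]
```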
This is the rule that took the most calibration. The rules that caused the Writer to fabricate were the rules I had written months earlier ("first-person founder voice, use 'I' and 'my', reference real founder experience") to make the prose more vivid. The vividness came from invention. Removing the mandate produced posts that read more reserved but truthful.
Cultural translation, not literal
The Translator started by producing word-perfect Catalan and Spanish that no native speaker would actually write. "Just checking in" became "només volia saber com estàs" — a calque from English that comes out of Google Translate, never a Catalan inbox.
The fix was an idiom-substitution table embedded in the Translator prompt:
| English source | Catalan | Spanish |
|---|---|---|
| just checking in | com anem? | qué tal? |
| circling back | et torno a escriure | vuelvo a escribirte |
| touch base | fer un toc | ponernos en contacto |
| outreach | captació | captación |
| chase an invoice | reclamar una factura | reclamar una factura |
Plus an explicit instruction: the model is authorized to deviate from the literal source when a literal translation would produce a calque. The constraint is faithfulness to the post's intent, not to the source's word order.
Body diagrams (SVG with text labels) get a separate translation pass: an Image Generator function reads the EN SVG and rewrites only the `<text>` content into the target locale, preserving every other byte (geometry, classes, fills, transforms). The Catalan post points at `<localized-slug>-<section>.svg`, the Spanish post at its own variant.
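The byte-preservation requirement rules out parsing and re-serialising the SVG, so the function substitutes text nodes in place. A minimal sketch, assuming the Claude call has already returned a label-to-label mapping (and ignoring `<tspan>` children for brevity):

```python
import re

TEXT_NODE = re.compile(r"(<text\b[^>]*>)(.*?)(</text>)", re.DOTALL)

def localize_svg_labels(svg: str, translations: dict[str, str]) -> str:
    """Rewrite only <text> contents; geometry, classes, fills and transforms stay byte-identical."""
    def replace(match: re.Match) -> str:
        open_tag, label, close_tag = match.groups()
        return open_tag + translations.get(label.strip(), label) + close_tag
    return TEXT_NODE.sub(replace, svg)
```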
Three things that surprised me in production
MDX paragraph-wrapping breaks layout in ways HTML does not. The CTA pill at the bottom of every post is `<a class="cta-pill"><span>Try Mail2Follow free</span><svg .../></a>`. With the text on a separate line in the source, the MDX compiler wraps it in a `<p>` element, which is block-level, which breaks the inline-flex layout and pushes the chevron to its own line. Reference posts authored in raw `.md` were unaffected. Agent-authored `.mdx` posts hit it on every build until I switched the anchor contents to a single line and wrapped the label in a `<span>` as a second line of defence.
Slug paths with hyphens trip greedy regex. The body-diagram src is `/blog-images/diagrams/<slug>-<section>.svg`. Both slug and section are kebab-case strings; a single regex over both cannot determine the split (`follow-up-quote-contractor-no-response-five-day-window` could split anywhere there is a hyphen). The fix is to compile the regex per call against the actual source slug as a literal:
```python
# Escape the known slug so hyphens inside it cannot shift where the split lands.
diagram_re = re.compile(
    rf"/blog-images/diagrams/{re.escape(source_slug)}-(?P<section>[a-z0-9-]+)\.svg"
)
```
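For the path in the example above (assuming, purely for illustration, that the real slug ends at `no-response`), the difference looks like this:

```python
src = "/blog-images/diagrams/follow-up-quote-contractor-no-response-five-day-window.svg"

# A shared greedy pattern puts the boundary at the last hyphen it can get away with.
naive = re.match(r"/blog-images/diagrams/(?P<slug>[a-z0-9-]+)-(?P<section>[a-z0-9-]+)\.svg", src)
naive.group("section")                   # 'window': wrong split, rejected by the slug filter downstream

# The per-call pattern, anchored on the literal source slug, can only split in one place.
source_slug = "follow-up-quote-contractor-no-response"
diagram_re = re.compile(
    rf"/blog-images/diagrams/{re.escape(source_slug)}-(?P<section>[a-z0-9-]+)\.svg"
)
diagram_re.match(src).group("section")   # 'five-day-window'
```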
I caught this on the first real run with body-diagram translation enabled. The pipeline silently produced no localized diagrams because the regex captured the wrong split, the slug filter rejected every match, and the loop iterated zero times. The runs log showed zero diagram_translate entries; the symptom was the absence of expected work.
The cheapest failure to recover from is the one that opens a tracking issue. The pipeline catches AnthropicClientError around each agent call and opens a GitHub issue with the brief, the iteration index, and the error message. When Anthropic deprecated temperature on Opus 4.7, the workflow exited with a clean issue titled "[SEO Agent] Writer call failed for 'X'" instead of a stack trace buried in the CI log. Thirty lines of code; removed all the debugging time.
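The pattern is nothing clever. A sketch of it, assuming the `gh` CLI is available on the runner (GitHub Actions provides it) and using the wrapper's AnthropicClientError mentioned above:

```python
import subprocess

def run_agent_step(agent_name: str, brief_title: str, iteration: int, call):
    """Run one agent call; on an API failure, open a tracking issue instead of dying in the CI log."""
    try:
        return call()
    except AnthropicClientError as exc:
        subprocess.run(
            [
                "gh", "issue", "create",
                "--title", f"[SEO Agent] {agent_name} call failed for '{brief_title}'",
                "--body", f"Iteration {iteration}\n\nError:\n{exc}",
            ],
            check=True,
        )
        raise SystemExit(1)
```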
What I would change if I were starting over
The Editor's binary checklist grew organically from "fix this specific thing the Writer just produced" cycles. Past twenty checks, the right structure is probably to group them by phase (frontmatter, structure, voice, brand) and run them in stages so the rejection message tells the Writer which class of issue to focus on first. Currently every retry tries to fix everything at once, which works but reads as a long list of unrelated complaints in the prompt.
The cost-tracking JSONL is fine for an aggregator command, but a small dashboard with weekly cost-per-post charts would have caught one optimization earlier: the editorial voice pivot dropped Editor LLM calls from two per pass to one, and it took me three weeks to notice the $0.005-per-post saving.
I would also invest sooner in a "sample 10 posts and read them critically" cadence. Most of the prompt-iteration value came from reading actual output, not from designing prompts in the abstract.
The cost question, summarised
Writing 1500-2000 words of editorial content in three languages, with custom diagrams and OG cards, for around $1.50 per post. Including a real human review step (mine) and a four-second auto-merge for the translations. Monthly API spend at one post a week: about $7. Annually: around $85.
A freelance writer + translator + illustrator on the same scope is $400-1000 per post. The system is 0.2-0.4% of that cost, with the trade-off that I do the strategic work (topic curation, prompt iteration, taste review).
It is not magic. It is seven specialized prompts, regex-first quality gates, prompt caching, a retry loop with structured feedback, and a discipline of not pretending the agents have lived an experience. The architecture diagram up top is the system. Everything else was calibration.
What you can take from this
If you build an indie product that needs SEO content without a salaried content team, the architecture is replicable. The hard parts are not the agents themselves. They are:
- The editorial taste they are trained on, which lives in two markdown files (`agent-context/founder-voice.md` and `agent-context/products.md`) you will rewrite many times before the output stops sounding generic.
- The deterministic checks, which look trivial in isolation but compound. Every regex check that catches a specific failure mode is one fewer LLM call you pay for.
- The retry loop with structured feedback, because LLM critics that only say "this is wrong" without an actionable instruction force the next iteration to guess.
If you want to see what the system actually outputs, the Mail2Follow blog runs entirely on this pipeline. The product itself (Mail2Follow) is a Chrome extension that tracks sent emails, drafts tone-matched follow-ups inside Gmail, and respects sensitive domains via the Mozilla Public Suffix List. It is the indie tool I needed the SEO content for in the first place.
If you have built something similar with a different architecture, I would actually like to hear how you handled the editor-critic loop. Mine is binary; the alternatives I have considered (LLM-as-critic with continuous scores, multi-pass with separate stylistic and factual reviewers) all cost more without an obvious quality lift, but I have not run that comparison rigorously.