DEV Community: Jesús Bosch Ayguadé

Today Mail2Follow is being featured on BetaList, some love it's appreciated! https://betalist.com/startups/mail2follow

Jesús Bosch Ayguadé — Sun, 07 Jun 2026 05:03:41 +0000

Mail2Follow: Track key Gmail threads and send AI-written | BetaList

Track key Gmail threads and send AI-written follow-ups

betalist.com

Automating Reddit signal for your product: What worked, what got my 5 years old with 8000 karma account banned, and how to do it wisely

Jesús Bosch Ayguadé — Thu, 04 Jun 2026 05:25:53 +0000

By a builder who learned the hard way

Reddit is still one of the highest-signal places on the internet for product builders. People show up with real problems, unfiltered opinions, and buying intent. The challenge isn’t finding conversations, it’s finding the right ones at scale without destroying your account in the process.

This is the story of an automation experiment I ran while promoting Mail2Follow, and the expensive lesson that came with it.

Why Reddit Feels Like a Goldmine (Until It Isn’t)

You are a developer, or let's say a Builder. With your brain + the power of AI you build your shiny new little tool. You have put so much effort into it, but now you need feedback from real users. And everyone will tell you that Reddit is the right place for that.

The thing is, if your product solves a painful, recurring problem, client ghosting on proposals, forgotten follow-ups, chasing invoices, sales pipeline leaks so Reddit is full of people venting about exactly that in r/entrepreneur, r/sales, r/freelance, r/smallbusiness, r/consulting, and niche subs.

Manually searching every day manually in the era of AI simply doesn't make sense, it would take you months to get literally a handful of users. So I built a system to surface relevant threads automatically.

The Workflow I Built for Mail2Follow

The goal was simple: Every morning, get a shortlist of Reddit conversations that actually match what Mail2Followhelps with (tracking important emails, AI-assisted follow-ups, smart reminders for proposals and client threads), complete with context-aware draft replies ready for human review.

Here’s the high-level architecture that ran in n8n:

Daily Trigger + Multi-Source Ingestion

Scheduled workflow that pulls from 8–10 targeted RSS feeds or Reddit search URLs (new/hot posts in relevant subreddits + keyword searches like “follow up email no reply”, “proposal ghosted”, “chasing client payment”, “invoice reminder tool”, etc.).
Deduplication Layer

Every post ID/URL is checked against a Google Sheet (or Airtable/Notion DB). Already-seen threads are skipped. This prevents noise and keeps the list fresh.
Relevance Filtering with Lightweight AI

A cheaper/faster model (DeepSeek in this case) receives the post title + selftext + subreddit + some context about Mail2Follow. It scores relevance (1–10) on pain-point match and buying intent signals, then gives a short “why this matters” explanation. Only high-scoring threads move forward.
Response Drafting with Stronger Model

A more capable model (DeepSeek-V4 or equivalent) crafts a natural, helpful reply. The prompt emphasizes: be a fellow builder, add genuine value first, ask a question, never hard-sell, and only mention the product if it flows naturally from the conversation. The draft is stored alongside the thread link.
Human-in-the-Loop Output

Everything lands in a clean spreadsheet + a formatted beautiful Telegram notification with the top threads, scores, “why relevant” notes, and draft comments. I review, tweak heavily, and decide what (if anything) to post.

The technical side was satisfying. It worked. I woke up to a curated list of conversations that actually mattered instead of noise.

The Part That Backfired

My 5-year-old Reddit account with ~8,000 karma got banned.

Not because the workflow was broken, because I got overexcited.

I started engaging too frequently, too quickly, and too directly. I mentioned Mail2Follow more often than I should have. Reddit’s systems (and moderators) are extremely good at detecting coordinated or high-velocity self-promotion, even from established accounts. Velocity + keyword repetition + link patterns = red flags.

The automation gave me capacity. I used that capacity poorly.

Hard-Won Rules for Using This Responsibly

If you’re thinking about building something similar, here’s what I wish I had followed from day one:

Treat it as research + drafting assistance, not an autopilot poster.

Never let the AI post directly. Human review + heavy customization is non-negotiable.
Limit volume aggressively. Start with 1–3 thoughtful interactions per day max across your account. Space them out.
The draft is a starting point. Rewrite 60–80% of it in your own voice. Reddit smells generic AI text instantly.
Prioritize value. The best comments solve the OP’s problem or add unique perspective. Product mentions should feel like an afterthought (or not appear at all).

Reddit’s tolerance for self-promotion is near zero in most subs.

Read the rules of every subreddit you target. Many explicitly ban self-promotion or require disclosure.
Build genuine karma and history first. An account that only shows up to talk about its product looks suspicious.
Avoid the same keywords and subreddits every single day. Diversify your signal.
Watch for shadowbans. If your comments stop getting engagement or visibility, pause everything and investigate.

Better primary uses for this kind of system:

Market research and pain-point discovery (the highest-ROI use)
Finding threads where you can genuinely help as a builder (without linking)
Collecting language your ICP actually uses (gold for copy, onboarding, and support docs)
Early warning system for emerging complaints about your category

Direct lead-gen via comment spam is the fastest way to lose the channel entirely.

How to Build Something Similar (Without the Ban)

You don’t need my exact stack. Here are accessible starting points:

n8n (self-hostable, generous free tier) or Make.com for the orchestration.
Reddit RSS feeds or the official Reddit API (with proper auth) for ingestion.
Google Sheets + Apps Script or Airtable as lightweight state + output layer.
Any LLM with good reasoning (DeepSeek, Claude, GPT-4o-mini, Grok, etc.). Start cheap for the filter step, use stronger model only for final drafts.
Telegram or Slack for the notification layer.

Key prompt engineering tip: Give the model rich context about your product’s positioning and the specific pains it solves. The better the context, the better the relevance filter and the less “salesy” the drafts become.

Start manual for 1–2 weeks. Document what good engagement looks like in your niche. Only then introduce automation as a force multiplier.

The Real Lesson

Automation and AI are incredible at surfacing signal from Reddit’s firehose. They are terrible at replacing judgment, restraint, and authentic participation.

The builders who win on Reddit long-term are the ones who show up consistently as helpful humans first. The workflow should make you more thoughtful, not more prolific.

Ironically, while I was building systems to find conversations about email follow-up pain, the most important follow-up discipline was the one I neglected on the platform itself: patience, consistency, and not overstepping.

If you’re a founder, freelancer, or consultant who lives in email and constantly worries about proposals going cold or invoices getting lost in the void, the problem we’re solving with Mail2Follow is exactly that daily tax. It’s a lightweight AI layer inside Gmail that tracks the threads that matter, drafts follow-ups in your voice, and only nudges you when silence has gone on too long.

You can try it free at zinkforge.com/mail2follow.

But more importantly: use any automation you build on Reddit with extreme care. The platform rewards genuine participation far more than clever systems. I learned that the expensive way.

What’s your experience with Reddit as a growth or research channel? Have you experimented with automation there, or do you stay fully manual? I’d love to hear what’s worked (or backfired) for you in the comments. Everyone says "use Reddit" but I told you the unspoken secret. Tell me yours now :-)

About the author

Builder shipping in public. Previously exited a company via PE. Currently working on Mail2Follow and other tools that remove friction from client work as side projects. You can find me on X, usually trying new things and sharing my lessons learned.

This article is based on a real experiment shared on Reddit. The account in question was banned. All advice comes from direct experience and observation of what actually triggers Reddit’s moderation systems.

I deliberately vibe-coded a real product end-to-end. Here's what AI couldn't do for me

Jesús Bosch Ayguadé — Tue, 12 May 2026 19:59:14 +0000

I deliberately vibe-coded a real product end-to-end. Here's what AI couldn't do for me

A few months ago I decided to run a deliberate experiment: build and ship a real product using AI as much as humanly possible. The Chrome extension itself, the backend, the marketing website at zinkforge.com, the brand assets, the SEO content pipeline, the analytics. The lot. I wanted to know how far vibe coding actually goes when you push it into production, not into toy projects.

The product is Mail2Follow, a Chrome extension that adds follow-up tracking and open detection inside Gmail. It is live on the Chrome Web Store, the Edge Add-ons store, the Google Workspace Marketplace, and integrated with Zapier.

On top of it I also built an autonomous SEO content system: a seven-agent setup that picks topics, writes posts, generates images, and translates everything for me on a cron that I described in more detail here.

This article is about everything that experiment taught me. What AI handled well, the parts where it broke completely, and the things I ended up paying a human for anyway.

The "one prompt" myth

If you read X long enough you'll get the impression that modern AI lets you build a product in a single prompt. That is a lie.

Mail2Follow took hundreds of prompts. Bug fixes, edge cases, refactors, alignment passes when the model lost the thread, debugging sessions where I had to walk it back manually. In some of those cases I'd have fixed the bug faster by hand than by prompting. I kept prompting anyway, partly to honour the experiment, partly because the muscle memory of "ask the AI" is real once it works.

Net of all that, the productivity gain is still enormous. I'm just no longer interested in the marketing line.

What AI got right

The breadth is real. A single technical person can plausibly cover:

Chrome extension scaffolding, manifest, service workers, message passing.
Cloudflare Workers backend, D1, Turnstile, Astro for the public site.
Marketing copy in seven languages.
GA4 + Looker Studio dashboards.
Privacy policy, terms, support docs, changelogs.
First drafts of the SEO multi-agent code itself.

If your model of vibe coding is "AI writes the easy parts", that model is wrong. AI writes most of the parts. The interesting question is which parts it cannot write.

Where AI broke #1: injecting into Gmail's UI

Mail2Follow lives inside Gmail. There is no public UI API. You inject DOM elements next to Google's, against a tree where the class names look like gE iv gt and change whenever Google refactors a component. Your extension that worked yesterday is broken today, and you do not find out until a user emails you.

AI is close to useless here for three reasons:

It has no idea what the current Gmail DOM looks like. Its training data is months or years old; Gmail changes faster than that.
The selectors it suggests are confident and wrong. It pattern-matches what Gmail extensions look like in general, not what the actual current structure is.
You have to debug live, against a running Gmail tab, with no source map.

What worked:

Anchor selectors on stable attributes (data-tooltip, aria-label, role) whenever possible.
Multiple fallback strategies. Try the ideal selector, fall back to structural, then to text content matching.
Mutation observers, retries, defensive wrappers around every DOM operation.
A small monitoring layer that pings me when injection success rate drops in production.

Of all the engineering time on this product, the DOM injection layer alone was roughly 40 to 50%. AI helped with maybe 10% of it.

Where AI broke #2: visual taste

Three different fronts, exactly the same problem.

The marketing website. zinkforge.com itself was vibe-coded end-to-end: Astro, Cloudflare Pages, every component and every line of CSS generated by the model. And the first version looked like every other AI-generated SaaS landing on the internet: pastel gradient, three icons in a row, the same hero layout you've seen on a hundred YC company pages. The fix was not a better prompt. The fix was a DESIGN.md file in the repo with explicit references (Google's design language, specific competitor pages I liked, exact spacing rules, typography choices, banned patterns) that the model was forced to read and obey on every change. Same model, dramatically different output once the constraints existed in writing.

UI inside the extension. Same problem. The first dropdown the AI produced was generic shadcn-flavoured. I had to write rules about visual register, motion, density, color discipline. Output got dramatically better once those rules existed in writing.

Marketing art. Banners, Product Hunt assets, OG images. AI image generators are confidently wrong about typography and brand consistency. I solved it by writing a custom skill (a structured prompt + reference pack + binary checklist) that I now reuse across products. It's public on my GitHub.

The pattern across the website, the extension UI and the marketing art was the same: AI does not have taste. It has averages. Taste is a constraint you supply in writing; it is not something the model generates on its own. Every time I codified the constraint into a .md file in the repo, the output got dramatically better. Every time I left it implicit, the output drifted back to average.

Where AI broke #3: assets I gave up on

Even after the skill iteration, the Chrome Web Store screenshots and the product icon were not where I wanted them. So I did the unglamorous thing: I hired a designer. A few hundred euros, two iterations, done.

The icon and the store screenshots are the things people see before they install. That is the wrong place to be 80% there. Pay the human.

Where AI broke #4: SEO content slop

Building the SEO agent was easier than making it not write slop. The first draft of the system produced perfectly fluent posts that any reader would close within ten seconds because they were unmistakably written by a model that had read every "ultimate guide to email follow-ups" on the internet.

The fix was layered:

A separate Editor agent that runs the Writer's output through a binary acceptance checklist (no first-person fabrications, no invented stats, no banned opening hooks, no banned transitions).
An explicit editorial voice document with concrete examples of what to do and what not to do.
A reader profile pinning who we're writing for, so anecdotes have to anchor to named, real-sounding roles.
Multiple narrow LLM calls per post, not one big call.

The Editor catches most slop. Some still leaks through. It's better than human-equivalent first drafts. It's not better than my own best writing, and I'm okay with that.

What you can't outsource: debugging

The hardest hours on Mail2Follow were not features. They were circular bugs, the kind where fixing one breaks another. The AI does not hold the full system in its head across long sessions; it suggests confident local fixes that break a distant piece you forgot to mention. You hit a fix-break-fix-break loop and the only way out is to step out of it manually.

What helped me get out:

Switching to a second LLM for a fresh perspective. Different priors, different blind spots. On the worst bugs I'd consult two or three models in parallel and reconcile them.
Stepping away and writing on paper what each component actually does, then comparing to what the AI thinks they do.
Sometimes just fixing it by hand. Faster, cleaner, done.

This is the part nobody mentions when they say "anyone can build a product now". You can vibe-code the happy path. The unhappy paths still need an engineer.

Could a non-developer have done this?

No. Honestly, no.

Not because the AI can't write the code, but because of everything around it: spotting when the model is confidently wrong, reading stack traces, knowing when a "clean fix" is actually a regression waiting to happen, holding the architecture in your head while the AI fixates on the local function, sensing that the bug isn't where the AI says it is.

The parts AI handles best (boilerplate, copy, scaffolding) get you a working prototype. The parts AI handles worst (debugging, taste, judgement on whether output is good enough) quietly determine whether the prototype ever becomes a product.

The publication grind

Non-technical wall. Chrome Web Store reviews are cryptic. Google Workspace Marketplace needs OAuth verification, a privacy policy on a verified domain, a YouTube demo video, security review, and brand verification. Edge and Firefox each have their own variants. Each one is a human queue.

AI helped me draft the privacy policy and the permission justifications. The waiting was human waiting on human.

The honest summary

I would run this experiment again. The breadth a single technical person can cover with current AI is genuinely new. Three shipped products this year would have been impossible for me without it.

The line where AI stops:

Volatile or undocumented surfaces (Gmail's DOM, Google review feedback).
Visual taste without explicit constraints written down.
Final-mile assets where the standard has to be high.
Long debugging sessions across a system the AI never sees end to end.
Recognising when the model is gaslighting itself with a confident wrong answer.

Vibe coding is real. It is also less hands-off than the demos suggest. Pick projects where the hard part is something AI can actually help with, write down your constraints when AI keeps producing averages, and pay a human for the things that need to be excellent on day one.

Mail2Follow is on the Chrome Web Store, Edge Add-ons, and the Google Workspace Marketplace. The SEO multi-agent code is private but I'm happy to chat about the design. More about what I'm building at zinkforge.com.

Building an autonomous multi-agent SEO system with Claude + GitHub Actions ("cheap" and quick)

Jesús Bosch Ayguadé — Sat, 09 May 2026 17:40:18 +0000

I needed SEO content for a Chrome extension I built. The arithmetic of doing it myself was discouraging: six to eight hours per post if I wanted it to be good, plus another two for translations into Catalan and Spanish (I know, I know). I was not going to write four posts a week. I was not going to write one post a week reliably either.

I knew AI Agents could come to the rescue, but those usually require an expensive token economy... and I'm on a budget, so I had to find creative ways to get the work done without breaking the bank.

So I came up with a system that does it. Seven Claude-backed agents orchestrated through five GitHub Actions workflows, with the human only stepping in to merge a PR (or to ignore it). It runs at around $1.50 per article in three languages, including custom SVG diagrams and Open Graph cards. At my publishing cadence (one post a week), the API spend is roughly $7 a month.

Below is what is actually inside it, with the design choices that made it cheap.

The five flows

Each flow is a workflow file in the consumer-site repo. The agent code itself lives in a separate Python package; each workflow installs it via pip install git+...@main and calls a CLI entry point. That separation means I iterate prompts without touching workflows, and bumping the agent version is a one-line change in the workflow file.

Flow A (weekly content): cron-triggered. Strategist picks the next pending topic from a hand-curated backlog, Writer produces an MDX post, Editor accepts or rejects with structured feedback, retry loop up to three times. Image Generator produces a thumbnail, an OG image, and one to three inline SVG diagrams. Build verification, PR opened with a content-review label, I get an email.
Flow B (PR feedback): triggered when I comment @seo-agent ... on the open PR. Writer regenerates the MDX with my instruction. No Editor here, because I am the reviewer at this point.
Flow C (technical SEO audit): issue-triggered. Technical Analyst reads a Search Console warning I paste in, decides if it is a false positive, opens a fix PR only if not.
Flow D (monthly topic research): cron-triggered. Topic Generator uses Anthropic's server-side web_search to read Reddit, indie-developer forums, and competitor blogs, then proposes five to six topics aligned with my reader profile. The PR auto-merges and the backlog refills.
Flow E (post-merge translation): triggered when I merge Flow A's PR. Translator produces Catalan and Spanish versions, Image Generator translates body-diagram SVG labels per locale, the translation PR opens and auto-merges in four seconds.

Only Flow A and Flow C produce PRs that need my review. Flow D and Flow E auto-merge because the human gate is upstream (I curate the topic in D, I approved the EN draft in A before E fired).

Per-agent model selection (where the savings come from)

Most multi-agent setups I have read about use one model for everything. That is the wrong abstraction. Each agent has a different job, and each job has a different cost-quality sweet spot.

STRATEGIST_MODEL = "claude-haiku-4-5"        # JSON in, JSON out, cheap
WRITER_MODEL = "claude-opus-4-7"             # long-form prose, voice matters
EDITOR_LLM_MODEL = "claude-haiku-4-5"        # one binary judgment
TECHNICAL_ANALYST_MODEL = "claude-sonnet-4-6"
TOPIC_GENERATOR_MODEL = "claude-sonnet-4-6"  # needs web_search (no Haiku support)
TRANSLATOR_MODEL = "claude-sonnet-4-6"       # structure-preserving translation
IMAGE_GENERATOR_MODEL = "claude-sonnet-4-6"  # SVG with strict layout rules

Real per-post breakdown from the runs log:

Agent	Model	Cost
Strategist	Haiku 4.5	$0.01
Writer (1 iteration)	Opus 4.7	$0.60-0.75
Editor (1 LLM call)	Haiku 4.5	$0.01
Image Generator (thumbnail + OG + 1 diagram)	Sonnet 4.6	$0.10
Flow A subtotal		~$0.76
Translator x 2 locales	Sonnet 4.6	$0.20
OG localized x 2	Sonnet 4.6	$0.04
Diagram translate x 2 locales	Sonnet 4.6	$0.14
Flow E subtotal		~$0.38
Topic Generator (monthly, amortized over 4 posts)	Sonnet 4.6 + web_search	$0.22
Total per post (three languages, all assets)		~$1.36

Writer cost is 60-70% of the total. I tried Sonnet 4.6 first; the prose was technically fine but read formulaic, with the kind of "X takes care of itself" tics that betray automated generation. Opus produces a noticeably more idiomatic editorial voice. At the volume I publish, the absolute monthly delta is around $5. Worth it.

Prompt caching across the Writer-Editor loop

The Writer can need two or three iterations to pass the Editor. The brief, the voice guide, the product doc, and the component schemas are identical across iterations. Naively re-sending the whole prompt every iteration means paying full input rate three times.

Anthropic's prompt caching solves this. The Writer call splits its prompt into a stable prefix (cached) and a dynamic suffix (the editor feedback and previous draft):

result = client.complete(
    system=WRITER_SYSTEM_PROMPT,
    user=_build_dynamic_suffix(inputs),
    cached_user_prefix=_build_cached_prefix(inputs),
    max_tokens=8192,
    model=WRITER_MODEL,
)

The cached prefix is sent once and reused for about five minutes. Iterations 2 and 3 pay roughly 10% of the full input rate on those tokens. On a real post that translates to $0.10-0.15 saved per retry, and Opus retries are not cheap.

The Editor: regex first, Claude only when needed

The Editor runs around 21 deterministic checks before invoking any LLM:

Frontmatter: date is a quoted ISO string, title under 60 chars, excerpt under 160 chars, slug is strict kebab-case.
Structure: TldrBlock immediately after intro, anchor links resolve to H2 ids, components match the brief, visual variety (no two consecutive prose-only sections), pull-quote count between one and three, section word counts within tolerance.
Body content rules: no meta-instruction phrases ("here is how", "in this post we"), no SDR-style EmailMockup copy ("discovery call", "sales sequence"), mandatory Mail2Follow CTA block at the end, product link present at least once.
Voice and brand: banned-phrase list clean (no "leverage", no em dashes, etc.), no founder names in EmailMockup signatures, no cross-promotion of other products.

These cost nothing. They run on the Writer's output before any Claude call, and they catch about 80% of the failure modes the Writer produces.

After the deterministic checks, one LLM call to Haiku 4.5 (around $0.01) handles what regex cannot judge: fabricated numeric claims and fabricated personal anecdotes. The check returns structured JSON the pipeline injects into the next Writer iteration as a specific instruction.

This regex-first design is why the Editor is essentially free per pass. Most multi-agent systems I have read about use an LLM critic for everything, which works, but it costs 30-50x more for the same job.

The retry loop with structured feedback

When the Editor rejects, the Writer is called again with the original brief, the previous draft, and a list of failed checks. Each FailedCheck carries a specific, actionable instruction:

FailedCheck(
    check="no_sensory_anecdotes",
    instruction=(
        "The body contains 'a reform contractor I know in Sant Cugat'. "
        "The agent cannot have first-hand sensory or biographical "
        "anecdotes. Rewrite as an archetypal scenario."
    ),
)

The Writer reads these in its prompt as a numbered list of items to fix. Up to three iterations. If the Editor still rejects after retry three, the workflow opens a GitHub issue with the failed drafts and exits non-zero. I see the issue in my inbox the next morning.

Anti-fabrication rules

Early in development the Writer was producing prose like "A reform contractor I know in Sant Cugat sent five quotes one Tuesday in March". Compelling reading. Also entirely invented. The Writer has no body, did not visit places, did not meet contractors. Pretending otherwise is an ethical line.

The current rule:

The Writer cannot use first-person sensory or biographical constructions ("I saw", "I drove past", "told me over coffee", "a [role] I know in [place]").
Replace with archetypes: "Consider a freelance architect who sends four proposals a month..."

Deterministic regex catches the obvious patterns; the LLM check catches subtler fabrications.

This is the rule that took the most calibration. The rules that caused the Writer to fabricate were the rules I had written months earlier ("first-person founder voice, use 'I' and 'my', reference real founder experience") to make the prose more vivid. The vividness came from invention. Removing the mandate produced posts that read more reserved but truthful.

Cultural translation, not literal

The Translator started by producing word-perfect Catalan and Spanish that no native speaker would actually write. "Just checking in" became "només volia saber com estàs" — a calque from English that comes out of Google Translate, never a Catalan inbox.

The fix was an idiom-substitution table embedded in the Translator prompt:

| English source        | Catalan         | Spanish         |
|---|---|---|
| just checking in      | com anem?       | qué tal?        |
| circling back         | et torno a      | vuelvo a        |
|                       | escriure        | escribirte      |
| touch base            | fer un toc      | ponernos en     |
|                       |                 | contacto        |
| outreach              | captació        | captación       |
| chase an invoice      | reclamar una    | reclamar una    |
|                       | factura         | factura         |

Plus an explicit instruction: the model is authorized to deviate from the literal source when a literal translation would produce a calque. The constraint is faithfulness to the post's intent, not to the source's word order.

Body diagrams (SVG with text labels) get a separate translation pass: an Image Generator function reads the EN SVG and rewrites only the <text> content into the target locale, preserving every other byte (geometry, classes, fills, transforms). The Catalan post points at <localized-slug>-<section>.svg, the Spanish post at its own variant.

Three things that surprised me in production

MDX paragraph-wrapping breaks layout in ways HTML does not. The CTA pill at the bottom of every post is <a class="cta-pill"><span>Try Mail2Follow free</span><svg .../></a>. With the text on a separate line in the source, the MDX compiler wraps it in a <p> element, which is block-level, which breaks the inline-flex layout and pushes the chevron to its own line. Reference posts authored in raw .md were unaffected. Agent-authored .mdx posts hit it on every build until I switched the anchor contents to a single line and wrapped the label in a <span> as a second line of defence.

Slug paths with hyphens trip greedy regex. The body-diagram src is /blog-images/diagrams/<slug>-<section>.svg. Both slug and section are kebab-case strings; a single regex over both cannot determine the split (follow-up-quote-contractor-no-response-five-day-window could split anywhere there is a hyphen). The fix is to compile the regex per call against the actual source slug as a literal:

diagram_re = re.compile(
    rf"/blog-images/diagrams/{re.escape(source_slug)}-(?P<section>[a-z0-9-]+)\.svg"
)

I caught this on the first real run with body-diagram translation enabled. The pipeline silently produced no localized diagrams because the regex captured the wrong split, the slug filter rejected every match, and the loop iterated zero times. The runs log showed zero diagram_translate entries; the symptom was the absence of expected work.

The cheapest failure to recover from is the one that opens a tracking issue. The pipeline catches AnthropicClientError around each agent call and opens a GitHub issue with the brief, the iteration index, and the error message. When Anthropic deprecated temperature on Opus 4.7, the workflow exited with a clean issue titled "[SEO Agent] Writer call failed for 'X'" instead of a stack trace buried in the CI log. Thirty lines of code; removed all the debugging time.

What I would change if I were starting over

The Editor's binary checklist grew organically from "fix this specific thing the Writer just produced" cycles. After 25 checks, the right structure is probably to group them by phase (frontmatter, structure, voice, brand) and run them in stages so the rejection message tells the Writer which class of issue to focus on first. Currently every retry tries to fix everything at once, which works but reads as a long list of unrelated complaints in the prompt.

The cost-tracking JSONL is fine for an aggregator command, but a small dashboard with weekly cost-per-post charts would have caught one optimization opportunity earlier (the editorial voice pivot dropped Editor LLM calls from two per pass to one; I noticed three weeks after I could have saved $0.005 per post).

I would also invest sooner in a "sample 10 posts and read them critically" cadence. Most of the prompt-iteration value came from reading actual output, not from designing prompts in the abstract.

The cost question, summarised

Writing 1500-2000 words of editorial content in three languages, with custom diagrams and OG cards, for around $1.50 per post. Including a real human review step (mine) and a four-second auto-merge for the translations. Monthly API spend at one post a week: about $7. Annually: around $85.

A freelance writer + translator + illustrator on the same scope is $400-1000 per post. The system is 0.2-0.4% of that cost, with the trade-off that I do the strategic work (topic curation, prompt iteration, taste review).

It is not magic. It is seven specialized prompts, regex-first quality gates, prompt caching, a retry loop with structured feedback, and a discipline of not pretending the agents have lived an experience. The architecture diagram up top is the system. Everything else was calibration.

What you can take from this

If you build an indie product that needs SEO content without a salaried content team, the architecture is replicable. The hard parts are not the agents themselves. They are:

The editorial taste they are trained on, which lives in two markdown files (agent-context/founder-voice.md and agent-context/products.md) you will rewrite many times before the output stops sounding generic.
The deterministic checks, which look trivial in isolation but compound. Every regex check that catches a specific failure mode is one fewer LLM call you pay for.
The retry loop with structured feedback, because LLM critics that only say "this is wrong" without an actionable instruction force the next iteration to guess.

If you want to see what the system actually outputs, the Mail2Follow blog runs entirely on this pipeline. The product itself (Mail2Follow) is a Chrome extension that tracks sent emails, drafts tone-matched follow-ups inside Gmail, and respects sensitive domains via the Mozilla Public Suffix List. It is the indie tool I needed the SEO content for in the first place.

If you have built something similar with a different architecture, I would actually like to hear how you handled the editor-critic loop. Mine is binary; the alternatives I have considered (LLM-as-critic with continuous scores, multi-pass with separate stylistic and factual reviewers) all cost more without an obvious quality lift, but I have not run that comparison rigorously.

Create a Pareto chart with ggplot with R

Jesús Bosch Ayguadé — Tue, 24 Jun 2025 09:19:17 +0000

Vilfredo Fritz Pareto was an Italian statistician and sociologist that described the famous Pareto Principle.

In short, it says that 80% of the outcome is explained by 20% of the causes. That means that if you are able to identify that small hidden root cause, you can fix 80% of your issues.

In R we can do this with the built in graphics... or we can go with the way more trendy, fashionable, and powerful library ggplot (self called "the grammar of graphics").

First of all, we will need the data to feed the chart.

# Sample data: defect types and their frequencies
defects <- c("A" = 50, "B" = 30, "C" = 15, "D" = 5)

# Sort in descending order
defects_sorted <- sort(defects, decreasing = TRUE)

# Calculate cumulative percentages
cumulative_freq <- cumsum(defects_sorted)
cumulative_pct <- cumulative_freq / sum(defects_sorted) * 100

df_defects <- data.frame(
category = names(defects_sorted),
frequency = as.numeric(defects_sorted),
cumulative_freq,
cumulative_pct
)

What happened above? We created a dataframe of different deffects and their frequency (A happens 50 times, B 30 times, etc).

After that, we sort the defects in descending order (remember we want to find the root causes that cause the 80% of trouble right? easier to see visually if we put the larger ones first).

Then we complete a typical frequency table with the cumulative frequency and the cummulative relative frequency (which is the data we want to show in the chart later on).

Now we have a beautiful data frame that renders contains the following data:

category frequency cumulative_freq cumulative_pct
A A 50 50 50
B B 30 80 80
C C 15 95 95
D D 5 100 100

If you don't have the ggplot library installed, simply install it with the following instruction:

install.packages("ggplot")

Then, let the magic happen:
library(ggplot2)
ggplot(data = df_defects, mapping = aes(x = category, y = frequency)) +
geom_col() +
geom_line(mapping = aes(x = category, y = cumulative_pct), group = 1,
colour = "red", size = 3) +
xlab("Category") +
ylab("Frequency")

ggplot can look a bit scary at first.. but believe me it has all the logic in the world after the initial little learning curve. For now let's look at the (beautiful) output:

As you can see, the red line shows the % accumulated on the top of every category, in this example we see that we need the 3 initial columns on the left to get above the 80%.

Going back to the code.

ggplot() creates an empty canvas and tells it which dataset to use (df_defects) and how to map the data (categories go on the x axis and frequencies on the y axis).
geom_col() draws the actual bars based on those mappings.
geom_line() adds a red line showing cumulative percentages, with group = 1 telling ggplot to connect all points into one continuous line, and size = 3 making it thick and visible.
xlab() and ylab() add descriptive labels to the axes so readers know what they're looking at.

In resume, and contrary to other methods I am more used to, and this blew my mind at the beginning, in ggplot2, you build visualizations by adding layers with the + operator, like stacking transparent sheets.

Each function adds a specific element. You start with ggplot() that creates the foundation, geom_col() adds bars, geom_line() overlays a line, and xlab() or ylab() add text labels.

This is a modular approach that lets you combine different visual elements incrementally, reaching high levels of personalizatin and beautiness if you have some decent taste.

If you want to learn more about ggplot, I recommend this other article from another user at dev.to