I deliberately vibe-coded a real product end-to-end. Here's what AI couldn't do for me
A few months ago I decided to run a deliberate experiment: build and ship a real product using AI as much as humanly possible. The Chrome extension itself, the backend, the marketing website at zinkforge.com, the brand assets, the SEO content pipeline, the analytics. The lot. I wanted to know how far vibe coding actually goes when you push it into production, not into toy projects.
The product is Mail2Follow, a Chrome extension that adds follow-up tracking and open detection inside Gmail. It is live on the Chrome Web Store, the Edge Add-ons store, the Google Workspace Marketplace, and integrated with Zapier.
On top of that, I also built an autonomous SEO content system: a seven-agent setup that picks topics, writes posts, generates images, and translates everything, all on a cron. I described it in more detail here.
This article is about everything that experiment taught me. What AI handled well, the parts where it broke completely, and the things I ended up paying a human for anyway.
The "one prompt" myth
If you read X long enough you'll get the impression that modern AI lets you build a product in a single prompt. That is a lie.
Mail2Follow took hundreds of prompts. Bug fixes, edge cases, refactors, alignment passes when the model lost the thread, debugging sessions where I had to walk it back manually. In some of those cases I'd have fixed the bug faster by hand than by prompting. I kept prompting anyway, partly to honour the experiment, partly because the muscle memory of "ask the AI" is real once it works.
Net of all that, the productivity gain is still enormous. I'm just no longer interested in the marketing line.
What AI got right
The breadth is real. A single technical person can plausibly cover:
- Chrome extension scaffolding, manifest, service workers, message passing.
- Cloudflare Workers backend, D1, Turnstile, Astro for the public site.
- Marketing copy in seven languages.
- GA4 + Looker Studio dashboards.
- Privacy policy, terms, support docs, changelogs.
- First drafts of the SEO multi-agent code itself.
If your model of vibe coding is "AI writes the easy parts", that model is wrong. AI writes most of the parts. The interesting question is which parts it cannot write.
Where AI broke #1: injecting into Gmail's UI
Mail2Follow lives inside Gmail. There is no public UI API. You inject DOM elements next to Google's, against a tree where the class names look like `gE iv gt` and change whenever Google refactors a component. Your extension that worked yesterday is broken today, and you do not find out until a user emails you.
AI is close to useless here for three reasons:
- It has no idea what the current Gmail DOM looks like. Its training data is months or years old; Gmail changes faster than that.
- The selectors it suggests are confident and wrong. It pattern-matches what Gmail extensions look like in general, not what the actual current structure is.
- You have to debug live, against a running Gmail tab, with no source map.
What worked:
- Anchor selectors on stable attributes (`data-tooltip`, `aria-label`, `role`) whenever possible.
- Multiple fallback strategies: try the ideal selector, fall back to a structural one, then to text content matching (see the sketch after this list).
- Mutation observers, retries, defensive wrappers around every DOM operation.
- A small monitoring layer that pings me when injection success rate drops in production.
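Here is a minimal TypeScript sketch of that fallback pattern. The selector strings and names are illustrative, not Gmail's actual current attributes and not Mail2Follow's production code:

```typescript
// Minimal sketch of selector fallbacks with illustrative selectors.
type SelectorStrategy = () => HTMLElement | null;

// Walk the strategies in order of preference; a throwing selector must
// never take the whole content script down with it.
function resolveAnchor(strategies: SelectorStrategy[]): HTMLElement | null {
  for (const strategy of strategies) {
    try {
      const el = strategy();
      if (el) return el;
    } catch {
      // Swallow and try the next strategy.
    }
  }
  return null;
}

// Stable attributes first, structure second, visible text last.
const findSendButton = (): HTMLElement | null =>
  resolveAnchor([
    () => document.querySelector<HTMLElement>('[data-tooltip="Send"]'),
    () => document.querySelector<HTMLElement>('[role="toolbar"] [role="button"]'),
    () =>
      Array.from(document.querySelectorAll<HTMLElement>('div[role="button"]')).find(
        (el) => el.textContent?.trim() === 'Send',
      ) ?? null,
  ]);

// Gmail renders asynchronously and re-renders constantly, so injection is
// driven by a MutationObserver instead of a one-shot call at page load.
const observer = new MutationObserver(() => {
  const anchor = findSendButton();
  if (anchor && !anchor.dataset.m2fInjected) {
    anchor.dataset.m2fInjected = 'true';
    // ...inject the follow-up UI next to the anchor, report success or
    // failure to the monitoring layer, etc.
  }
});
observer.observe(document.body, { childList: true, subtree: true });
```

The try/catch around each strategy is the "defensive wrappers" point: in a content script, one thrown selector error can kill an entire injection pass.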
The DOM injection layer alone ate roughly 40 to 50% of the total engineering time on this product. AI helped with maybe 10% of it.
Where AI broke #2: visual taste
Three different fronts, exactly the same problem.
The marketing website. zinkforge.com itself was vibe-coded end-to-end: Astro, Cloudflare Pages, every component and every line of CSS generated by the model. And the first version looked like every other AI-generated SaaS landing page on the internet: pastel gradient, three icons in a row, the same hero layout you've seen on a hundred YC company pages. The fix was not a better prompt. The fix was a `DESIGN.md` file in the repo with explicit references (Google's design language, specific competitor pages I liked, exact spacing rules, typography choices, banned patterns) that the model was forced to read and obey on every change. Same model, dramatically different output once the constraints existed in writing.
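To give a sense of the shape, here is a hypothetical excerpt, not the real file; the actual `DESIGN.md` is longer and specific to zinkforge.com:

```markdown
<!-- Hypothetical DESIGN.md excerpt, for illustration only. -->
## References
- Google's design language (spacing and type scale)
- [specific competitor pages go here]

## Rules
- Base spacing unit: 8px. No arbitrary values.
- One accent colour. No gradients anywhere.

## Banned patterns
- Three feature icons in a row in the hero
- Pastel gradient hero backgrounds
- Emoji in headings
```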
UI inside the extension. Same problem. The first dropdown the AI produced was generic shadcn-flavoured. I had to write explicit rules about visual register, motion, density, and colour discipline before the output stopped looking like everyone else's.
Marketing art. Banners, Product Hunt assets, OG images. AI image generators are confidently wrong about typography and brand consistency. I solved it by writing a custom skill (a structured prompt + reference pack + binary checklist) that I now reuse across products. It's public on my GitHub.
The pattern across the website, the extension UI and the marketing art was the same: AI does not have taste. It has averages. Taste is a constraint you supply in writing; it is not something the model generates on its own. Every time I codified the constraint into a `.md` file in the repo, the output got dramatically better. Every time I left it implicit, the output drifted back to average.
Where AI broke #3: assets I gave up on
Even after the skill iteration, the Chrome Web Store screenshots and the product icon were not where I wanted them. So I did the unglamorous thing: I hired a designer. A few hundred euros, two iterations, done.
The icon and the store screenshots are the things people see before they install. That is the wrong place to be 80% there. Pay the human.
Where AI broke #4: SEO content slop
Building the SEO agent was easier than making it not write slop. The first draft of the system produced perfectly fluent posts that any reader would close within ten seconds because they were unmistakably written by a model that had read every "ultimate guide to email follow-ups" on the internet.
The fix was layered:
- A separate Editor agent that runs the Writer's output through a binary acceptance checklist (no first-person fabrications, no invented stats, no banned opening hooks, no banned transitions); a sketch of that gate follows this list.
- An explicit editorial voice document with concrete examples of what to do and what not to do.
- A reader profile pinning down who we're writing for, so anecdotes have to anchor to named, real-sounding roles.
- Multiple narrow LLM calls per post, not one big call.
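To make "binary acceptance checklist" concrete, here is a minimal TypeScript sketch of the gate. The check names and patterns are hypothetical stand-ins; the real checks are more numerous, and some run through the model itself rather than through regexes:

```typescript
// Sketch of a binary acceptance gate: every check is pass/fail,
// no scores to argue with. Names and patterns are hypothetical.
interface Check {
  name: string;
  passes: (draft: string) => boolean;
}

const checks: Check[] = [
  {
    name: 'no banned opening hooks',
    passes: (d) => !/^(in today's fast-paced world|imagine this:)/i.test(d.trim()),
  },
  {
    name: 'no banned transitions',
    passes: (d) => !/\b(moreover|furthermore|in conclusion)\b/i.test(d),
  },
  {
    name: 'no uncited stats',
    // Naive placeholder: a percentage is only allowed if the draft links a source.
    passes: (d) => !/\d+\s*%/.test(d) || /https?:\/\//.test(d),
  },
];

function editorGate(draft: string): { accepted: boolean; failures: string[] } {
  const failures = checks.filter((c) => !c.passes(draft)).map((c) => c.name);
  return { accepted: failures.length === 0, failures };
}

// On failure, the failing check names go back to the Writer agent as
// feedback for a rewrite; the Editor never patches the draft itself.
```

The binary part is the point: a score invites the model to negotiate with itself, while a failed check just triggers a rewrite.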
The Editor catches most slop. Some still leaks through. The result is better than a typical human first draft; it's not better than my own best writing, and I'm okay with that.
What you can't outsource: debugging
The hardest hours on Mail2Follow were not features. They were circular bugs, the kind where fixing one breaks another. The AI does not hold the full system in its head across long sessions; it suggests confident local fixes that break a distant piece you forgot to mention. You hit a fix-break-fix-break loop and the only way out is to step out of it manually.
What helped me get out:
- Switching to a second LLM for a fresh perspective. Different priors, different blind spots. On the worst bugs I'd consult two or three models in parallel and reconcile them.
- Stepping away and writing on paper what each component actually does, then comparing to what the AI thinks they do.
- Sometimes just fixing it by hand. Faster, cleaner, done.
This is the part nobody mentions when they say "anyone can build a product now". You can vibe-code the happy path. The unhappy paths still need an engineer.
Could a non-developer have done this?
No. Honestly, no.
Not because the AI can't write the code, but because of everything around it: spotting when the model is confidently wrong, reading stack traces, knowing when a "clean fix" is actually a regression waiting to happen, holding the architecture in your head while the AI fixates on the local function, sensing that the bug isn't where the AI says it is.
The parts AI handles best (boilerplate, copy, scaffolding) get you a working prototype. The parts AI handles worst (debugging, taste, judgement on whether output is good enough) quietly determine whether the prototype ever becomes a product.
The publication grind
This wall was not technical. Chrome Web Store reviews are cryptic. The Google Workspace Marketplace needs OAuth verification, a privacy policy on a verified domain, a YouTube demo video, a security review, and brand verification. Edge and Firefox each have their own variants. Each one is a human queue.
AI helped me draft the privacy policy and the permission justifications. The waiting was a human waiting on other humans.
The honest summary
I would run this experiment again. The breadth a single technical person can cover with current AI is genuinely new. Three shipped products this year would have been impossible for me without it.
The line where AI stops:
- Volatile or undocumented surfaces (Gmail's DOM, Google review feedback).
- Visual taste without explicit constraints written down.
- Final-mile assets where the standard has to be high.
- Long debugging sessions across a system the AI never sees end to end.
- Recognising when the model is gaslighting itself with a confident wrong answer.
Vibe coding is real. It is also less hands-off than the demos suggest. Pick projects where the hard part is something AI can actually help with, write down your constraints when AI keeps producing averages, and pay a human for the things that need to be excellent on day one.
Mail2Follow is on the Chrome Web Store, Edge Add-ons, and the Google Workspace Marketplace. The SEO multi-agent code is private but I'm happy to chat about the design. More about what I'm building at zinkforge.com.

