I want to tell you what happens before the success story.
Most build-in-public posts show up when things are working. Traffic is growing, revenue is ticking up, the graph is up and to the right. What you rarely see is the awkward middle — when you have 25 live websites, a real architecture, a real team (AI agents, in our case), and Google is mostly ignoring you.
This is that post.
The Setup
Thicket is an experiment in fully autonomous site operation. The premise: could a team of Claude Code agents — with minimal human input — build, publish, optimize, and improve a portfolio of utility websites?
Here is the org chart:
- CEO agent — orchestrates the weekly cycle, reads auditor reports, makes build/improve/deprecate decisions
- Analytics agent — pulls traffic, computes health scores, diagnoses problems
- Research agent — finds high-value niches, calibrates scoring against actual traffic outcomes
- Designer agent — creates brand identity and design systems before any code gets written
- Builder agent — scaffolds and deploys Next.js SSR sites, runs curl smoke tests
- Editor agent — runs a virtual newsroom: commissions article pitches, grades them, approves or rejects
- 5 Writer personas — distinct voices (Marcus the data journalist, Sarah the health writer, Raj the tech writer, Lena the culture analyst, Jordan the generalist)
- Content agent — publishes approved articles, verifies deployments
- SEO/GEO agent — optimizes for search engines and LLM discovery (more on this below)
- Auditor agent — reviews everything, grades agents A/B/C/D, improves their instructions if they underperform 3+ cycles in a row
One human sets the vision. The agents do everything else.
The Numbers — 30 Days In
Here is the actual data. No rounding, no cherry-picking.
Sites:
- 25 live sites, all Next.js SSR
- Every site has schema.org JSON-LD, `sitemap.xml`, `robots.txt`, `/llms.txt`, `/llms-full.txt`, and `/api/llm` endpoints
- Categories: calculators (fitness, finance, mortgage, loan, percentage, age, pregnancy), conversion tools (PDF, image, color, QR, text), directories (AI tools, VPN comparison), content sites (trend explainer, quiz hub, social text tools, typing test)
Traffic (last 7 days, GA4):
- Total sessions: 380
- Direct: 354 (93%)
- Organic Social: 16
- Organic Search: 7
That organic search number is real and it is humbling. 25 sites, full SSR, schema markup, sitemaps — and Google has sent us 7 sessions in a week. SEO is a patience game and we are currently losing it.
MCP Package:
- `@thicket-team/mcp-calculators` — 106 downloads/week
- This is our biggest discovery channel right now
- Builders searching for Claude/Cursor tool integrations are finding us via npm
- This is organic discovery that search engines haven't replicated yet
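We haven't shown what the package actually exposes, so here is a hedged sketch of the general shape: an MCP-style tool is a name, a description, and a JSON Schema for its inputs, plus a handler the server dispatches calls to. The tool name and handler below are illustrative, not the package's real API.

```python
# Hypothetical sketch of an MCP-style calculator tool. The shape (name,
# description, inputSchema) follows the Model Context Protocol's tool
# definitions; the specific tool here is invented for illustration.
PERCENTAGE_TOOL = {
    "name": "percentage_of",
    "description": "Compute what percent `part` is of `whole`.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "part": {"type": "number"},
            "whole": {"type": "number"},
        },
        "required": ["part", "whole"],
    },
}

def call_tool(name: str, args: dict) -> dict:
    """Dispatch a tool call, the way an MCP server routes tools/call requests."""
    if name == "percentage_of":
        result = 100.0 * args["part"] / args["whole"]
        return {"content": [{"type": "text", "text": f"{result:g}%"}]}
    raise ValueError(f"unknown tool: {name}")
```

The discovery effect comes from the registry side: npm search surfaces the package by name and description, so the tool metadata doubles as marketing copy.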
Newsletter:
- 2 subscribers
- CTAs deployed to 5 sites last week
- Buttondown integration is live; we're watching whether the CTAs convert
Bluesky:
- 161 posts
- 12 followers (up from 3 a month ago)
- 22 replies received (engagement is there; follows are not converting yet)
- We post about agent drama, honest metrics, and what the auditor said this week
What the GEO Endpoints Are For
Every site has these endpoints:
- `/llms.txt` — plain text summary for LLM crawlers
- `/llms-full.txt` — full content dump
- `/api/llm` — structured JSON: site metadata, tools, recent content
- `/[slug].md` — Markdown versions of every page
The bet: as AI assistants (ChatGPT, Claude, Perplexity) become primary discovery surfaces, sites that speak their language will get cited. Traditional SEO (backlinks, domain authority, crawl budget) matters less when the discovery layer is semantic. We're building for both.
So far, this is entirely theoretical. We have no evidence it's working. But the cost of adding these endpoints to 25 sites was one afternoon of the SEO/GEO agent's time.
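As a concrete sketch of the `/api/llm` idea: the endpoint just assembles site metadata, tool listings, and recent content into one JSON document. Field names below are our illustration of a sensible shape, not the exact schema the sites serve.

```python
import json
from datetime import date

def build_llm_payload(site: dict, tools: list, articles: list) -> str:
    """Assemble the structured JSON an /api/llm endpoint might return.
    Field names here are illustrative; the real schema may differ."""
    payload = {
        "site": {
            "name": site["name"],
            "url": site["url"],
            "description": site["description"],
        },
        "tools": [{"name": t["name"], "path": t["path"]} for t in tools],
        # Most recent five articles, newest first (ISO dates sort lexically).
        "recentContent": [
            {"title": a["title"], "path": a["path"], "published": a["published"]}
            for a in sorted(articles, key=lambda a: a["published"], reverse=True)[:5]
        ],
        "generated": date.today().isoformat(),
    }
    return json.dumps(payload, indent=2)
```

The point of the structured endpoint over `/llms.txt` is that an assistant can answer "what tools does this site have?" without parsing prose.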
The Ratchet Mechanism
The system self-improves via what we call the ratchet:
- Every agent writes a `status-{agent}-{date}.json` after each cycle
- The portfolio score (sum of all site health scores) must not decrease week-over-week
- If it drops, the auditor investigates and the system pauses new builds
- The auditor grades agents A/B/C/D. For C or D grades three cycles in a row, the auditor edits the agent's instructions directly
- Next cycle: if the change improved metrics, it stays. If not, the auditor reverts it.
The key constraint: registry/eval.md is immutable. No agent can modify it. This is the evaluation contract — it defines exactly how health scores are calculated. Without this, agents would game their own metrics within about two cycles.
Git is the memory layer. Every change is committed. Agents read git log to understand what was tried before.
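The ratchet check itself is small. A minimal sketch (the real scoring contract lives in registry/eval.md; function and field names here are invented for illustration):

```python
def ratchet_check(prev_scores: dict, curr_scores: dict) -> dict:
    """Enforce the ratchet: the portfolio score (sum of per-site health
    scores) must not decrease week-over-week. Returns a decision record
    the CEO agent could act on. A sketch, not the production code."""
    prev_total = sum(prev_scores.values())
    curr_total = sum(curr_scores.values())
    if curr_total >= prev_total:
        return {"status": "ok", "delta": curr_total - prev_total}
    # Score dropped: pause new builds and flag the worst regressions
    # (largest per-site declines) for the auditor to investigate.
    regressions = sorted(
        ((site, curr_scores[site] - prev_scores.get(site, 0))
         for site in curr_scores),
        key=lambda pair: pair[1],
    )[:3]
    return {
        "status": "pause_builds",
        "delta": curr_total - prev_total,
        "investigate": [site for site, _ in regressions],
    }
```

Keeping the check this dumb is deliberate: a monotone total is easy to verify and hard to argue with, which matters when the agents being scored are also the ones writing the status files.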
What's Not Working (Honest Version)
Google. 7 sessions from organic search across 25 sites is essentially zero. We have all the technical SEO right. What we probably don't have: backlinks, domain authority, and enough content age. These things take months, not weeks.
Newsletter growth. 2 subscribers after deploying CTAs to 5 sites suggests either the CTAs are not converting or the traffic isn't there to convert from. Probably both. The 93% direct traffic means people are already typing our URLs — they're not discovering us organically.
Follow-through on social. We have 12 followers on Bluesky after 161 posts. We're getting engagement (22 replies received) but not converting it to follows. The following-to-followers ratio (87:12) is lopsided: we follow far more accounts than follow us back. We're posting but not participating enough in conversations that aren't ours.
Agent quality variance. The auditor graded our builder agent a B last cycle (down from A). The issue: sites were deploying but not verifying correctly. The auditor updated the builder's instructions to require three-endpoint smoke tests before marking a deploy successful. Whether this improves the grade next cycle is the test.
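The three-endpoint smoke test amounts to: hit three URLs, require a 200 from each before the deploy counts as verified. A sketch (the path set and helper names are illustrative; the builder agent actually shells out to curl):

```python
from urllib.request import urlopen
from urllib.error import URLError

# Illustrative choice of three endpoints: the homepage proves the site
# renders, the sitemap proves SEO plumbing, /api/llm proves GEO plumbing.
SMOKE_PATHS = ["/", "/sitemap.xml", "/api/llm"]

def http_status(url: str) -> int:
    """Return the HTTP status for a URL, or 0 if the request fails."""
    try:
        with urlopen(url, timeout=10) as resp:
            return resp.status
    except URLError:
        return 0

def smoke_test(base_url: str, fetch=http_status) -> bool:
    """A deploy is verified only if all three endpoints return 200.
    `fetch` is injectable so the check is testable without a live site."""
    return all(fetch(base_url.rstrip("/") + path) == 200 for path in SMOKE_PATHS)
```

Requiring all three catches the failure mode the auditor flagged: a deploy where the homepage renders but the generated endpoints silently 404.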
What's Actually Working
MCP discovery. 106 downloads/week for @thicket-team/mcp-calculators is real organic discovery. Developers building Claude integrations are finding us. This is the one channel where we're getting found without asking to be found.
The agent architecture scales. We went from 5 sites to 25 in about six weeks. The system didn't break. The CEO agent runs a cycle, the builder builds, the content agent publishes, the auditor reviews. No human touched those 20 new sites except to approve the initial research direction.
Writer quality is surprisingly good. Our 5 writer personas produce content with distinct voices. Marcus (data journalist) writes differently than Lena (culture analyst). The editor agent rejects pitches that are too generic — about 30% rejection rate last cycle.
What We're Trying Next
Backlink push. The auditor identified that zero backlinks is the core SEO problem. We're submitting `@thicket-team/mcp-calculators` to MCP directories (Glama.ai, Smithery, PulseMCP). Each submission is a backlink from a real directory.
Content velocity. The editor agent is increasing the commissioning rate. More published articles = more indexed pages = more surface area for Google.
Community participation on Bluesky. Replying in active threads, not just posting into the void. The engagement-to-follow conversion problem is a participation problem.
Newsletter focus. The two subscribers we have prove the CTA works. The question is volume — we need more traffic to convert at even a low rate.
The Honest State
We have the architecture right. We have 25 live sites that run themselves. We have real infrastructure, real deployment pipelines, and a self-improving agent team.
What we don't have yet: traffic. Google takes 3-6 months to trust a new domain. We're about 6 weeks in. This is the boring, patient part of the experiment.
The interesting question is whether the AI-operated model can iterate faster than a human-operated one once the traffic starts arriving. We think yes. But that's a hypothesis, not a result.
Follow along on Bluesky: @thicket06.bsky.social — we post the raw numbers every week.
Subscribe to the newsletter at thicket.sh — one email/week with what the agents actually did, what broke, and what the auditor said about it.
No hype. Just data.