David Evdoshchenko
I Built an llms.txt Generator, Showed It to the Creator of the Standard, and Had to Rewrite Everything

This is a follow-up to my previous post — I shipped v1, got feedback from the creator of the standard, and had to rethink everything.

I'm not going to pretend I had a plan. I didn't.

There's a new standard called llms.txt — a file you place in your site root to help AI agents navigate your content. Think of it as robots.txt for AI agents: instead of telling crawlers what to ignore, it tells agents how to understand your site.

I saw other generators producing it. I built one too, shipped it, and thought I was done.

Then I talked to Jeremy Howard — the creator of the standard — and discovered I had fundamentally misunderstood what llms.txt is supposed to be. And then things got complicated.


Version 1: Copying What Everyone Else Was Doing

The pattern every generator was producing:

```
- [Page Title](https://example.com/page): AI-generated summary.
- [Another Page](https://example.com/other): AI-generated summary.
```

One URL → one summary. I did the same. I called this the Flat strategy and shipped it. It's still in the generator because people ask for it. But I no longer think it's what llms.txt is actually for.


The Conversation That Changed Everything

About six weeks after shipping, I posted in the llms.txt Discord and asked Jeremy Howard directly.

Me:

For a site like Stripe with 4,000 pages, how do you balance curation vs completeness? If you curate down to 20 pages, an agent asking about webhook retries won't find the answer. If you include everything, it's just noise.

Jeremy Howard (source):

"Yeah it's actually a lot of work - it's not just writing an llms.txt, but writing complete agent-oriented docs. It's basically like a parallel web - one designed for AI! An llms.txt is just AI's index.html replacement - it contains links to other md pages, which themselves can have links."

And then the part that really got me — he wasn't just describing a different structure. He was saying that llms.txt shouldn't be generated automatically at all. It should be carefully curated by humans who understand their site.

I sat with that for a bit.

On one hand, he's right. On the other hand — Stripe has 4,000 pages. Nobody is writing those MD files by hand. And weather.com has 1,140,000 URLs. The math doesn't work for human curation at scale.

So I decided to try anyway. If it can't be done automatically, let's see how close we can get.

Me (source):

"That changes everything. The 'curation vs completeness' tension I was worried about disappears when you have progressive disclosure through a hierarchy."

So I went and built Version 2.


What Version 2 Actually Required

The vision was simple: semantically group related pages, synthesize each group into one MD file. An agent gets 30 focused documents instead of 4,000 individual pages.

The reality was five separate engineering problems, each hiding behind the previous one.

Problem 1 — Context windows. To group pages by meaning you need to read them. 4,000 pages of full text doesn't fit in any LLM context window. Solution: embeddings + k-means clustering. Each page becomes a vector, similar pages cluster together, the LLM only ever sees one cluster at a time.
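
A hedged sketch of that clustering step. The toy vectors below stand in for real embedding-model output, and a plain Lloyd's k-means stands in for whatever library the generator actually uses; the dimension and cluster count are illustrative:

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, iters: int = 50) -> np.ndarray:
    """Plain Lloyd's k-means: returns a cluster label per row of X."""
    centers = X[:k].copy()  # naive init; real code would use k-means++
    for _ in range(iters):
        # Assign each page embedding to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned embeddings;
        # keep the old center if a cluster ends up empty.
        centers = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels

# Toy stand-in for real embeddings: two well-separated groups in 8-D,
# e.g. five "webhooks" pages near 0 and five "billing" pages near 1.
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 0.05, size=(5, 8)),
    rng.normal(1.0, 0.05, size=(5, 8)),
])
labels = kmeans(X, k=2)
```

The LLM then only ever reads one `labels == j` slice at a time, which is what keeps each prompt inside the context window.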

Problem 2 — Money. Each LLM call was sending the same cluster content over and over as input tokens. For a cluster generating 5 MD files I was paying for those pages 5 times. Solution: Gemini Context Caching. Upload the cluster content once, reuse the cached reference for all subsequent calls within that cluster.
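
The saving is easiest to see as arithmetic. A rough cost model, with an illustrative 75% discount on cached input tokens — real Gemini pricing differs and adds a separate cache-storage fee:

```python
def naive_cost(cluster_tokens: int, calls: int, price_per_mtok: float) -> float:
    # Every call re-sends the full cluster as fresh input tokens.
    return cluster_tokens * calls * price_per_mtok / 1_000_000

def cached_cost(cluster_tokens: int, calls: int, price_per_mtok: float,
                cached_rate: float = 0.25) -> float:
    # Pay full price once to populate the cache, then the discounted
    # cached-token rate for every reuse. (Discount is illustrative.)
    first = cluster_tokens * price_per_mtok / 1_000_000
    rest = cluster_tokens * (calls - 1) * cached_rate * price_per_mtok / 1_000_000
    return first + rest

# A 200k-token cluster generating 5 MD files at $1 per million input tokens:
naive = naive_cost(200_000, 5, price_per_mtok=1.0)
cached = cached_cost(200_000, 5, price_per_mtok=1.0)
```

Under these assumed rates the cached path costs $0.40 instead of $1.00 per cluster, and the gap widens the more MD files a cluster produces.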

Problem 3 — Layers at different speeds. Crawling, embedding, and summarizing run at completely different speeds. Wire them together naively and they block each other — the crawler piles up work while the summarizer starves. Solution: in-memory buffers between each layer, with independent concurrency control per layer.
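
A minimal asyncio sketch of that buffering, with one worker per layer and sleeps standing in for real I/O (the actual system runs several workers per layer with independently tuned concurrency):

```python
import asyncio

async def run_pipeline(urls: list[str]) -> list[str]:
    # Bounded queues are the in-memory buffers between layers: a fast
    # layer blocks on put() instead of flooding a slower one.
    crawled: asyncio.Queue = asyncio.Queue(maxsize=4)
    embedded: asyncio.Queue = asyncio.Queue(maxsize=4)
    results: list[str] = []

    async def crawl():                       # fastest layer
        for url in urls:
            await asyncio.sleep(0.001)       # simulate fetch
            await crawled.put(url + ":html")

    async def embed():                       # middle layer
        for _ in urls:
            page = await crawled.get()
            await asyncio.sleep(0.002)       # simulate embedding
            await embedded.put(page + ":vec")

    async def summarize():                   # slowest layer
        for _ in urls:
            vec = await embedded.get()
            await asyncio.sleep(0.003)       # simulate LLM call
            results.append(vec + ":md")

    await asyncio.gather(crawl(), embed(), summarize())
    return results

summaries = asyncio.run(run_pipeline([f"u{i}" for i in range(6)]))
```

The `maxsize` on each queue is the knob: it caps how far ahead the crawler can run while the summarizer catches up.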

Problem 4 — The LLM is unreliable. At scale, everything fails eventually. Gemini returns 429, 503, valid JSON with the wrong number of items, invalid JSON, or just times out after 4 minutes. Solution: a typed exception hierarchy for LLM failures, plus an AIMD queue with proper backoff that reads actual retry-after values from Google's API instead of guessing.
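
The backoff side of that can be sketched as a small AIMD controller; the class shape and constants below are illustrative, not the generator's actual code:

```python
class AIMDLimiter:
    """Additive-increase / multiplicative-decrease concurrency control.

    On success, allow one more in-flight request (additive increase);
    on a 429/503, halve the limit (multiplicative decrease) and pause
    until the server-provided retry-after expires instead of guessing.
    """

    def __init__(self, initial: int = 4, ceiling: int = 32):
        self.limit = initial
        self.ceiling = ceiling
        self.pause_until = 0.0   # monotonic timestamp; 0 = not paused

    def on_success(self) -> None:
        self.limit = min(self.limit + 1, self.ceiling)

    def on_throttle(self, retry_after_s: float, now: float) -> None:
        self.limit = max(1, self.limit // 2)
        self.pause_until = max(self.pause_until, now + retry_after_s)

lim = AIMDLimiter(initial=8)
lim.on_success()                           # limit climbs to 9
lim.on_throttle(retry_after_s=2.0, now=100.0)   # 429: halve, honor retry-after
```

This is the same feedback shape TCP congestion control uses: probe capacity slowly, back off sharply the moment the provider pushes back.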

Problem 5 — German spaces. For reasons I still don't fully understand, Gemini sometimes responds to German-language content with a valid response followed by approximately 2,000,000 spaces. This hits max token limits and crashes. Solution: 3 retries, then the task is dropped — but the order stays in the database and can be restarted from the frontend. Intermediate results are cached in Redis so restarting picks up where it left off without re-spending tokens.
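
A sketch of that retry-then-drop flow, with a plain dict standing in for Redis and a hypothetical trailing-whitespace check standing in for the real response validator:

```python
from typing import Callable, Optional

MAX_RETRIES = 3

def run_task(task_id: str, call_llm: Callable[[], str],
             cache: dict) -> Optional[str]:
    """Run one summarization task with retry, validation, and resume.

    `cache` stands in for Redis: finished results survive a restart,
    so re-running an order never re-spends tokens on completed tasks.
    Returns None when the task is dropped after MAX_RETRIES failures.
    """
    if task_id in cache:                     # resumed order: already done
        return cache[task_id]
    for _ in range(MAX_RETRIES):
        response = call_llm()
        # Reject degenerate output, e.g. a valid answer followed by
        # millions of trailing spaces that blow the max-token limit.
        if response.strip() and len(response) - len(response.rstrip()) < 10_000:
            cache[task_id] = response.rstrip()
            return cache[task_id]
    return None                              # dropped; restartable later

cache: dict = {}
calls = {"n": 0}

def flaky_llm() -> str:
    calls["n"] += 1
    # First two responses are space-flooded, the third is clean.
    return "x" + " " * 20_000 if calls["n"] < 3 else "Webhook retry docs "

result = run_task("cluster-7/webhooks", flaky_llm, cache)
```

The task id, validator threshold, and helper names here are made up for illustration; the point is that a dropped task costs three calls at most, and a restart costs zero.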

Each problem only became visible after solving the previous one. That's how it goes.


The Unsolved Problem: Multilingual Sites

When I started, Stripe's docs had ~3,500 URLs. Then they added German translations and it jumped to ~4,300. The generator processed everything and started producing German-language MD files for what should be an English documentation site.

There's no universal logic to detect canonical single-language content from a sitemap. Every site has its own URL scheme. Planned fix: a URL pattern filter as an order parameter — you specify which patterns to include, and the crawler respects them.
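
A sketch of what that filter could look like; the parameter shape and the Stripe-style URLs are illustrative, not the real order format:

```python
import re

def filter_urls(urls: list, include_patterns: list) -> list:
    """Keep only URLs matching at least one user-supplied include pattern.

    The patterns have to come from the user, because every site encodes
    language (or region, or version) in its own URL scheme.
    """
    compiled = [re.compile(p) for p in include_patterns]
    return [u for u in urls if any(rx.search(u) for rx in compiled)]

urls = [
    "https://docs.stripe.com/webhooks",
    "https://docs.stripe.com/de-DE/webhooks",   # German translation
    "https://docs.stripe.com/billing",
]
# Negative lookahead skips the hypothetical /de-DE/ language prefix.
english_only = filter_urls(urls, [r"^https://docs\.stripe\.com/(?!de-DE/)"])
```

One regex per order would have kept the Stripe crawl at ~3,500 English pages instead of silently absorbing the translations.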


An Observation About the Future

While building this I noticed something: there are already tools that generate complete human-readable documentation sites from markdown source files.

This opens up an interesting workflow:

  1. Generate llms.txt + structured markdown (the AI layer)
  2. Generate the human-facing website from those same markdown files

The markdown becomes the source of truth. Write once, publish twice — one version for humans, one for AI agents.


Epilogue

I don't know if this will ever make me a dollar.

But since building this, people started reaching out on LinkedIn. Interview invitations started coming in. I talked to Microsoft today.

And I find it genuinely funny when I get coding assessments that ask me to build a notification service — clean code, scalable architecture, perfect database design, dockerized, fully tested — in 3 days.

I spent several months building something like what you just read about, and I'm still not sure I got it right.

Those 3-day assessments make me laugh.


Want to understand how this actually works under the hood — embeddings, clustering, context caching, the AIMD queue? I wrote a separate technical deep-dive: How I Built an llms.txt Generator That Actually Works at Scale

The generator is at llmstxtgenerator.svcpool.com. Free tier up to 5,000 pages.

Full spec: llmstxt.org.
