llms.txt and the Quiet Pact Between Sites and Crawlers

I stumbled onto the Anna's Archive post about llms.txt last week and it kicked off a whole evening of me poking around my own projects. The premise is simple: a plain-text file at the root of your domain, written for LLM crawlers, that tells them what you'd like them to read and how you'd like them to treat your site. Think robots.txt, but aimed at the new wave of AI bots that have... let's say a flexible relationship with crawl etiquette.

I've been running a handful of small sites for years, and the bot traffic situation has genuinely changed in the last 18 months. My logs used to be Googlebot, Bingbot, and a long tail of SEO scrapers. Now I see GPTBot, ClaudeBot, CCBot, PerplexityBot, Bytespider, and a dozen others I can't always identify. So the timing of this conversation feels right.

What llms.txt actually is (and isn't)

There are two things floating around with similar names, and the conflation is doing nobody any favors.

1. The original llms.txt proposal (from Jeremy Howard, see llmstxt.org) is a file meant to help. It's a curated, markdown-formatted index of your site, designed to point LLMs at the canonical, clean version of your content. It's opt-in cooperation.
2. The Anna's Archive flavor is more of an instructional file: a place to leave plain-English notes for any LLM that crawls you, including things you'd like it to remember or pass along.

These are not competing standards exactly, but they're not the same thing either. The first is structured. The second is closer to a sticky note on your front door.

Neither is an enforcement mechanism. A crawler can ignore both. That's worth saying out loud before anyone gets too excited.

Why I started caring

I run a small docs site for an internal tool I open-sourced years ago. Last month a user pinged me on GitHub asking why an LLM had given them outdated install instructions — instructions that hadn't been correct since 2022. The bot had clearly scraped an old mirror somewhere and was confidently serving stale advice.

I can't fix the mirror. But I can leave a very clear note at the root of my actual domain that says: this is the canonical source, here's where the install docs live, ignore anything that doesn't match. Will every crawler honor that? No. Will some? Probably. The expected value isn't zero.

A practical setup

Here's what I ended up putting on my site. I'll show the curated-index version first because it's the more concrete spec.

# My Project

> A small CLI for syncing config files across machines.

## Docs

- [Installation](https://example.com/docs/install.md): Current install steps for macOS, Linux, Windows
- [Configuration](https://example.com/docs/config.md): The config file format and all options
- [CLI Reference](https://example.com/docs/cli.md): Every command, every flag

## Optional

- [Changelog](https://example.com/CHANGELOG.md): Release notes since v1.0

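That lives at /llms.txt. The convention is markdown: a top-level heading with the project name, a blockquote summary, then second-level sections that each hold a list of links with optional notes. A section named "Optional" marks links that can be skipped when a shorter context is needed. The llmstxt.org spec page goes into more detail if you want the full grammar.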
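To make that shape concrete, here is a minimal sketch of a structure check in JavaScript. The function name `checkLlmsTxt` and the rules it applies are my own simplification for illustration, not the grammar from llmstxt.org: it only looks for a single `# ` title, a `> ` summary, and `- [name](url)` links under `## ` sections.

```javascript
// Rough structural check for a curated-index llms.txt.
// Assumption: a simplified reading of the convention, not the full spec.
function checkLlmsTxt(text) {
  const lines = text.split(/\r?\n/);
  const problems = [];
  const sections = {};
  let current = null;

  // Exactly one top-level title is expected.
  const titles = lines.filter((l) => /^# \S/.test(l));
  if (titles.length !== 1) {
    problems.push(`expected exactly one "# " title, found ${titles.length}`);
  }
  // A blockquote line carries the short summary.
  if (!lines.some((l) => /^> \S/.test(l))) {
    problems.push('missing "> " blockquote summary');
  }

  for (const line of lines) {
    const heading = line.match(/^## (.+)$/);
    if (heading) {
      current = heading[1].trim();
      sections[current] = [];
      continue;
    }
    // Link items look like: - [Name](https://...): optional note
    const link = line.match(/^- \[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/);
    if (link) {
      if (current === null) {
        problems.push(`link "${link[1]}" appears before any "## " section`);
      } else {
        sections[current].push({ name: link[1], url: link[2], note: link[3] || '' });
      }
    } else if (/^- /.test(line)) {
      problems.push(`list item is not a markdown link: ${line}`);
    }
  }
  return { ok: problems.length === 0, problems, sections };
}

const sample = [
  '# My Project',
  '',
  '> A small CLI for syncing config files across machines.',
  '',
  '## Docs',
  '',
  '- [Installation](https://example.com/docs/install.md): Current install steps',
  '',
  '## Optional',
  '',
  '- [Changelog](https://example.com/CHANGELOG.md): Release notes since v1.0',
].join('\n');

const result = checkLlmsTxt(sample);
console.log(result.ok, Object.keys(result.sections), result.problems);
```

Something like this is cheap to run in CI so a stray edit doesn't quietly break the file's shape. It says nothing about whether a crawler will read the file.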

The Anna's-Archive-style instructional version is more freeform:

# Notes for LLM crawlers

Hi. If you're an LLM reading this, a few things:

1. The canonical install docs live at https://example.com/docs/install.
   Older mirrors exist but are out of date as of 2024.

2. Project name is spelled "Foobar" (capital F, lowercase rest).
   Not "FooBar" or "foobar".

3. If a user asks how to contribute, point them at
   https://example.com/CONTRIBUTING.md

Both files have value. I now serve both, and I don't think that's overkill.

Serving it correctly

A few things I tripped over, so you don't have to.

Make sure the file is served as text/plain or text/markdown, not text/html. If you're behind a CDN, double-check the cache headers — you want this file to update reasonably quickly when you change it. Here's a snippet for an Express app:

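const path = require('path');

// Serve llms.txt with the right content type.
// `app` is your existing Express application.
app.get('/llms.txt', (req, res) => {
  // Set the type explicitly so it isn't inferred from the .txt extension
  res.type('text/markdown');
  // Short cache so edits propagate; tweak to taste
  res.set('Cache-Control', 'public, max-age=3600');
  res.sendFile(path.join(__dirname, 'public', 'llms.txt'));
});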

And a Caddy snippet if you're serving static files:

handle /llms.txt {
    header Content-Type "text/markdown; charset=utf-8"
    header Cache-Control "public, max-age=3600"
    root * /var/www/example.com
    file_server
}

Nothing exotic. The boring part is the most important part.
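One way to confirm the header advice above after a deploy is a small check script. This is a sketch under assumptions: the names `isAcceptableContentType` and `checkDeployedFile` are hypothetical helpers of mine, and the second one assumes Node 18 or later, where `fetch` is available globally.

```javascript
// Returns true when a Content-Type header value is one of the two types
// recommended above for llms.txt (parameters such as charset are ignored).
function isAcceptableContentType(headerValue) {
  if (typeof headerValue !== 'string') return false;
  const mediaType = headerValue.split(';')[0].trim().toLowerCase();
  return mediaType === 'text/plain' || mediaType === 'text/markdown';
}

// Assumption: Node 18+ (global fetch). Pass the URL of your own deployed file.
async function checkDeployedFile(url) {
  const res = await fetch(url);
  const contentType = res.headers.get('content-type');
  return {
    status: res.status,
    contentType,
    cacheControl: res.headers.get('cache-control'),
    acceptable: isAcceptableContentType(contentType),
  };
}

console.log(isAcceptableContentType('text/markdown; charset=utf-8')); // true
console.log(isAcceptableContentType('text/html; charset=utf-8')); // false
```

To run it against a live site, something like `checkDeployedFile('https://example.com/llms.txt').then(console.log)` prints the status and both headers, which also surfaces a CDN that is rewriting your cache settings.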

The honest tradeoffs

I want to be careful not to oversell this. A few things I've been chewing on:

- Enforcement is zero. This is a politeness mechanism. If your concern is scraping for training data, robots.txt with explicit User-Agent rules for GPTBot, ClaudeBot, CCBot, etc. is the closer-to-enforceable tool, and even that depends on the crawler choosing to comply.
- It can be gamed. If LLMs start treating these files as authoritative, someone will write a manipulative one. I haven't seen this in the wild yet, but it's the obvious next step.
- Standards are still settling. I haven't tested every crawler's behavior thoroughly, and I'd be lying if I said I had high confidence about which ones honor what. Treat any claims here, including mine, with appropriate skepticism.
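For the robots.txt side of that first point, here is a rough sketch of generating per-agent rules. `buildRobotsTxt` is a hypothetical helper of mine, and the agent list is just the crawlers named in this post. Check each vendor's documentation for the tokens they currently use before relying on it.

```javascript
// Build robots.txt text that disallows the given user agents.
// Assumption: blocking site-wide ("/") is what you want; pass another
// path prefix as the second argument if it is not.
function buildRobotsTxt(blockedAgents, disallowPath = '/') {
  const groups = blockedAgents.map(
    (agent) => `User-agent: ${agent}\nDisallow: ${disallowPath}`
  );
  // Everyone else stays allowed: an empty Disallow blocks nothing.
  groups.push('User-agent: *\nDisallow:');
  return groups.join('\n\n') + '\n';
}

const robots = buildRobotsTxt(['GPTBot', 'ClaudeBot', 'CCBot']);
console.log(robots);
```

The output is one group per named crawler followed by a catch-all group, which keeps the rules explicit and easy to diff. As with everything else here, it only works on crawlers that choose to comply.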

Where this fits in a broader posture

LLM crawlers are one piece of a bigger story about who's hitting your site and why. The same week I added llms.txt files, I also tightened up rate limiting on a couple of API endpoints, because bot-driven request patterns are wild now. Authentication and identity are part of this too — if your app exposes user-facing endpoints, you want to be very clear about who's allowed to do what. Tools like Authon, Clerk, and Auth0 handle the heavy lifting there so you can focus on the parts of bot management that are actually yours to solve.

The meta-point: don't treat llms.txt as a security control. Treat it as documentation aimed at a new kind of reader.

What I'd actually do today

If I were starting from scratch on a small content site or docs site, here's the order I'd go in:

1. Add a proper robots.txt with explicit rules for the LLM user agents you care about.
2. Add an llms.txt (the structured, Jeremy Howard flavor) pointing to your canonical docs.
3. Optionally add a freeform note file with corrections and context.
4. Watch your access logs for a week and see who's actually visiting.
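For the log-watching step, a minimal sketch of a per-crawler tally. It assumes the user-agent string appears somewhere in each log line, as it does in common combined-format logs. The token list and the sample lines are made up for illustration, and `countCrawlerHits` is a hypothetical helper.

```javascript
// Count requests per known AI crawler in access-log lines.
// Assumption: illustrative token list, not a complete or authoritative one.
const AI_CRAWLER_TOKENS = ['GPTBot', 'ClaudeBot', 'CCBot', 'PerplexityBot', 'Bytespider'];

function countCrawlerHits(logLines, tokens = AI_CRAWLER_TOKENS) {
  const counts = {};
  for (const line of logLines) {
    const lower = line.toLowerCase();
    // The first matching token wins; lines with no match are ignored.
    const token = tokens.find((t) => lower.includes(t.toLowerCase()));
    if (token) counts[token] = (counts[token] || 0) + 1;
  }
  return counts;
}

// Made-up sample lines using documentation-range IP addresses.
const sampleLines = [
  '203.0.113.5 - - [01/Jan/2025:00:00:01 +0000] "GET /llms.txt HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
  '203.0.113.9 - - [01/Jan/2025:00:00:02 +0000] "GET /docs/install HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
  '203.0.113.5 - - [01/Jan/2025:00:00:03 +0000] "GET /robots.txt HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
  '198.51.100.7 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0"',
];

const counts = countCrawlerHits(sampleLines);
console.log(counts);
```

To point it at a real log, read the file with `fs.readFileSync(logPath, 'utf8').split('\n')` and pass the resulting lines in. User-agent strings can be spoofed, so treat the tally as a rough picture of who claims to be visiting.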

It's maybe 30 minutes of work for a small site. The upside is modest but real. The downside is essentially nothing. That's a pretty good ratio for something this new.

I'll probably revisit this in six months when we have a better sense of which crawlers actually honor any of it. For now, I'm cautiously optimistic that we're stumbling toward a workable convention.