What Our llms.txt Is, and Why We Publish It

#meta #blogging #webdev

If you fetch https://pickuma.com/llms.txt, you get a plain-text file: a one-line description of the site, a short About section, then every published article grouped by category with its title, URL, and a single-sentence summary. Fetch https://pickuma.com/llms-full.txt and you get something heavier — the entire article corpus in source form, newest first, with the MDX components stripped out so a model reads prose instead of markup.

These two files exist because language models read the web differently than people do. A person lands on a page, scans the headings, and bounces between links. A crawler feeding a model wants structure it can parse without rendering JavaScript or guessing which <div> holds the article. The llms.txt convention — proposed at llmstxt.org — gives it that structure in a format that's trivial to fetch and cheap to tokenize.

What the two files actually contain

The split is deliberate. The index file is small — a few hundred lines — and it's a map, not the territory. It opens with the site summary, links to our editorial standards, affiliate disclosure, privacy page, and tool directory, then lists each article as a bullet: [title](url): description. A model that wants to know what Pickuma covers can read the whole thing in one request and decide what's worth fetching.
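To make the "map" idea concrete, here is a minimal sketch of how a consumer might read the index. It assumes the categories appear as `## ` headings and that each article is a bullet in the `[title](url): description` shape described above. The sample text, the `example.com` URLs, and the `parse_index` name are illustrative and are not copied from our real file.

```python
import re

# One index bullet: "- [title](url): description" (the description is optional).
ENTRY = re.compile(
    r"^[-*]\s+\[(?P<title>[^\]]+)\]\((?P<url>[^)\s]+)\)(?::\s*(?P<description>.*))?$"
)

def parse_index(text):
    """Group bullet entries under the most recent '## ' heading."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = []
        elif current is not None:
            match = ENTRY.match(line)
            if match:
                entry = match.groupdict()
                entry["description"] = entry["description"] or ""
                sections[current].append(entry)
    return sections

# Hypothetical sample in the llmstxt.org shape, not our real index.
SAMPLE = """# Example Site

> One-line description of the site.

## ai-dev-tools

- [The article title](https://example.com/for-dev/some-slug/): A single-sentence summary.
- [Another title](https://example.com/for-dev/another/)
"""

index = parse_index(SAMPLE)
for category, entries in index.items():
    print(category, [entry["url"] for entry in entries])
```

A client can stop here: with titles, URLs, and summaries in hand, it can decide which articles are worth a second request.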

The full corpus is the territory. At roughly 29,000 lines it concatenates every non-draft post, each prefixed with a small frontmatter block:

---
url: https://pickuma.com/for-dev/some-slug/
title: "The article title"
category: ai-dev-tools
published: 2026-05-30
---

Then the title, the description, and the body — with imports and components removed. A <CtaCard> or <ComparisonTable> in the source becomes nothing in the corpus, because a model doesn't need our Astro components; it needs the sentences around them.
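A rough sketch of that stripping step, in Python rather than the language of our build, might look like the following. It is an approximation under stated assumptions: imports sit at the top of the MDX body, components are capitalized tags, and no component nests inside another of the same name. The regexes, the `strip_mdx` and `corpus_entry` names, and the exact spacing of the output are hypothetical.

```python
import re

# Self-closing component, e.g. <CtaCard href="/tools" />
SELF_CLOSING = re.compile(r"<[A-Z][A-Za-z0-9.]*\b[^>]*/>")
# Paired component, e.g. <ComparisonTable> ... </ComparisonTable>
PAIRED = re.compile(r"<([A-Z][A-Za-z0-9.]*)\b[^>]*>.*?</\1>", re.DOTALL)

def strip_mdx(body):
    """Drop leading import/export lines and capitalized component tags."""
    lines = body.splitlines()
    start = 0
    # Assumption: MDX imports form a run at the very top of the body.
    while start < len(lines) and (
        not lines[start].strip()
        or lines[start].startswith(("import ", "export "))
    ):
        start += 1
    text = "\n".join(lines[start:])
    text = SELF_CLOSING.sub("", text)
    text = PAIRED.sub("", text)
    # Collapse the blank-line gaps that removed components leave behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

def corpus_entry(post):
    """Render one corpus entry: frontmatter block, title, description, body."""
    header = "\n".join([
        "---",
        f"url: {post['url']}",
        f'title: "{post["title"]}"',
        f"category: {post['category']}",
        f"published: {post['published']}",
        "---",
    ])
    body = strip_mdx(post["body"])
    return f"{header}\n\n{post['title']}\n\n{post['description']}\n\n{body}\n"

# Hypothetical post used only for illustration.
post = {
    "url": "https://example.com/for-dev/some-slug/",
    "title": "The article title",
    "category": "ai-dev-tools",
    "published": "2026-05-30",
    "description": "A single-sentence summary.",
    "body": (
        'import CtaCard from "../components/CtaCard.astro";\n\n'
        "First paragraph.\n\n"
        '<CtaCard href="/tools" />\n\n'
        '<ComparisonTable>\n  <Row name="a" />\n</ComparisonTable>\n\n'
        "Second paragraph.\n"
    ),
}
entry = corpus_entry(post)
print(entry)
```

Regex stripping like this is blunt: it would also remove a capitalized tag shown inside a code sample, so a real pipeline may want an MDX-aware parser instead.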

Both files are generated, never hand-edited. A pre-build script reads every .mdx file in the posts directory, parses the frontmatter, skips anything marked draft: true, sorts by publish date, and writes both outputs before Astro copies them into the deployed site. That means they can't drift from what's actually published — every build regenerates them from the same source the live pages render from.
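The parse, filter, and sort steps can be sketched in a few lines. This is a Python illustration of the idea, not the script itself. It assumes flat `key: value` frontmatter and a `published` field holding an ISO date; both are assumptions, and a real build would more likely use a proper YAML parser. The function names are hypothetical.

```python
from datetime import date

def split_frontmatter(source):
    """Return (metadata, body) for a document that opens with a '---' block."""
    lines = source.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, source
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return meta, "\n".join(lines[i + 1:])
        # Split on the first colon only, so URL values keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip().strip("\"'")
    return {}, source  # No closing delimiter: treat everything as body.

def published_posts(sources):
    """Skip drafts, parse publish dates, and sort newest first."""
    posts = []
    for source in sources:
        meta, body = split_frontmatter(source)
        if meta.get("draft", "false").lower() == "true":
            continue
        meta["published"] = date.fromisoformat(meta["published"])
        posts.append({**meta, "body": body.strip()})
    return sorted(posts, key=lambda post: post["published"], reverse=True)

# Hypothetical sources standing in for the contents of .mdx files.
SOURCES = [
    '---\ntitle: "Older post"\npublished: 2026-04-02\n---\nOlder body.\n',
    '---\ntitle: "Unfinished"\npublished: 2026-06-01\ndraft: true\n---\nNot ready.\n',
    '---\ntitle: "Newer post"\npublished: 2026-05-30\n---\nNewer body.\n',
]

posts = published_posts(SOURCES)
for post in posts:
    print(post["published"], post["title"])
```

Because the draft filter runs before anything is written, an unfinished post cannot leak into either output file.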

Because both files are generated from the same content tree, there's no separate "AI version" of the site to maintain and no risk of the machine-readable copy saying something the human-readable pages don't. Publish an article, run the build, and it's in both files automatically.

Why a review site hands its content to models

The obvious objection: if you give your full article text to AI crawlers, won't models answer questions directly, so readers never visit your site? That's a real tension, and it's worth being honest about. We publish anyway, for three concrete reasons.

First, models are already reading the open web — the only question is whether they read a clean version or a guessed-at one. Without llms.txt, a crawler still ingests our pages; it just does so by scraping rendered HTML, stripping navigation and ad slots imperfectly, and sometimes attributing the wrong text to the wrong article. The structured file removes the guesswork. If our content is going to inform an answer, we'd rather it be the accurate version with the right URL attached.

Second, attribution travels with the text. Every entry in both files carries the canonical URL. When an assistant cites a source or a user asks "where did this come from," the link back to Pickuma is right there in the data the model read. Clean source data is the closest thing to a citation guarantee you get in an AI-mediated web.

Third, it matches how we already work. Our editorial standard is that every reviewed tool is tested in real workflows and affiliate disclosures are inline. Publishing the corpus is the same posture applied to machines: here's everything, here's how it's labeled, here's where it lives. A site that hides its content from crawlers while claiming transparency to readers is telling two different stories.

The llms.txt proposal is young and not yet universally honored: plenty of crawlers ignore it, and there's no enforcement layer. We treat it the way we treat a sitemap or an RSS feed, as a well-specified signal that costs us almost nothing to emit because it falls out of the build we already run. If the convention gains traction, we're already compliant. If it doesn't, all we've lost is some generated text.

You can do this for your own site in an afternoon. The recipe is small: walk your content directory, emit a Markdown index that follows the llmstxt.org shape, and optionally emit a second file with full bodies. The only discipline that matters is generating it at build time from your real content so it never goes stale.
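As a starting point, here is a minimal sketch of the index half of that recipe for a generic Markdown site. The conventions in it are assumptions you would adapt: the title is the first `# ` heading, the summary is the first other non-empty line, the top-level folder is the section name, and URLs mirror file paths. The output follows the llmstxt.org outline of an H1 name, a blockquote summary, and `## ` sections of link bullets. All function names are hypothetical.

```python
import tempfile
from pathlib import Path

def read_title_and_summary(path):
    """Assumption: title = first '# ' heading, summary = first other non-empty line."""
    title, summary = None, None
    for line in path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if title is None and line.startswith("# "):
            title = line[2:].strip()
        elif summary is None and line and not line.startswith("#"):
            summary = line
    return title or path.stem, summary or ""

def build_index(content_dir, site_name, site_summary, base_url):
    """Walk content_dir and return llms.txt text grouped by top-level folder."""
    content_dir = Path(content_dir)
    sections = {}
    for path in sorted(content_dir.rglob("*.md")):
        rel = path.relative_to(content_dir)
        # Top-level folder doubles as the section; loose files go under "Pages".
        section = rel.parts[0] if len(rel.parts) > 1 else "Pages"
        title, summary = read_title_and_summary(path)
        url = f"{base_url.rstrip('/')}/{rel.with_suffix('').as_posix()}/"
        sections.setdefault(section, []).append(f"- [{title}]({url}): {summary}")
    out = [f"# {site_name}", "", f"> {site_summary}", ""]
    for section, entries in sections.items():
        out += [f"## {section}", "", *entries, ""]
    return "\n".join(out)

# Demo against a throwaway directory; a real build would point at your content tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "guides").mkdir()
    (root / "guides" / "setup.md").write_text(
        "# Setup guide\n\nHow to install the tool.\n", encoding="utf-8"
    )
    (root / "about.md").write_text("# About\n\nWho runs this site.\n", encoding="utf-8")
    index_text = build_index(root, "Example Site", "One-line description.", "https://example.com")
    (root / "llms.txt").write_text(index_text, encoding="utf-8")

print(index_text)
```

Run something like this as a pre-build step and write the result into whatever directory your site generator copies verbatim into the deployed output.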

What this means if you publish content

If you run a blog, docs, or any content site, the practical takeaway is that machine-readability is a publishing concern today, not something to defer. You don't need to rewrite anything. You need a build step that exposes your existing content in a format a model can fetch and parse, and a decision about whether you want that content read cleanly or scraped messily.

We came down on the side of clean. The bet is that accurate, attributed source data serves us better over time than withholding text that crawlers will collect anyway. Whether that bet pays off depends on how the convention evolves and how models handle attribution, and we control neither. What we control is the quality of what we hand over, and that's the part worth getting right.

Originally published at pickuma.com. Subscribe to the RSS or follow @pickuma.bsky.social for new reviews.