DEV Community

Cenk KURTOĞLU

I built a batch llms.txt generator (make any site AI-readable, at scale)

robots.txt tells crawlers what they may crawl. llms.txt tells LLMs what your site means.

If you haven't run into it yet: llms.txt is a proposed standard, a single Markdown file served at /llms.txt on your domain root that gives AI models a clean, structured map of your site. It isn't aimed at Googlebot. It's for ChatGPT, Claude, Perplexity, and the growing swarm of AI agents that read the web at inference time.

Two reasons it stopped being a curiosity in 2026:

  1. RAG ate everything. Retrieval-augmented generation is the default architecture for knowledge apps, and llms.txt / llms-full.txt files slot straight into RAG pipelines — pre-structured, dense, low-noise.
  2. Google put it in Lighthouse. As of May 2026, llms.txt is part of Chrome Lighthouse's new Agentic Browsing audit — an AI-readiness check. That's a strong signal it's becoming table stakes, not a nice-to-have.

What the file actually looks like

llms.txt is deliberately simple: an H1 title, a one-line blockquote summary, and a linked index of your key pages under an H2 heading:

```markdown
# Your Product

> One sentence on what this site is and how an LLM should reference it.

## Pages
- [Getting Started](https://example.com/docs): Install and first run in 5 minutes.
- [API Reference](https://example.com/api): Endpoints, auth, rate limits.
- [Pricing](https://example.com/pricing): Plans and limits.
```
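
To make the format concrete, here is a minimal sketch of how a file like this could be rendered from a list of crawled pages. It is not the Actor's actual code: `Page` and `render_llms_txt` are names made up for illustration, and the single "Pages" section simply mirrors the example above.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One crawled page (hypothetical record type, for illustration only)."""
    title: str
    url: str
    description: str = ""


def render_llms_txt(site_title: str, summary: str, pages: list[Page],
                    section: str = "Pages") -> str:
    """Render an llms.txt body: H1 title, blockquote summary, linked index."""
    lines = [f"# {site_title}", "", f"> {summary}", "", f"## {section}"]
    for page in pages:
        entry = f"- [{page.title}]({page.url})"
        # The description is optional; omit the colon when there is none.
        if page.description:
            entry += f": {page.description}"
        lines.append(entry)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    demo = [
        Page("Getting Started", "https://example.com/docs",
             "Install and first run in 5 minutes."),
        Page("Pricing", "https://example.com/pricing", "Plans and limits."),
    ]
    print(render_llms_txt("Your Product", "One sentence on what this site is.", demo))
```

Keeping the renderer as a pure function of page records means the crawl and the file layout can change independently.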

There's a companion llms-full.txt that inlines the full content of each page — the version RAG pipelines actually ingest.
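
On the consuming side, a RAG pipeline typically wants that file split into chunks before embedding. A rough sketch, assuming the file is Markdown in which each page or topic starts with a `#`-style heading (the exact layout of llms-full.txt is an assumption here, and `split_sections` is a hypothetical helper):

```python
import re

# Matches Markdown ATX headings such as "# Title" or "## Section".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown_text: str) -> list[tuple[str, str]]:
    """Split Markdown into (heading, body) pairs, one per heading.

    Text before the first heading is returned with an empty heading.
    Note: this naive version would also split on '#' lines inside code fences.
    """
    sections: list[tuple[str, str]] = []
    heading: str | None = None
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping a blank preamble.
        if heading is not None or any(line.strip() for line in body):
            sections.append((heading or "", "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            heading, body = match.group(2), []
        else:
            body.append(line)
    flush()
    return sections


if __name__ == "__main__":
    sample = "# Site\n\n> Summary\n\n## Getting Started\nInstall it.\n"
    for title, text in split_sections(sample):
        print(repr(title), repr(text))
```

Each `(heading, body)` pair can then be embedded on its own, with the heading kept as metadata for retrieval.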

The gap: doing it for many sites, programmatically

There are nice one-off web tools where you paste a URL and get a file. Great for a single site. But if you're an agency (every client needs one), building a RAG pipeline (dozens of sources), or running this on a schedule, you want an API, batch runs, and per-use pricing — not a copy-paste UI.

So I built an Apify Actor that does exactly that:

llms.txt Generator on Apify →

  • Point it at a URL → it crawls same-domain pages (up to your limit).
  • Extracts titles, meta descriptions, and clean main content (<article>/<main> aware, strips nav/footer/scripts).
  • Handles SPAs by reading the sitemap for full coverage.
  • Outputs llms.txt + llms-full.txt to the run's key-value store.
  • API-callable and batch-friendly — generate for 200 client sites in one workflow.
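
The extraction step in the list above can be approximated with the Python standard library alone. This is a rough sketch of the idea, not the Actor's real implementation: `PageExtractor` and `extract_page` are hypothetical names, and the set of skipped tags is my own guess at a reasonable default. Pages without an `<article>` or `<main>` element yield empty text in this version.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect <title>, the meta description, and text inside <article>/<main>."""

    SKIP = {"nav", "footer", "script", "style"}   # assumed noise containers
    MAIN = {"article", "main"}

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.description = ""
        self.chunks: list[str] = []
        self._in_title = False
        self._skip_depth = 0
        self._main_depth = 0

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attr.get("name") or "").lower() == "description":
            self.description = (attr.get("content") or "").strip()
        elif tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.MAIN:
            self._main_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.MAIN and self._main_depth:
            self._main_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._main_depth and not self._skip_depth:
            # Keep only text that sits inside main content and outside noise.
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_page(html: str) -> dict[str, str]:
    """Return title, description, and main text for one HTML document."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": parser.title.strip(),
        "description": parser.description,
        "text": "\n".join(parser.chunks),
    }
```

A production crawler would more likely use a real DOM library and a fallback for pages with no semantic main element; the depth counters here are just the smallest thing that handles nesting.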

Input is just:

```json
{ "websiteUrl": "https://example.com", "maxPages": 100, "includeFullText": true }
```
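
For batch runs, that input can be posted to Apify's run-Actor REST endpoint (`POST /v2/acts/{actorId}/runs`) once per site. The sketch below only builds the requests and sends nothing. The Actor ID is a placeholder you would replace with the real one from the Actor's page, and `build_run_request` / `build_batch` are hypothetical helper names.

```python
import json
import urllib.request

APIFY_RUNS_URL = "https://api.apify.com/v2/acts/{actor_id}/runs"


def build_run_request(actor_id: str, token: str, website_url: str,
                      max_pages: int = 100,
                      include_full_text: bool = True) -> urllib.request.Request:
    """Build (but do not send) a request that starts one Actor run."""
    payload = {
        "websiteUrl": website_url,
        "maxPages": max_pages,
        "includeFullText": include_full_text,
    }
    return urllib.request.Request(
        APIFY_RUNS_URL.format(actor_id=actor_id),
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def build_batch(actor_id: str, token: str, urls: list[str],
                max_pages: int = 100) -> list[urllib.request.Request]:
    """One request per distinct, non-empty site URL, in input order."""
    seen: set[str] = set()
    requests: list[urllib.request.Request] = []
    for url in urls:
        url = url.strip()
        if not url or url in seen:
            continue
        seen.add(url)
        requests.append(build_run_request(actor_id, token, url, max_pages))
    return requests


# Usage (placeholder Actor ID and token, both assumptions):
# for req in build_batch("your-username~your-actor", "APIFY_TOKEN", client_sites):
#     with urllib.request.urlopen(req) as resp:
#         run = json.load(resp)
```

Separating request construction from sending makes it easy to add rate limiting or retries around the loop without touching the payload logic.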

Then you drop the generated llms.txt at your domain root, same convention as robots.txt.

It's fully open-source too, if you'd rather read or run the crawler yourself: github.com/cekuu35/llms-txt-generator

Should you add an llms.txt to your site?

If you have docs, a marketing site, or anything you'd want an AI assistant to summarize correctly: yes. It's low-effort, it's becoming a readiness signal, and it costs you nothing to be the site an LLM can actually parse instead of guessing.

If you try it on a big site or wire it into a pipeline, I'd genuinely like to hear what breaks — drop a comment and I'll take a look.
