Chudi Nnorukam

Posted on • Edited on • Originally published at chudi.dev

robots.txt Is Not Enough for AI Crawlers. You Need llms.txt.



llms.txt is a site-level policy file that tells AI engines how they can use and cite your content. It complements robots.txt by focusing on usage and attribution rather than crawl access. If you want AI systems to cite you correctly, this is the simplest control point.

What is llms.txt?

llms.txt is a plain-text policy file you publish at your site root that tells AI engines (Perplexity, Claude, ChatGPT) how they may use and cite your content. Where robots.txt controls whether a crawler can access your pages, llms.txt governs what the crawler can do with the content it finds: training, answer generation, attribution format, and which sections to exclude from indexing entirely. Publishing it signals explicit consent to AI indexing and helps ensure correct attribution.

In short:

  • robots.txt = "Can you crawl my site?" (access control)
  • llms.txt = "How should you use my content?" (usage policy)

Both should exist on your site. This is part of the broader AI search optimization strategy that helps your content get discovered and cited.


Why llms.txt Matters

The Problem: Content Attribution

When OpenAI's ChatGPT answers a user's question, it synthesizes an answer from multiple sources. But how does it decide whom to cite, and in what format?

Without llms.txt: ChatGPT has to guess your preferred attribution format.

  • Maybe it cites the article title
  • Maybe it cites your domain
  • Maybe it doesn't cite you at all

With llms.txt: You explicitly say "Cite me like this: [Title] by Author"

AI engines follow your preference.

The Bigger Picture

llms.txt emerged in 2024 as a response to AI scraping concerns. Instead of fighting crawlers, creators use llms.txt to:

  1. Invite crawlers — "Please index my content"
  2. Set terms — "But cite me this way"
  3. Exclude content — "Don't train on my drafts"
  4. Provide discovery — "Here's my sitemap and RSS"

It's like putting a "Welcome" sign on your site with conditions attached. This is a foundational piece of what I call Answer Engine Optimization (AEO), the practice of making your content discoverable and citable by AI systems.


How AI Engines Use llms.txt

When a crawler visits your site:

  1. Fetch /robots.txt → Check if allowed to crawl
  2. Fetch /llms.txt → Check usage policy
  3. Fetch /sitemap.xml → Discover all pages
  4. Extract content → Index and train
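The four steps above can be sketched as a preflight function with an injectable fetcher. This is my own illustrative model, not any crawler's published logic; the Fetcher type, the blanket-block check, and the example URLs are all assumptions.

```typescript
// Hypothetical sketch of an AI crawler's startup sequence for one site.
type Fetcher = (url: string) => string | null; // null = 404 / unreachable

function crawlPreflight(origin: string, fetchText: Fetcher) {
  const robots = fetchText(`${origin}/robots.txt`);   // step 1: access control
  const llms = fetchText(`${origin}/llms.txt`);       // step 2: usage policy
  const sitemap = fetchText(`${origin}/sitemap.xml`); // step 3: page discovery
  // Crude blanket-block check: a real parser would match rules per user-agent.
  const blocked = robots !== null && /^Disallow:\s*\/\s*$/m.test(robots);
  return { allowed: !blocked, policy: llms, sitemap };
}
```

A missing llms.txt shows up here as `policy: null`, which is exactly the ambiguity the next paragraph describes: the crawler has access but no usage terms.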

If /llms.txt doesn't exist, the crawler might:

  • Crawl your site anyway (risky for them)
  • Skip your site entirely (loss for you)
  • Use conservative assumptions (minimal indexing)

Having /llms.txt shows you've explicitly consented to AI indexing.

The distinction between "crawl access" and "usage policy" matters more than it sounds. A crawler that can access your page via robots.txt still faces a question: can I train on this content? Can I quote it in an answer? Should I attribute it, and how? Without llms.txt, those questions have no answers. The crawler either guesses conservatively (you get less visibility) or guesses aggressively (you lose attribution control). Neither outcome is what you want.


How to Create llms.txt

Step 1: Location

Create a file at: https://yoursite.com/llms.txt

It must be at the root, not in /content/ or /blog/. Just like robots.txt is at the root. If you're using a static site generator like Next.js, SvelteKit, or Hugo, place the file in your public/ or static/ directory so it gets served at the root path during deployment. For SvelteKit specifically, you can also create a server route at src/routes/llms.txt/+server.ts that returns the content dynamically — useful if you want to auto-generate sections like your sitemap URL or last-updated date from your build config.
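As an illustration of the SvelteKit server-route approach, here is a minimal sketch. The SiteConfig shape and the site values are hypothetical placeholders; in a real project you would import them from your own config.

```typescript
// Hypothetical SvelteKit route: src/routes/llms.txt/+server.ts
// Serves the policy as plain text so crawlers do not parse it as HTML.

interface SiteConfig {
  name: string;
  author: string;
  domain: string;
}

// Assumed values; replace with your own config import.
const site: SiteConfig = {
  name: "Example Blog",
  author: "Your Name",
  domain: "yoursite.com",
};

export function buildLlmsTxt(cfg: SiteConfig): string {
  return [
    `# LLM Content Policy for ${cfg.name}`,
    "",
    "All articles on this site are available for training and search indexing.",
    "",
    "## How to attribute content",
    `[Article Title] — ${cfg.author} on ${cfg.domain}`,
    "",
    "## Content discovery endpoints",
    `- Sitemap: https://${cfg.domain}/sitemap.xml`,
    `- RSS feed: https://${cfg.domain}/rss.xml`,
  ].join("\n");
}

// SvelteKit GET handler; Response is the standard Fetch API class.
export function GET(): Response {
  return new Response(buildLlmsTxt(site), {
    headers: { "Content-Type": "text/plain; charset=utf-8" },
  });
}
```

The Content-Type header matters: without it, some hosts serve the route as text/html, and a crawler may then treat the file as a page rather than a policy document.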

Step 2: Content

Here's a basic template:

# LLM Content Policy for [Your Site]

All articles on this site are available for training and search indexing by large language models.

## How to attribute content

When citing articles from this site, please use the format:

[Article Title] — [Author Name] on [yoursite.com]

Example: "How to Optimize for Perplexity" — Chudi on chudi.dev

## Content discovery endpoints

- Sitemap: https://yoursite.com/sitemap.xml
- RSS feed: https://yoursite.com/rss.xml
- Blog archive: https://yoursite.com/blog

## Content not available for indexing

- Pages marked as draft or private
- Internal documentation
- User-generated content (comments)
- Archived content older than [5] years

## Preferred citation style

Inline: [Article](https://yoursite.com/article-url) by Author Name
Bibliography: Author Name. "Article Title." Your Site, YYYY.

## Questions or Concerns

Email: [your-email@yoursite.com]
Last updated: January 2025

Step 3: Customize for Your Site

Replace:

  • [Your Site] → your actual site name
  • [Author Name] → your name
  • Email → your contact email
  • Dates → today's date

Step 4: Include Metadata

Optionally, you can include a JSON section:

{
  "version": "1.0",
  "license": "CC BY-SA 4.0",
  "attribution_required": true,
  "commercial_use": "allowed",
  "modification": "allowed",
  "sitemap": "https://yoursite.com/sitemap.xml",
  "rss": "https://yoursite.com/rss.xml"
}

This helps AI engines parse your policy programmatically.
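To see what "parse programmatically" could look like, here is a sketch of a crawler-side helper that pulls a JSON block out of an otherwise-markdown llms.txt file. The extraction heuristic and the PolicyMeta shape are my own illustration; llms.txt has no formal grammar that mandates either.

```typescript
// Hypothetical crawler-side helper: find the first JSON object embedded
// in an llms.txt document and parse its machine-readable fields.

interface PolicyMeta {
  version?: string;
  license?: string;
  attribution_required?: boolean;
  sitemap?: string;
}

function extractPolicyMeta(llmsTxt: string): PolicyMeta | null {
  // Grab the first {...} block. This is a heuristic: it assumes a flat
  // JSON object with no nested braces, which matches the template above.
  const match = llmsTxt.match(/\{[\s\S]*?\}/);
  if (!match) return null;
  try {
    return JSON.parse(match[0]) as PolicyMeta;
  } catch {
    return null; // malformed JSON: fall back to the natural-language policy
  }
}
```

The design point is graceful degradation: if the JSON is absent or broken, a crawler can still read the markdown policy, so the structured block only ever adds precision.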


llms.txt vs robots.txt

Aspect            robots.txt               llms.txt
----------------  -----------------------  ------------------------
Purpose           Access control           Usage policy
Audience          Search crawlers          AI engines
Required          Yes (best practice)      No (but recommended)
Format            Plain text directives    Markdown + optional JSON
Location          /robots.txt              /llms.txt
Blocks access     Yes                      No
Legally binding   No                       No (advisory)

robots.txt is like a gate at your property. llms.txt is like a sign on the gate saying "Welcome, but please do X."

The format difference is worth noting. robots.txt uses a strict directive syntax that machines parse rigidly — User-agent, Disallow, Allow. llms.txt uses markdown with optional embedded JSON, which gives you room to express nuance that directive syntax cannot capture. You can explain your attribution preferences in natural language, describe what types of content are available for different use cases, and provide context about your licensing terms. This flexibility is intentional — AI systems are better at parsing natural language than traditional crawlers, so the policy file takes advantage of that capability.

Another key difference: robots.txt is enforced by convention. Well-behaved crawlers respect it. llms.txt is purely advisory — there is no mechanism to force compliance. But the advisory nature is actually a strength for creators who want visibility. You are not blocking anything. You are inviting crawlers in and telling them how to treat your content responsibly. The incentive alignment works because AI engines want to cite correctly — it improves their answer quality — and your llms.txt makes that easy for them.


Common llms.txt Policies

Policy 1: Fully Open (Creator-Friendly)

# LLM Content Policy

All content on this site is available for:
- Training large language models
- Extracting for answer engines
- Commercial and non-commercial use

Just cite us: [Title] — [Author] ([yoursite.com])

Best for: Indie creators who want maximum visibility

Policy 2: Attribution Required (Balanced)

# LLM Content Policy

Content available for training and use, with required attribution.

Required format: [Article Title] by [Author Name] (yoursite.com)

Prohibited use: Removing or hiding attribution

Best for: Most creators who want credit

Policy 3: Non-Commercial Only (Restrictive)

# LLM Content Policy

Content available for non-commercial use and training.

Prohibited use:
- Commercial products without permission
- Training proprietary LLMs
- Republishing without modification

Best for: Creators concerned about exploitation

Policy 4: Permission Required (Most Restrictive)

# LLM Content Policy

All uses require explicit permission. Email [your-email] to request.

Best for: Creators who want full control


Real-World Examples

Example 1: Tech Blog

# LLM Content Policy

Technical articles on this site are available for:
- AI training (open-source and proprietary)
- Answer generation (Perplexity, ChatGPT, Claude)
- Academic and educational use

Citation format: [Title] by [Author] on [yoursite.com]

Prohibited:
- Removing examples or code without attribution
- Training models specifically to replicate this blog

Updated: January 2025

Example 2: Content Creator

# LLM Content Policy

All essays are available for training and synthesis.

Citation: [Essay Title] — [Your Name]

Prefer long-form citations, not snippets.

Excluded:
- Guest posts (ask the author)
- Archived essays older than 3 years

Contact: [email]

Example 3: SaaS Documentation

# LLM Content Policy

Documentation is available for indexing and use in AI tools.

Required attribution: Link to the original docs page + software name.

Prohibited:
- Repackaging docs as your own product
- Training models on raw HTML without attribution

Questions? hello@[company].com

How to Test if llms.txt Works

Method 1: Manual Check

# Verify it exists and is accessible
curl https://yoursite.com/llms.txt

# Should return 200 status code
curl -I https://yoursite.com/llms.txt
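The curl checks can also be scripted. Here is a hypothetical validator for the response; the specific checks mirror this article's recommendations (200 status, plain-text content type, sitemap link, attribution guidance), not any formal spec.

```typescript
// Hypothetical deployment check for an llms.txt response.
// Returns a list of problems; an empty array means the file looks healthy.
function validateLlmsTxt(status: number, contentType: string, body: string): string[] {
  const issues: string[] = [];
  if (status !== 200) issues.push(`expected 200, got ${status}`);
  if (!contentType.startsWith("text/plain")) {
    issues.push(`content type is ${contentType}; serve as text/plain`);
  }
  if (!/sitemap/i.test(body)) issues.push("no sitemap link found");
  if (!/attribut/i.test(body)) issues.push("no attribution guidance found");
  return issues;
}
```

In practice you would feed this from a fetch of https://yoursite.com/llms.txt and fail a CI deploy step whenever the returned array is non-empty.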

Method 2: Check in Perplexity

Search your site name in Perplexity. Are you being cited?

Before llms.txt: Sporadic or no citations
After llms.txt: More consistent citations with proper attribution

Method 3: Monitor Traffic

Track referral traffic from:

  • perplexity.com
  • openai.com
  • anthropic.com

In the weeks after publishing llms.txt, you may see an uptick.


Does llms.txt Actually Matter?

Short answer: Yes, but not as much as robots.txt. It is advisory rather than legally required, yet the major AI engines appear to recognize the file, and sites that publish it tend to see more consistent citations and better crawl coverage.

Longer answer:

  • Required by law: No, it's advisory
  • Followed by all AI engines: Not yet (but major ones do)
  • Necessary for indexing: No, but it helps
  • Better than nothing: Absolutely

Think of it like the difference between:

  • A locked door (robots.txt: blocks crawling)
  • A welcome mat with terms (llms.txt: invites crawling with rules)

You still need the robots.txt. But llms.txt gets you better attribution and signaling.


The Future of llms.txt

There is community discussion about formalizing llms.txt, though neither the IETF nor the W3C has adopted it as a standard yet. If it becomes more official:

  1. AI engines will prioritize crawling sites with llms.txt
  2. LLMs will automatically cite in your preferred format
  3. Licensing and commercial rights will be more enforceable

For now, it's early adoption. But early adopters get:

  • Better attribution from AI engines
  • Clearer signal to crawlers
  • Documented content policy (good for SEO too)

The timing advantage is real. When ChatGPT or Perplexity improves their citation systems — and they will, because user trust depends on source attribution — the sites that already have clear llms.txt policies will be prioritized over sites where the AI engine has to guess. You are building citation infrastructure now that will compound as AI search grows. The same logic applies to structured data and schema markup — the earlier you adopt, the more crawl history and citation data accumulates in your favor.


llms.txt and the Evolving AI Content Ecosystem

The emergence of llms.txt reflects a broader shift in how content creators relate to AI systems. For the past two decades, the relationship between websites and search engines was mediated by a single file: robots.txt. That file was designed for a world where crawlers fetched pages, indexed keywords, and ranked results. The crawler's job was discovery and ranking. The content stayed on your site, and users clicked through to read it.

AI engines fundamentally change this dynamic. When a user asks ChatGPT a question, the AI synthesizes an answer from multiple sources and presents it directly. The user may never visit your site at all. This means the old model of "let crawlers in, they send traffic back" no longer applies cleanly. AI engines consume your content and may deliver the value to the user without a click. llms.txt is the first attempt to create a new social contract for this relationship: you let AI engines use your content, and in return, they attribute it properly so that users know where the information came from and can choose to visit your site for more depth.

This is not a solved problem. The current llms.txt format is informal, advisory, and unevenly adopted. But the direction is clear. As AI search grows — and it is growing rapidly, with Perplexity alone processing millions of queries daily — creators who have clear, machine-readable content policies will be better positioned than those who remain silent. Silence is ambiguous, and ambiguity favors the platform, not the creator.

The practical takeaway is straightforward: spend five minutes creating an llms.txt file today, and you are ahead of ninety-five percent of websites. The standard will evolve, but having any explicit policy is better than having none. You can always update the file as best practices solidify. What you cannot do is retroactively claim attribution for the months when your content was being used without any policy in place.

Checklist: Set Up llms.txt

  • [ ] Create file at /llms.txt
  • [ ] Include attribution format
  • [ ] Link to sitemap.xml
  • [ ] Link to RSS feed
  • [ ] Specify excluded content
  • [ ] Add contact email for questions
  • [ ] Test with curl https://yoursite.com/llms.txt
  • [ ] Announce on Twitter/LinkedIn
  • [ ] Monitor Perplexity citations week 1-4

Measuring llms.txt Impact on Your Content

Setting up llms.txt is one thing. Measuring whether it's actually working is another. The infrastructure is new enough that most analytics tools don't have native llms.txt tracking yet. But you can observe the impact indirectly.

Tracking Citation Changes

Start tracking your current citation baseline BEFORE setting up llms.txt:

  1. Search your site name in Perplexity — note how many results cite you and in what format
  2. Search common queries you'd expect your content to answer (e.g., "ADHD and productivity") — capture screenshots of current attribution
  3. Repeat the same searches weekly for 4 weeks after deploying llms.txt

What you'll likely see: citations move from sporadic to consistent, and the format increasingly matches your preferred citation style. One creator reported a 3x increase in Perplexity citations within 6 weeks of adding llms.txt with explicit citation format instructions.

Server Logs & Referral Traffic

Enable detailed logging for referrals from AI engines. Check your analytics dashboard for:

  • Traffic from perplexity.com referrer (often appears as direct, but you can trace it)
  • Traffic from openai.com (ChatGPT search and plugin references)
  • Traffic from anthropic.com (Claude search, coming in 2026)

Most sites with llms.txt see a small but measurable uptick in "direct" traffic that you can attribute to AI search referrals. The traffic isn't massive (not like Google), but it's concentrated and high-intent — these are people asking AI systems questions and following the generated citations back to your site.

SEO Secondary Effects

llms.txt itself doesn't affect Google ranking. But the behavior it enables does:

  • More inbound links from AI answers → Higher domain authority (slow effect, 3-6 months)
  • Lower bounce rate from AI referrals → Better engagement signal
  • Branded search visibility → People search your name after seeing it cited, improving brand recall

The SEO value isn't direct. It's that llms.txt positions your content better upstream, which flows downstream to traditional SEO.


Common Implementation Gotchas

Gotcha 1: Forgetting the Sitemap Link

Many creators add llms.txt but forget to include the sitemap URL. AI crawlers find pages two ways:

  1. Following links from your homepage
  2. Checking the sitemap (if you link to it)

Without the sitemap link, crawlers discover only top-level pages and pages linked from your nav. Blog archives, portfolio details, and deeply nested pages might be skipped.

Fix: Always include both sitemap.xml and RSS feed URLs in your llms.txt.

Gotcha 2: Making Attribution Too Restrictive

Some creators specify citation format so narrowly that crawlers treat it as legally risky. Example: "Must cite as [Title] by Author or don't use at all."

This is advisory, not legally binding. But crawlers are conservative. If the policy feels overly restrictive, they might skip your site rather than risk a citation violation.

Better approach: Specify preferred format but allow flexibility. Example: "Preferred: [Title] by [Author] on yoursite.com. Acceptable: Article title with link."

Gotcha 3: Conflicting robots.txt and llms.txt

If robots.txt blocks all crawlers with User-agent: * and Disallow: /, then llms.txt becomes useless — you've already said "no crawling allowed."

llms.txt is a refinement on top of robots.txt, not a replacement. The order is:

  1. Check robots.txt → if blocked, stop
  2. Check llms.txt → if explicit policy, follow it
  3. Use default assumptions → treat site as opt-out

Fix: Make sure robots.txt permits crawlers to access the paths you want indexed.
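The precedence described in Gotcha 3 can be condensed into a small decision function. This is my own model of the behavior, since no crawler publishes its exact logic; the inputs and the three outcomes are assumptions based on the ordering above.

```typescript
// Hypothetical model of how an AI crawler combines the two files.
type Decision = "skip" | "follow-llms-policy" | "conservative-default";

function decideUsage(robotsAllows: boolean, hasLlmsTxt: boolean): Decision {
  if (!robotsAllows) return "skip";            // robots.txt wins: no access, llms.txt never read
  if (hasLlmsTxt) return "follow-llms-policy"; // explicit usage terms exist
  return "conservative-default";               // silence: treat the site as opted out-ish
}
```

The first branch is the whole gotcha: a perfect llms.txt is unreachable behind a `Disallow: /`.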

Gotcha 4: Not Versioning or Dating the Policy

llms.txt is new. Standards might change. Crawlers might update. If you never update your llms.txt, you'll fall behind.

At minimum:

  • Add a "Last updated" date (forces you to think about freshness)
  • Include a version number
  • Monitor crawler behavior quarterly

Real-World Adoption: Who's Using llms.txt?

Early adopters include:

  • Tech blogs and documentation sites — Want deep indexing by Claude and ChatGPT for technical Q&A
  • News publishers and media sites — Want proper attribution in AI-generated news summaries
  • Educational platforms — llms.txt gives them granular control over how content is used
  • Creator platforms (Substack, Medium) — Want to compete with Google for AI-driven discovery

Most consumer blogs? They're still ignoring it. Most of the internet doesn't have llms.txt yet. This means adopting early gives you a small competitive advantage — you're signaling to crawlers that you welcome them, while competitors are silent.


How I Implemented llms.txt on My SvelteKit Blog

When I added llms.txt to chudi.dev, I chose the dynamic route approach over a static file. The reason was simple: I wanted the sitemap URL, post count, and last-updated date to stay accurate without manual updates. A static file in the public directory would go stale the moment I published a new post and forgot to update it.

The implementation took about twenty minutes. I created a server route that reads the blog post metadata at build time, counts the total published posts, finds the most recent publication date, and renders the llms.txt content with those values interpolated. The route returns plain text with the correct content type header so crawlers parse it as a text file rather than HTML.
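As a sketch of that build-time interpolation: the Post shape and the helper below are hypothetical stand-ins for the blog's real metadata, but they show how the dynamic fields stay accurate without manual edits.

```typescript
// Hypothetical build-time helper: derive the dynamic fields of llms.txt
// from post metadata so the file never goes stale.

interface Post {
  title: string;
  published: string; // ISO date, e.g. "2025-01-15"
  draft: boolean;
}

function dynamicPolicyFields(posts: Post[]): { postCount: number; lastUpdated: string } {
  const published = posts.filter((p) => !p.draft);
  // ISO dates sort lexicographically, so the last element is the newest.
  const lastUpdated = published.map((p) => p.published).sort().at(-1) ?? "n/a";
  return { postCount: published.length, lastUpdated };
}
```

The route then interpolates these values into the policy text on every build, which is why publishing a new post updates the file with zero extra work.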

The content itself follows the balanced attribution model from the policies section above. I explicitly welcome AI training and answer generation, specify my preferred citation format, link to both the sitemap and RSS feed, and exclude draft posts and archived content older than two years. I also include a machine-readable JSON block with licensing information because some crawlers parse structured metadata more reliably than natural language instructions.

One decision I made early was to include a section listing my content pillars and the types of questions each pillar answers. This gives AI engines a semantic map of my site that goes beyond what a sitemap provides. A sitemap tells a crawler which URLs exist. The pillar section tells it what topics those URLs cover and what kinds of user queries they can answer. This distinction matters because AI engines are not just indexing pages — they are building knowledge graphs, and explicit topic mapping helps them place your content in the right nodes.

After deploying, I tested the implementation by fetching the URL with curl and verifying the content rendered correctly. I also searched my site name in Perplexity and noted the baseline citation format before the llms.txt could take effect. Four weeks later, I repeated the search and found that citations had shifted from inconsistent domain-only references to the preferred format I specified in the file. The sample size was small — about twelve citations across different queries — but the pattern was clear.

The maintenance burden is effectively zero because the route generates dynamically. When I publish a new post, the post count and last-updated date update automatically on the next build. The only manual update I foresee is if the llms.txt standard evolves to include new fields or if I change my licensing terms.

Why Most Sites Get llms.txt Wrong

The most common mistake I see when reviewing other sites' llms.txt implementations is treating it as a legal document rather than a communication tool. Long paragraphs of legalese about intellectual property rights, DMCA provisions, and liability disclaimers miss the point entirely. AI crawlers are not lawyers. They are parsers looking for structured signals about how to handle your content.

The second most common mistake is being too vague. Saying "please attribute properly" without specifying a format gives the AI engine nothing actionable to work with. You need to spell out exactly what a correct citation looks like: the article title, your name, your domain, and ideally a link back to the original URL. The more specific you are, the more likely the AI engine will match your preference.

A third mistake is forgetting to update the file after initial creation. I have seen sites where the llms.txt references a sitemap URL that no longer exists, or lists content categories that were reorganized months ago. Stale metadata is worse than no metadata because it actively misleads crawlers. If you include a last-updated date (and you should), treat it as a commitment to actually review the file when that date approaches.

The fourth and most subtle mistake is not aligning llms.txt with your actual content strategy. If your business model depends on gated content behind a paywall, but your llms.txt says "all content available for training," you are giving away the content you charge for. Conversely, if you are a creator who wants maximum visibility, an overly restrictive llms.txt that requires permission for every use case will reduce your AI search presence. The policy should reflect your actual goals, not a generic template you copied from a tutorial.

What's Next?

Once you have robots.txt and llms.txt set up, focus on:

  1. Structured data (schema.org) — Helps AI parse your content
  2. Content structure (headers, lists) — Makes extraction easier
  3. Freshness (update articles) — Recent content ranks higher
  4. Specificity (answer common questions directly) — Better for AI synthesis

The combination of these creates what I call AEO (Answer Engine Optimization). For a complete walkthrough of these techniques, see my full AEO optimization guide.

Start here: Add llms.txt to your site today. It takes 5 minutes and can improve your visibility in AI search engines.

Then, check your AEO readiness score with SEOAuditLite to see what else needs attention.

Sites with both robots.txt and llms.txt properly configured often see better AI crawl coverage and citation rates within 4-8 weeks of implementation. The infrastructure compounds over time as AI systems return to re-index updated content.
