Muhammad Awais

Posted on Jun 10 • Originally published at webtoolshub.online

llms.txt Explained: The robots.txt for AI Search (With Next.js Code)

#seo #nextjs #ai #webdev

You've got robots.txt. You've got sitemap.xml. In 2026, there's a third file you should have at your site root, and most developers still haven't added it.

It's called llms.txt. Here's what it is, why it matters, and how to ship it in your Next.js project today.

What's the Problem It Solves?

AI assistants like ChatGPT, Perplexity, and Claude don't crawl your entire site when answering a query. They have limited context windows and struggle with:

JavaScript-heavy HTML with nav, footers, cookie banners
Sites that return content only after hydration
Pages where the actual content is buried under boilerplate

Result: AI tools describe your product incorrectly, miss your best pages entirely, or just say "I'm not sure."

llms.txt is a plain Markdown file at your site root that gives AI models a curated, machine-readable map of your most important content. Think of it as a cheat-sheet you write for AI, not for humans.

The File Format (Strict Spec)

The spec was proposed by Jeremy Howard (Answer.AI) in September 2024. It is a proposal rather than a ratified standard, but the structure is fixed, so don't improvise it:

# Site Name

> One sentence summarising the site. Optional in the spec, but always include it. Make it count.

## Section Name

- [Page Title](https://yoursite.com/page): One line description of this page.
- [Another Page](https://yoursite.com/other): One line description.

## Data Usage Policy

Real-time AI search and citation is permitted.
AI model training requires written permission.

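Elements:

# H1: your site name, nothing else. This is the only element the spec strictly requires.
> blockquote: a one-sentence elevator pitch immediately after the H1. The spec lists it as optional, but treat it as mandatory in practice.
## H2 sections: group your links logically
- [Title](url): description. The colon and description are optional in the spec, but a bare link tells a model very little, so always add them.

The Data Usage Policy section in the example is a convention rather than part of the spec. It states your intent, and nothing enforces it.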

What to include: 5–15 of your most important pages. Not a sitemap dump. Curate.

File size target: Under 5KB, so the whole file fits comfortably in a limited context window alongside the user's question. Density beats coverage.
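The rules above are mechanical enough to check in CI. Here is a minimal sketch of a linter for them. The function name `lintLlmsTxt` is made up for this example, and the 15-link and 5,000-character limits are this article's recommendations rather than anything the spec enforces.

```typescript
// Hypothetical helper: checks a string against the structure described above.
// A list item must be a Markdown link, optionally followed by ": description".
const LINK_LINE = /^- \[[^\]]+\]\(https?:\/\/[^\s)]+\)(: .+)?$/;

function lintLlmsTxt(text: string): string[] {
  const problems: string[] = [];
  const nonEmpty = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);

  // The first non-empty line must be a Markdown H1.
  if (!/^# \S/.test(nonEmpty[0] ?? "")) {
    problems.push("first line must be an H1 such as '# Site Name'");
  }
  // Recommended: a blockquote summary directly after the H1.
  if (!(nonEmpty[1] ?? "").startsWith("> ")) {
    problems.push("missing '> summary' blockquote after the H1");
  }
  // At least one H2 section should group the links.
  if (!nonEmpty.some((line) => line.startsWith("## "))) {
    problems.push("no '## Section' headings found");
  }
  // Every list item should look like "- [Title](url): description".
  const items = nonEmpty.filter((line) => line.startsWith("- "));
  for (const item of items) {
    if (!LINK_LINE.test(item)) problems.push(`malformed link line: ${item}`);
  }
  if (items.length > 15) problems.push(`too many links: ${items.length}`);

  // Character count is a rough proxy for file size; non-ASCII text takes more bytes.
  if (text.length > 5000) problems.push(`file is ${text.length} characters long`);

  return problems;
}
```

An empty array means the file passed every check. Anything else is a list of human-readable problems you can print or fail a build on.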

How It Fits With robots.txt and sitemap.xml

These three files serve completely different purposes:

| File | Audience | Purpose |
| --- | --- | --- |
| `robots.txt` | All crawlers | Access control: what bots can/can't fetch |
| `sitemap.xml` | Search engines | Discovery: "these URLs exist" |
| `llms.txt` | AI models | Comprehension: "here's what my site is about" |

You need all three. They don't replace each other.

The Training Bot vs Answer Bot Distinction (Critical)

This is where most devs get it wrong. In 2026 there are two fundamentally different types of AI crawlers:

Training Bots (scrape content for model datasets)

GPTBot          → OpenAI model training
Google-Extended → Gemini training
Meta-ExternalAgent → LLaMA training
Applebot-Extended  → Apple Intelligence

Answer Bots (real-time retrieval for live queries)

PerplexityBot  → Perplexity AI answers
OAI-SearchBot  → ChatGPT Search
Claude-SearchBot → Claude search (Anthropic documents ClaudeBot itself as a training crawler)
DuckAssistBot  → DuckDuckGo AI

Blocking training bots = your content doesn't get scraped for model training. Reasonable if you care about IP.

Blocking answer bots = your site won't appear in AI-generated answers. This actively hurts you.

Here's the correct robots.txt setup for "block training, allow answer bots":

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: PerplexityBot
Allow: /

# Anthropic's search crawler; ClaudeBot is documented as its training crawler
User-agent: Claude-SearchBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: DuckAssistBot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yoursite.com/sitemap.xml
Sitemap: https://yoursite.com/llms.txt

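👆 That last line is intentional but non-standard. The Sitemap directive is meant for sitemap files, so a crawler that validates it may ignore a Markdown file. The idea is that crawlers read robots.txt first, so listing llms.txt there gives them one more chance to discover it.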
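If you maintain this policy for more than one site, generating the file from a small config is less error-prone than editing it by hand. This is a sketch only. `RobotsPolicy` and `buildRobotsTxt` are names invented for this example, and the bot names you pass in are your own choice.

```typescript
// Hypothetical policy shape for the "block training, allow answer bots" setup.
interface RobotsPolicy {
  block: string[];    // user agents that get "Disallow: /"
  allow: string[];    // user agents that get an explicit "Allow: /"
  sitemaps: string[]; // absolute URLs for the Sitemap lines
}

function buildRobotsTxt(policy: RobotsPolicy): string {
  // A bot cannot be both blocked and allowed; fail loudly instead of guessing.
  const clash = policy.block.filter((agent) => policy.allow.indexOf(agent) !== -1);
  if (clash.length > 0) {
    throw new Error(`listed as both blocked and allowed: ${clash.join(", ")}`);
  }

  const group = (agent: string, rule: "Allow" | "Disallow"): string =>
    `User-agent: ${agent}\n${rule}: /`;

  const parts = [
    ...policy.block.map((agent) => group(agent, "Disallow")),
    ...policy.allow.map((agent) => group(agent, "Allow")),
    group("*", "Allow"), // default group for every crawler not named above
  ];
  if (policy.sitemaps.length > 0) {
    parts.push(policy.sitemaps.map((url) => `Sitemap: ${url}`).join("\n"));
  }
  // Groups are separated by a blank line; the file ends with a newline.
  return parts.join("\n\n") + "\n";
}

const robotsTxt = buildRobotsTxt({
  block: ["GPTBot", "Google-Extended"],
  allow: ["PerplexityBot", "OAI-SearchBot"],
  sitemaps: ["https://yoursite.com/sitemap.xml"],
});
```

Write `robotsTxt` to `public/robots.txt` in a build step, or return it from a route handler.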

Next.js Deployment — Two Ways

Option 1: Static File in /public (Quickest)

Just drop your generated llms.txt into /public. Next.js serves everything in /public at the root path.

your-project/
├── public/
│   ├── llms.txt    ← here
│   └── robots.txt
├── app/
└── ...

Done. Accessible at https://yoursite.com/llms.txt immediately.

Option 2: Route Handler (Better for Dynamic Sites)

Create app/llms.txt/route.ts:

import { NextResponse } from "next/server";

// Your important pages — update this array when you add new content
const importantPages = [
  {
    title: "Robots.txt & LLMs.txt Generator",
    url: "https://www.webtoolshub.online/tools/robots-txt-llms-txt-generator",
    description:
      "Generate spec-compliant robots.txt and llms.txt files visually with 2026 AI bot presets.",
  },
  {
    title: "JSON to TypeScript Converter",
    url: "https://www.webtoolshub.online/tools/json-to-ts",
    description: "Paste JSON and instantly get accurate TypeScript interfaces.",
  },
];

export async function GET() {
  const links = importantPages
    .map((p) => `- [${p.title}](${p.url}): ${p.description}`)
    .join("\n");

  const content = `# WebToolsHub

> Free developer tools that run entirely in your browser — no data leaves your device.

## Core Tools

${links}

## Data Usage Policy

Real-time AI search and citation is permitted.
AI model training requires written permission.
`;

  return new NextResponse(content, {
    headers: {
      "Content-Type": "text/plain; charset=utf-8",
      "Cache-Control": "public, max-age=86400, stale-while-revalidate=3600",
    },
  });
}

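Because the page list lives in code, the output changes whenever you deploy, so it stays in sync with your actual content. Whether the handler runs once at build time or on every request depends on your Next.js version and route config: GET route handlers were cached by default in Next.js 14 and are dynamic by default in Next.js 15. The stale-while-revalidate directive lets caches that support it serve the stored copy while fetching a fresh one in the background, which matters if your llms.txt is slow to compute.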
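Once you have more than one section, it helps to pull the string building out of the handler into a pure function you can unit test. A minimal sketch follows. The `Page` shape with its `section` field and the `renderLlmsTxt` name are assumptions for this example, not part of Next.js or the llms.txt spec.

```typescript
// Hypothetical page record; `section` decides which H2 the link lands under.
interface Page {
  title: string;
  url: string;
  description: string;
  section: string;
}

function renderLlmsTxt(name: string, summary: string, pages: Page[]): string {
  // Group pages by section, keeping the order in which sections first appear.
  const sections = new Map<string, Page[]>();
  for (const page of pages) {
    const bucket = sections.get(page.section) ?? [];
    bucket.push(page);
    sections.set(page.section, bucket);
  }

  // One "## Heading" block per section, one link line per page.
  const blocks = Array.from(sections.entries()).map(([heading, items]) => {
    const links = items.map((p) => `- [${p.title}](${p.url}): ${p.description}`);
    return `## ${heading}\n\n${links.join("\n")}`;
  });

  // H1, blockquote summary, then the sections, separated by blank lines.
  return [`# ${name}`, `> ${summary}`, ...blocks].join("\n\n") + "\n";
}
```

Inside the route handler you would pass the result to the response in place of the inline template string, with the same headers as before.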

Verifying Your Setup

Step 1: Open a private browser window and visit https://yoursite.com/llms.txt. You should see raw Markdown text — not an HTML page, not a 404.

Step 2: Check https://yoursite.com/robots.txt. Verify the AI bot rules look correct. Check for any accidental Disallow: / under User-agent: * that might be blocking everything.

Step 3: Test actual AI citation — search for your site's main topic on Perplexity and see if you appear and whether the description is accurate.

Common gotcha: If you're on Vercel and have ISR (Incremental Static Regeneration) configured, make sure your llms.txt route isn't accidentally picking up a stale cache from a previous deployment with different content.
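Step 1 is easy to automate so it runs after every deploy. The sketch below keeps the checks in a pure function and takes the fetch function as a parameter so it can be tested without a network. Both function names and the `MinimalResponse` type are made up for this example.

```typescript
// Checks the three things Step 1 looks for: status, content type, raw Markdown body.
function diagnoseLlmsResponse(
  status: number,
  contentType: string | null,
  body: string
): string[] {
  const problems: string[] = [];
  if (status !== 200) problems.push(`expected HTTP 200, got ${status}`);

  const type = (contentType ?? "").toLowerCase();
  if (type.indexOf("text/plain") !== 0 && type.indexOf("text/markdown") !== 0) {
    problems.push(`unexpected Content-Type: ${contentType ?? "none"}`);
  }

  const start = body.trim().toLowerCase();
  if (start.indexOf("<!doctype") === 0 || start.indexOf("<html") === 0) {
    problems.push("body looks like an HTML page, not Markdown");
  } else if (start.indexOf("# ") !== 0) {
    problems.push("body does not start with a '# ' heading");
  }
  return problems;
}

// The subset of a fetch response this check needs.
interface MinimalResponse {
  status: number;
  headers: { get(name: string): string | null };
  text(): Promise<string>;
}

// `fetcher` is injected; in Node 18+ or a browser you can pass the global fetch.
async function checkLlmsTxt(
  origin: string,
  fetcher: (url: string) => Promise<MinimalResponse>
): Promise<string[]> {
  const res = await fetcher(`${origin}/llms.txt`);
  return diagnoseLlmsResponse(res.status, res.headers.get("content-type"), await res.text());
}
```

Calling `checkLlmsTxt("https://yoursite.com", fetch)` in a post-deploy script and failing on a non-empty result would also catch the stale-cache case, as long as you compare against a freshly deployed URL.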

Don't Make These Mistakes

❌ Listing 200+ URLs with no descriptions (not a sitemap)
❌ Vague descriptions: "Free tool for developers"
❌ Missing the blockquote summary after the H1
❌ Using <h1> instead of # markdown heading
❌ Blocking answer bots (PerplexityBot, OAI-SearchBot, Claude-SearchBot) in robots.txt while trying to get AI citations
❌ Setting Crawl-delay: 30, which can make real-time answer bots give up
❌ Different content at /llms.txt vs what's actually on your pages

✅ 5–15 curated pages with specific one-line descriptions
✅ Blockquote is a single sentence — your actual elevator pitch
✅ robots.txt explicitly allows answer bots by User-Agent string
✅ File accessible at the domain root (not /public/llms.txt in the URL)
✅ Supplemental sitemap reference in robots.txt pointing to /llms.txt
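The two robots.txt mistakes in that list can be caught with a small check. This is a simplified sketch, not a full robots.txt parser: it only flags a bare `Disallow: /` and a large `Crawl-delay` in the group that applies to each bot. The function names and the 10-second default threshold are assumptions made for this example.

```typescript
interface RobotsGroup {
  agents: string[];               // lower-cased User-agent values
  rules: Array<[string, string]>; // [field, value] pairs such as ["disallow", "/"]
}

function parseRobots(text: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const raw of text.split("\n")) {
    const line = raw.replace(/#.*$/, "").trim(); // drop comments
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (current === null || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if (field !== "sitemap") {
      // Sitemap lines belong to no group; everything else is a rule.
      if (current !== null) current.rules.push([field, value]);
      lastWasAgent = false;
    }
  }
  return groups;
}

function lintRobots(text: string, answerBots: string[], maxDelay = 10): string[] {
  const groups = parseRobots(text);
  const problems: string[] = [];
  for (const bot of answerBots) {
    const name = bot.toLowerCase();
    // A group naming the bot wins; otherwise fall back to the "*" group.
    const group =
      groups.find((g) => g.agents.indexOf(name) !== -1) ??
      groups.find((g) => g.agents.indexOf("*") !== -1);
    if (!group) continue; // no applicable group, so nothing restricts this bot
    for (const [field, value] of group.rules) {
      if (field === "disallow" && value === "/") {
        problems.push(`${bot} is blocked from the whole site`);
      }
      if (field === "crawl-delay" && Number(value) > maxDelay) {
        problems.push(`${bot} has Crawl-delay ${value}`);
      }
    }
  }
  return problems;
}
```

Pass it the text of your live robots.txt and the list of answer bots you care about. An empty array means none of them hit either mistake.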

Generate Both Files Without Writing Markdown by Hand

If you'd rather not write these manually, I built a free tool for this: Robots.txt & LLMs.txt Generator on WebToolsHub.

What it does:

Generates both files simultaneously in real time as you type
Visual interface for adding pages + descriptions
One-click presets: Allow All / Block Training / Block All AI / Custom
Handles all 7 major AI bot User-Agent strings for 2026
Download both .txt files with a single click
Runs 100% client-side — zero server calls, no signup

Does It Actually Work? (Honest Take)

A 300,000-domain study by SE Ranking found no statistically significant correlation between llms.txt adoption and AI citation frequency as of early 2026. Google's John Mueller has confirmed that major crawlers currently prioritise standard HTML parsing over llms.txt.

So the honest answer is: not proven to directly boost citations yet.

But here's my reasoning for implementing it anyway:

Cost is ~10 minutes. There's no real downside.
Early adopters matter. Anthropic, Stripe, Cloudflare, Cursor, Vercel — these aren't companies that ship things without reason.
The autonomous agent use case is real right now. AI agents that browse on behalf of users need structured context, and llms.txt is the most widely adopted convention for providing it.
Standards follow adoption. Structured data, Core Web Vitals, mobile-first indexing: none of these were "confirmed ranking factors" when the smart teams started implementing them.

The question isn't "will this guarantee results?" It's "why would I skip 10 minutes of work with real upside and zero downside?"

Quick Recap

robots.txt  = access control for crawlers
sitemap.xml = URL discovery for search engines  
llms.txt    = content comprehension for AI models

For Next.js devs: Drop it in /public for a static file, or use a route handler at app/llms.txt/route.ts for dynamic generation.

For the robots.txt configuration: Explicitly allow answer bots (PerplexityBot, Claude-SearchBot, OAI-SearchBot) by User-Agent. Make a separate decision about training bots.

For the file content: Curate 5–15 pages with specific descriptions. Always include the blockquote summary. Keep it under 5KB.

If you want to go deeper on the broader AI SEO picture — how llms.txt fits into a full AEO (Answer Engine Optimization) and GEO (Generative Engine Optimization) strategy — the full guide on WebToolsHub covers that in detail, including the complete verification checklist and how to measure results in Search Console.

Questions or edge cases? Drop them below — happy to help debug any robots.txt / llms.txt setup issues.
