How to Make Your Website AI-Agent Readable in 2026 (llms.txt, MCP Cards, Structured Data)
You ask Perplexity a question about your niche industry. It gives a clean, well-sourced answer, citing three of your competitors. Your site, which has a definitive guide on the exact topic, is nowhere to be seen. You try again with ChatGPT, then Claude. Same result. It feels like being invisible.
This isn't a failure of traditional SEO. Your rankings on Google might be fine. This is a new problem: your website isn't "agent-readable." The large language models (LLMs) that power these AI agents are increasingly the first stop for users seeking information. If they can't parse, understand, and trust your content, you don't exist in this new ecosystem. Getting cited by an AI is becoming the new "page one" ranking.
This guide isn't about "using AI for SEO" fluff. It's a technical, practical manual for founders and operators who manage their own websites. We'll cover the specific file formats, server configurations, and data structures that AI crawlers from OpenAI, Anthropic, Google, and others are looking for right now. This is how you get your data out of your website and into their answers.
Why Agent-Readiness Is the New SEO
For two decades, SEO was about signaling relevance to algorithms like Google's PageRank. Now, we must also signal authority and structure to language models. The goal is different. Instead of just a click, you're aiming to become a citable source in a generated answer. This is a higher bar.
If you check your server logs today, you'll likely find that traffic from known AI crawlers (like GPTBot, ClaudeBot, and PerplexityBot) already makes up a small but growing slice of your traffic. For many sites, this is already in the 1-3% range and is expected to increase significantly. This is the data-gathering phase. The models are actively ingesting the web to train future versions. Being accessible now means you're part of that foundational knowledge.
Traditional SEO focuses on user intent leading to a click. Agent-readiness focuses on machine-readable data that allows an AI to satisfy user intent directly, with your site as a trusted source. The two are not mutually exclusive, but they require different tactics. A keyword-optimized blog post is great for Google Search. A well-structured page with clear JSON-LD, a permissive `robots.txt`, and maybe even an `llms.txt` file is what gets you cited by an AI agent.
The `llms.txt` Specification: A User Manual for Your Site
The `llms.txt` file is a proposal, primarily championed by Anthropic (the makers of Claude), for a standardized way to give instructions to AI models about your site. Think of it as a `robots.txt` for usage policy instead of crawling access. It tells models how they are permitted to use your content in their training and output.
What It Is and Where to Put It
An `llms.txt` file is a plain text file placed in the `/.well-known/` directory of your website. The full path should be `https://yourdomain.com/.well-known/llms.txt`.
The file uses a simple `field: value` format. The key fields currently proposed are:
- User-Agent: Specifies which bot the rules apply to. A `*` applies to all bots. You can also target specific bots like `ClaudeBot`.
- Allow: Specifies directories or pages that are explicitly permitted for use in training generative models.
- Disallow: Specifies directories or pages that are forbidden from being used for training.
- Allow-Citing: A proposed field to explicitly permit the model to cite your content.
A Practical `llms.txt` Example
Here’s a configuration that allows all bots to use most of the site for training, disallows a private `/members/` area, and explicitly allows citing from the `/articles/` directory.
# Default policy for all LLM agents
User-Agent: *
Disallow: /members/
Disallow: /private-data/
# Allow all bots to cite our public articles
User-Agent: *
Allow-Citing: /articles/
# Specific rules for ClaudeBot, if needed
User-Agent: ClaudeBot
Allow: /
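Because the proposed format is just `field: value` text, an agent can parse it in a few lines. Here is a sketch in Python of how a consumer might group the rules by User-Agent block; `parse_llms_txt` is a hypothetical helper written against the proposed layout shown above, not part of any shipped library:

```python
def parse_llms_txt(text):
    """Group llms.txt rules by User-Agent block.

    Hypothetical helper: llms.txt is still a proposal, so this simply
    mirrors the field: value layout in the example above.
    """
    groups = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        if field.lower() == "user-agent":
            # Start a new policy group for this bot.
            current = {"user_agent": value, "rules": []}
            groups.append(current)
        elif current is not None:
            current["rules"].append((field, value))
    return groups
```

An agent could then match its own user-agent string against each group, falling back to the `*` block when no specific entry applies.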
Pros and Cons of `llms.txt`
- Pro: It provides a clear, machine-readable way to state your usage terms. This is much better than burying it in a human-readable "Terms of Service" page that no crawler will ever parse.
- Pro: It's forward-looking. Adopting it now signals that you're an engaged, technically savvy publisher.
- Con: It's still a proposal. There is no guarantee all major AI companies will honor it. OpenAI, for example, currently relies on `robots.txt`. It's a bet on a future standard.
- Con: It adds another configuration file to maintain. For most small sites, though, a simple, permissive file is a set-and-forget task.
JSON-LD: Spoon-Feeding Structured Data to Machines
If you want an AI to understand the meaning of your content, you need to tell it what it's looking at. Is this page a product, an article, or a how-to guide? JSON-LD is a way to embed this structured data directly in your HTML, using the vocabulary from Schema.org.
AI agents, especially those focused on shopping or step-by-step instructions, actively look for this data. It's the difference between them trying to guess your product's price and you telling them directly: `"price": "240"`. You should add the JSON-LD script tag within the `<head>` of your HTML. For most platforms (like WordPress with a plugin), this is handled for you once configured.
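For reference, the embedding itself is just a script tag with the `application/ld+json` type; each of the schema objects shown in the next section goes inside one of these:

```html
<head>
  <script type="application/ld+json">
  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How to Make Your Website AI-Agent Readable"
  }
  </script>
</head>
```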
Key Schemas AI Agents Actually Use
Don't try to implement every schema. Focus on the ones that map to your content and are most valuable to AI agents.
- Article: Essential for any blog post or publication. It clearly defines the author, publication date, headline, and body. This helps agents attribute content correctly.

  {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How to Make Your Website AI-Agent Readable",
    "author": {
      "@type": "Organization",
      "name": "GuardLabs"
    },
    "datePublished": "2024-05-21"
  }
- Product: If you sell anything, this is non-negotiable. It allows agents to pull product names, descriptions, pricing, availability, and reviews into comparison models. This is how you show up in "what's the best tool for X" queries. Our own Website Care plan could be marked up this way.

  {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Website Care Plan",
    "image": "https://guardlabs.online/images/care-icon.png",
    "description": "Annual website maintenance and support.",
    "offers": {
      "@type": "Offer",
      "priceCurrency": "USD",
      "price": "240.00"
    }
  }
- FAQPage: If you have a FAQ, mark it up. AI agents love FAQs because they are pre-packaged question-answer pairs. This makes it trivial for them to use your content to answer a user's question directly.
- HowTo: For step-by-step guides, this schema is perfect. It breaks down the process into discrete steps, which an agent can then re-format and present to a user.
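As an illustration, here is a minimal FAQPage markup with a single question (the question and answer text are placeholders); each FAQ item becomes a `Question` object in the `mainEntity` array:

```json
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is an llms.txt file?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "A proposed plain-text file that states how AI models may use your content."
      }
    }
  ]
}
```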
The main limitation of JSON-LD is that it's only as good as the data you provide. If your schema is incomplete or inaccurate (e.g., the price on the page doesn't match the `price` in the JSON-LD), it can confuse bots or cause them to distrust your site.
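One way to catch that kind of mismatch is to diff the JSON-LD against the rendered page before publishing. Below is a rough sketch in Python; the function name and the regex-based extraction are illustrative only, and a production check would use a real HTML parser:

```python
import json
import re

def jsonld_price_matches_page(html):
    """Check that the Offer price in a page's JSON-LD also appears in the
    visible HTML. A sketch: real pages may format prices differently
    (e.g. "$240.00" vs "240.00"), so treat this as a smoke test."""
    match = re.search(
        r'<script type="application/ld\+json">(.*?)</script>', html, re.S
    )
    if match is None:
        return False  # no structured data at all
    data = json.loads(match.group(1))
    price = data.get("offers", {}).get("price")
    # Compare against the page with the JSON-LD cut out; otherwise the
    # price always "matches" because it appears in the script tag itself.
    visible = html[: match.start()] + html[match.end():]
    return price is not None and price in visible
```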
MCP Cards: A Business Card for Your Server
The Machine-readable Citable Page (MCP) protocol is a newer, more experimental concept. The idea is simple: what if, alongside your human-readable webpage, you provided a simple, structured JSON file that contained all the key citable information? This is an MCP "card."
An AI agent could fetch `https://yourdomain.com/my-article.mcp.json` to get the core facts of your article without having to parse HTML, ads, and navigation menus. This makes their job easier and your data cleaner.
When and How to Publish an MCP Card
You don't need an MCP card for every page. It's most useful for data-rich, citable content like reports, product pages, or reference guides.
To implement it, you create a static JSON file that follows the MCP spec and host it at a predictable URL. A common convention is to append `.mcp.json` to the original URL. You then link to it from your HTML page using a `<link>` tag in the `<head>`.
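Because the MCP card idea is still experimental, there is no fixed schema yet; the field names below are purely illustrative. A hypothetical card for an article might look something like this:

```json
{
  "url": "https://yourdomain.com/my-article",
  "headline": "How to Make Your Website AI-Agent Readable",
  "publisher": "GuardLabs",
  "datePublished": "2024-05-21",
  "keyFacts": [
    "AI crawler traffic is already in the 1-3% range for many sites.",
    "GPTBot, ClaudeBot, and PerplexityBot all honor robots.txt."
  ]
}
```

The point is that everything citable lives in one small, clean payload, with no navigation or ad markup to strip out.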
The Major AI Crawlers

| Bot | Company | Purpose | Honors `robots.txt`? |
| --- | --- | --- | --- |
| GPTBot | OpenAI | Crawls web data to improve future ChatGPT models. | Yes |
| ClaudeBot | Anthropic | Used for training Claude models. | Yes |
| PerplexityBot | Perplexity AI | Crawls the web to find answers for Perplexity's conversational search engine. | Yes |
| Google-Extended | Google | A `robots.txt` token Google uses to control whether content improves Bard/Gemini. Opting out here does not affect Google Search. | Yes |
| CCBot | Common Crawl (non-profit) | Crawls and archives the web; its data is widely used to train many open-source and commercial LLMs. | Yes |
Example `robots.txt` for AI Readiness
A sensible default for most businesses is to allow these bots. If you don't have a `robots.txt` file, create one in the root of your domain. Here is a permissive example:
User-agent: GPTBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Google-Extended
Allow: /
# You might want to disallow CCBot if you are concerned about
# your content being in a public dataset forever.
User-agent: CCBot
Disallow: /
# Keep your existing rules for other bots
User-agent: *
Disallow: /admin
Disallow: /private/
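Before deploying rules like these, you can verify they behave as intended with Python's standard-library `urllib.robotparser`, which evaluates `robots.txt` the way a compliant crawler does. A quick sketch (the rules string condenses the example above):

```python
from urllib.robotparser import RobotFileParser

# Condensed version of the permissive robots.txt example above.
AI_ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /admin
"""

parser = RobotFileParser()
parser.parse(AI_ROBOTS_TXT.splitlines())

# GPTBot may fetch articles; CCBot is blocked site-wide;
# unknown bots fall through to the * rules.
print(parser.can_fetch("GPTBot", "/articles/agent-ready"))
print(parser.can_fetch("CCBot", "/articles/agent-ready"))
print(parser.can_fetch("SomeOtherBot", "/admin"))
```

Running this prints the allow/deny decision for each bot, so you can confirm a rule change before it silently blocks a crawler you wanted.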
The only real "con" to allowing these bots is that they use bandwidth. However, their crawl rate is typically low and shouldn't impact performance for most sites. The bigger risk is being left out by disallowing them.
How to Verify: Are the Bots Actually Reading You?
How do you know if any of this is working? You can't just ask ChatGPT "did you read my site?" Instead, you need to test from the agent's perspective.
- Check Server Logs: This is the ground truth. Filter your server's access logs for the user agents listed in the table above (e.g., `grep "GPTBot" /var/log/nginx/access.log`). If you see entries with a `200 OK` status code, you know they are successfully crawling your pages. If you see `403 Forbidden` or `503 Service Unavailable`, you have a problem.
- Use `curl` to Impersonate a Bot: You can simulate a request from an AI crawler using the command-line tool `curl`. This is great for debugging firewall or CDN issues.

  curl -A "GPTBot" -I https://yourdomain.com/my-article

  The `-A` flag sets the User-Agent string. The `-I` flag just fetches the headers. If you get an `HTTP/2 200` response, the bot can access your site. If you get a `403` or are presented with a CAPTCHA, your security settings are blocking it.
- Prompt Engineering for Citation: After you've confirmed the bots are crawling your site and you've given them a few weeks to ingest the data, you can test for citation. The trick is to ask a question where your site is a uniquely authoritative source. Don't ask "what is a website care plan?" Ask something specific that only your content answers well, like: "According to guardlabs.online, what is included in their Website Care plan?" This forces the model to check its specific knowledge of your domain.
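The server-log check can also be scripted. Here is a sketch in Python (the function name and bot list are illustrative, not a standard tool) that tallies hits per AI crawler and status code from common/combined-format access log lines:

```python
import re
from collections import Counter

# Google-Extended is a robots.txt token, not a log user agent,
# so it is omitted from this list.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

def tally_ai_bot_hits(log_lines):
    """Count (bot, status) hits for known AI crawlers in access log lines."""
    hits = Counter()
    for line in log_lines:
        # The status code is the first 3-digit field after the quoted request.
        match = re.search(r'" (\d{3}) ', line)
        if match is None:
            continue
        for bot in AI_BOTS:
            if bot in line:
                hits[bot, match.group(1)] += 1
    return hits
```

Feed it your log file and look for `403` or `503` entries: those are crawls your firewall or CDN turned away.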
Common Mistakes That Make You Invisible to AI
Many well-intentioned sites accidentally block AI agents or make their content impossible to parse.
- Overzealous Cloudflare Rules: The "Bot Fight Mode" or aggressive "Super Bot Fight Mode" settings in Cloudflare are notorious for blocking legitimate AI crawlers. They see a non-human user agent and present a JavaScript challenge that the bot cannot solve. You must go into your Cloudflare settings and specifically allow the user agents for `GPTBot`, `ClaudeBot`, etc. Cloudflare's new "AI Audit" feature can help identify and allow these bots.
- Content Behind Paywalls or Login Walls: An AI crawler is an unauthenticated user. If your definitive guide is behind a hard paywall or requires a login, the bot will only see the login page. It cannot index what it cannot see. If you run a membership site, consider having public, citable summaries or abstracts.
- Missing Canonical URLs: If you have the same content accessible at multiple URLs (e.g., with and without `www`, or with tracking parameters), you must use the `rel="canonical"` link tag to tell all bots which URL is the master version. Without it, AI models might see your content as duplicate or low-quality.
- Relying on Images or Video for Key Info: LLMs primarily read text. If your product's price, specs, or key features are only available in an image or a video, the AI crawler will miss them. All critical information should exist as plain HTML text on the page.
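The canonical fix from the list above is a one-line tag (the URL is a placeholder for your master version):

```html
<!-- In the <head> of every duplicate variant, point at the master URL -->
<link rel="canonical" href="https://yourdomain.com/my-article" />
```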
Making your site agent-readable isn't a one-time fix; it's a new layer of web maintenance. It requires a shift in thinking from just pleasing human visitors and search engine spiders to also accommodating machine learning models. The sites that do this work now will become the trusted, citable sources for the next generation of search and information discovery.
If you've gone through this guide and feel it's more than you want to manage yourself, this is the kind of deep-dive technical audit we perform. Our Agent-Ready Site audit is a full readiness scan that covers everything mentioned here, from `robots.txt` configuration to JSON-LD validation and firewall rules, to ensure your site is positioned to be a source of truth for AI agents.
Originally published at guardlabs.online. More tooling for indie builders & small agencies — guardlabs.online.