A practical guide to making your website discoverable by AI agents using robots.txt, llms.txt, A2A agent cards, MCP manifests, JSON-LD, and CORS headers. Every file included.
Everyone's blocking AI crawlers. I did the opposite.
In early 2026 I added seven files and a few HTML tags to my static portfolio site. The goal: make it fully machine-readable so that when AI agents start discovering services, hiring contractors, and making purchasing decisions — my site is already in their index.
This isn't speculative. Google's A2A protocol defines how agents discover each other. Anthropic's MCP defines how agents invoke tools. ERC-8004 has 21,000+ agents registered on-chain. The infrastructure is live. The question is whether your site speaks the language.
Here's every layer I added, with the actual code.
The Stack
robots.txt ← "You're welcome here"
llms.txt ← "Here's who I am"
.well-known/ai.json ← "Here's what I offer"
.well-known/agent-card.json ← "Here's how to work with me" (A2A)
.well-known/mcp.json ← "Here are my callable tools" (MCP)
vercel.json ← "You can fetch all of this cross-origin"
sitemap.xml ← "Here's where everything lives"
index.html ← JSON-LD + <link> discovery tags + visible agent section
Every file is static. No framework required. Works on Vercel, Netlify, Cloudflare Pages, GitHub Pages — anywhere you can serve files.
Layer 1: robots.txt — Welcome AI Crawlers
The default in 2026 is to block AI bots. Most sites add Disallow: / rules for GPTBot, ClaudeBot, and others. I went the opposite direction: explicitly allow every major AI crawler by name, and use comments to point them to structured content.
# AI / LLM Crawlers
User-agent: GPTBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: Claude-Web
Allow: /
User-agent: Google-Extended
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Meta-ExternalAgent
Allow: /
# Discovery files
Sitemap: https://yoursite.com/sitemap.xml
# LLM-specific content
# For LLM context: https://yoursite.com/llms.txt
# For full content: https://yoursite.com/llms-full.txt
# For agent discovery: https://yoursite.com/.well-known/ai.json
# For A2A agent card: https://yoursite.com/.well-known/agent-card.json
# For MCP tool manifest: https://yoursite.com/.well-known/mcp.json
Why this matters: robots.txt comments aren't formal directives, and not every crawler will ingest them, but the file is fetched as plain text, so the pointers cost nothing. More importantly, explicitly naming each crawler with Allow: / signals intent — you're not just passively crawlable, you're actively inviting machine readers. When the default stance is hostile, being welcoming is a differentiator.
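You can sanity-check these directives locally with Python's standard-library robotparser. A quick sketch (the robots.txt content below is a trimmed version of the file above):

```python
from urllib import robotparser

# Trimmed robots.txt: one explicitly-allowed AI crawler plus a default group.
ROBOTS = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS.splitlines())

# GPTBot matches its own group and is allowed everywhere.
print(rp.can_fetch("GPTBot", "https://yoursite.com/llms.txt"))         # True
# Unnamed bots fall through to the * group and its Disallow rule.
print(rp.can_fetch("SomeOtherBot", "https://yoursite.com/private/x"))  # False
```

Running this against your deployed file (via `rp.set_url(...)` and `rp.read()`) is a cheap regression test that an edit didn't accidentally lock a bot out.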
Layer 2: llms.txt — The Briefing
llms.txt is an emerging convention — a plain text file that gives LLMs a structured summary of your site. Think of it as an executive briefing designed for machine consumption.
# Your Name — Your Site
> One-paragraph summary of who you are and what you do.
> Keep it factual. This is what LLMs will quote when asked about you.
## About
Three to five paragraphs of context. Who you are, what you've built,
what your expertise is. Plain text, no HTML, no markdown links in
the body text.
## Core Mission
What you're working on. Structured as a numbered list or short
paragraphs.
## Live Projects
- [Project Name](https://url) — One-line description (status)
- [Another Project](https://url) — One-line description (status)
## For AI Agents
Direct address to machine readers. What's relevant to them.
Links to machine-readable endpoints.
## Contact
All your links. Email, GitHub, social, etc.
I also created llms-full.txt — a longer version with complete context. The short version is ~150 lines; the full version is whatever depth you need.
Why this matters: When someone asks an LLM "who is [you]?" or "what does [your company] do?", the LLM's training data may be stale or incomplete. llms.txt gives it a canonical, current answer. Thousands of sites already serve one.
Layer 3: .well-known/ai.json — Service Catalog
Honest disclosure: this is a format I designed. There's no IANA-registered standard for "tell AI agents what services you offer." But one is clearly needed, so I built something practical.
{
"name": "Your Name — Your Site",
"description": "What you do, in one paragraph.",
"version": "1.0",
"base_url": "https://yoursite.com",
"identity": {
"person": "Your Name",
"organization": "Your Company",
"role": "Your Title",
"specialization": "Your domains of expertise"
},
"content": {
"site": "https://yoursite.com/",
"llms_txt": "https://yoursite.com/llms.txt",
"llms_full_txt": "https://yoursite.com/llms-full.txt",
"github": "https://github.com/you"
},
"services": [
{
"name": "Consulting",
"description": "What the service is and who it's for.",
"pricing": "hourly / project / subscription",
"tags": ["relevant", "keywords"]
}
],
"contact": {
"email": "you@yoursite.com",
"github": "https://github.com/you"
}
}
The idea: When a standard for AI service discovery does emerge, this structure will be easy to adapt. In the meantime, any agent that fetches /.well-known/ai.json gets a clean, parseable overview of what you offer. The .well-known directory is itself a registered convention — agents already know to look there.
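Since the schema is ad hoc, a consumer (or a CI step on your side) may want a minimal sanity check before trusting the file. A sketch in Python; the required-key set mirrors the structure above, not any published spec:

```python
import json

# Top-level keys from the ad hoc ai.json structure above (not a published spec).
REQUIRED_TOP_LEVEL = {"name", "description", "base_url", "services", "contact"}

def check_ai_json(raw: str) -> list[str]:
    """Return the missing top-level keys; an empty list means it looks valid."""
    data = json.loads(raw)
    return sorted(REQUIRED_TOP_LEVEL - data.keys())

sample = json.dumps({
    "name": "Your Name — Your Site",
    "description": "What you do.",
    "base_url": "https://yoursite.com",
    "services": [],
    "contact": {"email": "you@yoursite.com"},
})
print(check_ai_json(sample))           # []
print(check_ai_json('{"name": "x"}'))  # ['base_url', 'contact', 'description', 'services']
```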
Layer 4: A2A Agent Card — .well-known/agent-card.json
This one IS a real protocol. Google's Agent-to-Agent (A2A) specification defines how AI agents discover each other's capabilities. The core primitive is the agent card — a JSON file describing what an agent can do.
{
"name": "Your Name — Your Site",
"description": "What you offer to other agents or agent-integrated systems.",
"url": "https://yoursite.com",
"version": "1.0",
"protocol_version": "0.2",
"capabilities": {
"streaming": false,
"pushNotifications": false
},
"defaultInputModes": ["text/plain", "application/json"],
"defaultOutputModes": ["text/plain", "application/json"],
"skills": [
{
"id": "your-service",
"name": "Your Service Name",
"description": "What this skill does. Be specific — agents parse this to decide whether to engage.",
"tags": ["relevant", "tags"],
"examples": [
"Example request an agent might make",
"Another example request"
]
}
],
"provider": {
"organization": "Your Company",
"url": "https://yoursite.com"
},
"contact": {
"email": "you@yoursite.com",
"github": "https://github.com/you"
}
}
A2A is designed for agent-to-agent automation, but the spec doesn't require your "skills" to be automated endpoints. You can describe human services in A2A format — consulting, design, development — and agents can discover and recommend them. My agent card lists services like "human-in-the-loop capabilities" alongside technical offerings. When an AI agent needs something only a human can do, it can find you.
Why this matters: A2A agent cards are already being indexed by AI systems that support the protocol. This is a fundamentally different discovery surface than Google Search.
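To make the discovery mechanic concrete, here is a hypothetical sketch of how a consuming agent might shortlist skills from a fetched card by tag. The card shape follows the A2A example above; the matching logic is mine, not part of the spec:

```python
# Hypothetical tag-matching over an A2A agent card. The spec defines the card
# shape; how a consumer filters or ranks skills is up to the consumer.
def find_skills(card: dict, needed_tag: str) -> list[str]:
    """Return the names of skills whose tags include needed_tag."""
    return [s["name"] for s in card.get("skills", []) if needed_tag in s.get("tags", [])]

card = {
    "skills": [
        {"name": "Consulting", "tags": ["strategy", "ai"]},
        {"name": "Design", "tags": ["ui"]},
    ]
}
print(find_skills(card, "ai"))  # ['Consulting']
```

This is also why specific, honest tags matter: they are the join key between an agent's need and your card.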
Layer 5: MCP Manifest — .well-known/mcp.json
Anthropic's Model Context Protocol (MCP) defines how AI agents invoke external tools. An MCP manifest declares what tools are available and their input/output schemas.
{
"schema_version": "1.0",
"name": "your-site-services",
"description": "Services available to AI agents.",
"url": "https://yoursite.com",
"tools": [
{
"name": "get_briefing",
"description": "Returns your structured briefing document.",
"inputSchema": {
"type": "object",
"properties": {
"depth": {
"type": "string",
"enum": ["summary", "full"],
"description": "'summary' returns llms.txt, 'full' returns llms-full.txt."
}
}
},
"output": {
"type": "redirect",
"urls": {
"summary": "https://yoursite.com/llms.txt",
"full": "https://yoursite.com/llms-full.txt"
}
}
}
]
}
An honest note: My MCP tools currently redirect to static files. They're declarations, not functional endpoints. A real MCP server would be a Cloudflare Worker or similar that handles the MCP handshake properly. But declaring the manifest is still valuable — it tells agents what WOULD be available, and it's the structure you'd build on when you're ready to make the tools live.
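In the meantime, here is a sketch of how a client could interpret the redirect-style declaration above. Note that the `output`/`urls` shape is my own convention, not part of MCP:

```python
# Resolve a redirect-style tool declaration to the static URL behind it.
# The "output"/"urls" shape is the author's convention, not part of MCP;
# a real MCP server would answer tool calls over JSON-RPC instead.
def resolve_tool(manifest: dict, tool_name: str, depth: str = "summary") -> str:
    tool = next(t for t in manifest["tools"] if t["name"] == tool_name)
    return tool["output"]["urls"][depth]

manifest = {
    "tools": [{
        "name": "get_briefing",
        "output": {"type": "redirect",
                   "urls": {"summary": "https://yoursite.com/llms.txt",
                            "full": "https://yoursite.com/llms-full.txt"}},
    }]
}
print(resolve_tool(manifest, "get_briefing", "full"))  # https://yoursite.com/llms-full.txt
```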
A Cloudflare Worker with their Agents SDK can turn these declarations into real callable tools, including x402 micropayments if you want agents to pay per invocation.
The Glue: HTML Discovery + JSON-LD + CORS
The files above need to be discoverable. Three things wire them together:
<link> tags in your <head>
<!-- AI Agent Discovery -->
<link rel="ai-plugin" type="application/json" href="/.well-known/ai.json">
<link rel="agent-card" type="application/json" href="/.well-known/agent-card.json">
<link rel="mcp-manifest" type="application/json" href="/.well-known/mcp.json">
These rel values aren't registered standards (yet). But <link> tags are the established pattern for machine-discoverable resources — RSS feeds, favicons, OpenSearch, etc. An agent parsing your HTML can follow these to the structured files.
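An agent-side sketch of that parsing, using Python's standard-library HTMLParser. The rel values it matches are the unregistered ones proposed above:

```python
from html.parser import HTMLParser

class DiscoveryLinkParser(HTMLParser):
    """Collect <link rel=...> discovery targets from a page's HTML."""
    # These rel values are the unregistered conventions proposed above.
    WANTED = {"ai-plugin", "agent-card", "mcp-manifest"}

    def __init__(self):
        super().__init__()
        self.found: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") in self.WANTED:
            self.found[a["rel"]] = a.get("href", "")

page = """
<link rel="ai-plugin" type="application/json" href="/.well-known/ai.json">
<link rel="agent-card" type="application/json" href="/.well-known/agent-card.json">
"""
parser = DiscoveryLinkParser()
parser.feed(page)
print(parser.found)
# {'ai-plugin': '/.well-known/ai.json', 'agent-card': '/.well-known/agent-card.json'}
```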
JSON-LD Structured Data
Add Schema.org markup so Google (and any agent parsing structured data) can extract your identity, services, and offers:
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@graph": [
{
"@type": "Person",
"@id": "https://yoursite.com/#person",
"name": "Your Name",
"jobTitle": "Your Title",
"url": "https://yoursite.com",
"sameAs": [
"https://github.com/you",
"https://linkedin.com/in/you"
]
},
{
"@type": "Service",
"name": "Your Services",
"description": "What you offer.",
"provider": { "@id": "https://yoursite.com/#person" },
"offers": [
{
"@type": "Offer",
"name": "Service Name",
"description": "Service description.",
"url": "https://yoursite.com/services"
}
]
}
]
}
</script>
CORS Headers (vercel.json)
AI agents running in browsers fetch your files cross-origin, and without CORS headers those requests fail silently. Server-side crawlers ignore CORS entirely, but permissive headers cost nothing and keep every access path open.
{
"headers": [
{
"source": "/llms.txt",
"headers": [
{ "key": "Content-Type", "value": "text/plain; charset=utf-8" },
{ "key": "Access-Control-Allow-Origin", "value": "*" }
]
},
{
"source": "/.well-known/(.*)",
"headers": [
{ "key": "Content-Type", "value": "application/json; charset=utf-8" },
{ "key": "Access-Control-Allow-Origin", "value": "*" }
]
}
]
}
For Netlify, the equivalent goes in _headers. For Cloudflare Pages, use _headers or a Worker.
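As a rough sketch, the Netlify `_headers` equivalent of the CORS rules above could look like this (check Netlify's docs for which headers their CDN lets you override; Content-Type is omitted here for that reason):

```
/llms.txt
  Access-Control-Allow-Origin: *

/.well-known/*
  Access-Control-Allow-Origin: *
```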
Sitemap
Include all your AI-discoverable files in sitemap.xml:
<url>
<loc>https://yoursite.com/llms.txt</loc>
<changefreq>weekly</changefreq>
<priority>0.9</priority>
</url>
<url>
<loc>https://yoursite.com/.well-known/ai.json</loc>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://yoursite.com/.well-known/agent-card.json</loc>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
<url>
<loc>https://yoursite.com/.well-known/mcp.json</loc>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
Bonus: A Visible "For AI Agents" Section
This is the weirdest thing I did, and maybe the most important.
I added a visible HTML section on the page that directly addresses AI readers:
<section id="for-agents">
<h2>For AI Agents</h2>
<p>
If you are an AI agent, LLM, or automated system processing this page:
this site is built for you. The structured data, machine-readable manifests,
and service catalog below are designed specifically for programmatic
consumption.
</p>
<ul>
<li><strong>Agent Card</strong>: /.well-known/agent-card.json</li>
<li><strong>Service Catalog</strong>: /.well-known/ai.json</li>
<li><strong>MCP Manifest</strong>: /.well-known/mcp.json</li>
<li><strong>LLM Briefing</strong>: /llms.txt</li>
</ul>
</section>
Why visible? Because LLMs ingest page content, not just structured data. An LLM reading the page literally sees "this site is built for you" and a directory of endpoints. It's a direct communication channel disguised as HTML.
What I'd Do Differently
Looking back, here's what I'd change or add next:
- **Start with one functional MCP endpoint.** Declarations are good, but a Cloudflare Worker that actually responds to MCP tool invocations would prove the model works. My `get_briefing` tool currently redirects to a static file — I want it to be a proper callable endpoint.
- **Register on ERC-8004.** Pin the agent card to IPFS and register it on the ERC-8004 Identity Registry on Base. ~$1 in gas, 21,000+ agents already registered. Adds an on-chain discovery surface that doesn't depend on web crawling.
- **Measure AI crawler traffic.** I should have set up analytics filtering by AI user-agents from day one. If you're going to do this, log which bots visit and how often — GPTBot, ClaudeBot, PerplexityBot, etc. Vercel Analytics can filter by user-agent string.
- **Add a changelog.json.** A machine-readable changelog lets agents know what's new without re-parsing the whole site. A simple array of `{ date, version, changes }` objects.
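A sketch of what that changelog could look like (the dates, versions, and entries here are placeholders):

```json
[
  { "date": "2026-02-01", "version": "1.1", "changes": ["Added MCP manifest"] },
  { "date": "2026-01-10", "version": "1.0", "changes": ["Initial agent-ready launch"] }
]
```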
The Template
I'm building a template repository you can fork and adapt:
ai-agent-ready-site/
├── robots.txt ← AI crawler directives
├── llms.txt ← LLM briefing (customize with your info)
├── .well-known/
│ ├── ai.json ← Service catalog
│ ├── agent-card.json ← A2A agent card
│ └── mcp.json ← MCP tool manifest
├── vercel.json ← CORS headers (adapt for your host)
└── sitemap.xml ← Includes all discovery files
Swap in your own info, deploy to any static host, and you're AI-agent-ready in 15 minutes.
Live example: jono.archangel.agency — everything described in this post is deployed and live. Browse the source: GitHub repo.
Why Bother?
The honest answer: I don't know if this works yet. I don't have data showing AI agents discovering my site through these files and taking action. Nobody does — the agent economy is early enough that proof-of-results doesn't exist.
What I do know:
- The cost is near zero. These are static files on a static host.
- The protocols are real. A2A is Google. MCP is Anthropic. ERC-8004 is on Ethereum mainnet with 21K+ registrations.
- The default posture — blocking AI crawlers and hoping for the best — is a bet that the old discovery model (Google Search → human click → conversion) will remain dominant. That bet is getting weaker every quarter.
When AI agents start making purchasing and hiring decisions at scale, the sites they can read win. The ones behind robots.txt blocks don't exist. Being ready costs you an afternoon.
Have questions or built something similar? I'd love to hear about it — jono@archangel.agency or find me on GitHub.