How robots.txt and sitemap.xml quietly control your site's relationship with the internet.
While going down an SEO rabbit hole one evening, I came across something I had completely overlooked — two small, plain-text files that sit quietly at the root of nearly every website on the internet. Old or new. Big or small. Built on WordPress or hand-written in vanilla HTML.
robots.txt and sitemap.xml.
The story of these files goes back further than most people realize. In the early days of the web — mid-1990s — search engine bots had no rules. They crawled everything, followed every link, hammered servers, and indexed pages that site owners never intended to be public. It was chaos. So in 1994, a developer named Martijn Koster proposed a simple, voluntary agreement: websites would place a plain text file at their root, and crawlers would read it before doing anything. That file was robots.txt — and it became an informal standard almost overnight.
sitemap.xml came later, in the early 2000s, when the web exploded in complexity. Crawlers were getting better, but they were still missing deeply nested pages and dynamically generated URLs. Google introduced the sitemap format to let site owners tell crawlers exactly what existed — rather than waiting for them to discover it.
Two files. Decades of web history. And most of us either set them up wrong or skip them entirely.
This post is everything I wish I had found in one place.
First, a Mental Model
Think of your website as an office building.
robots.txt is the security guard at the front desk — it tells visiting bots (Googlebot, Bingbot, GPTBot, you name it) what they're allowed to access and what's off-limits.
sitemap.xml is the floor directory on the wall — it tells those same bots exactly what rooms (pages) exist, where they are, and how important each one is.
Neither file is magic. But without them, you're expecting search engines to figure out a maze blindfolded.
robots.txt — The Bouncer Your Site Deserves
What it is
A plain text file that lives at the root of your domain: https://yourdomain.com/robots.txt
It follows a simple format:
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
Sitemap: https://yourdomain.com/sitemap.xml
```
Breaking it down
| Directive | What it does |
|---|---|
| `User-agent: *` | Applies the rules to ALL bots |
| `User-agent: Googlebot` | Applies rules only to Google's crawler |
| `Disallow: /path/` | Tells bots: stay out of here |
| `Allow: /path/` | Explicitly permits access (useful to override a broader `Disallow`) |
| `Sitemap:` | Points bots directly to your sitemap — gold. |
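If you want to sanity-check rules like these before deploying them, Python's standard library ships a robots.txt parser that implements the same basic matching well-behaved crawlers use. A small sketch, using the example rules from this post (the domain and bot name are just illustrations):

```python
# Sanity-check robots.txt rules locally with the stdlib parser.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) answers: may this bot crawl this URL?
print(parser.can_fetch("Googlebot", "https://yourdomain.com/"))         # True
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/x"))  # False
```

Handy for catching an overly broad `Disallow` before it quietly deindexes half your site.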
Real-world patterns
Block everything (staging/dev environments):
```
User-agent: *
Disallow: /
```
Use this on staging.yoursite.com. You do NOT want Google indexing your half-finished pages.
Block AI scrapers specifically:
```
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /
```
Yes, AI companies have their own crawlers. Yes, you can block them. Whether you should is your call.
Protect the back-end, expose the front:
```
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Allow: /
```
Classic WordPress setup. No reason for Google to index your login page.
The golden rule people forget
robots.txt is a request, not a lock.
Ethical crawlers (Google, Bing, etc.) will respect it. Malicious scrapers won't. If you have genuinely sensitive data, don't rely on robots.txt — use authentication.
Also: listing a URL in Disallow does NOT stop Google from knowing the URL exists if another page links to it. It just stops Googlebot from crawling it. Subtly different.
sitemap.xml — Your Site's CV for Search Engines
What it is
An XML file that lists every page you want indexed, along with optional metadata: when it was last modified, how often it changes, and its priority relative to other pages.
Bare minimum sitemap:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/</loc>
  </url>
  <url>
    <loc>https://yourdomain.com/about/</loc>
  </url>
  <url>
    <loc>https://yourdomain.com/blog/my-first-post/</loc>
  </url>
</urlset>
```
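Writing these entries by hand gets old fast. Here's a minimal generator sketch using only Python's standard library; the page list is a placeholder for whatever your CMS or database returns:

```python
# Build a minimal sitemap.xml string from a list of absolute URLs.
import xml.etree.ElementTree as ET

def build_sitemap(pages):
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="unicode", xml_declaration=True)

# Placeholder pages; in practice, query your CMS or database.
print(build_sitemap([
    "https://yourdomain.com/",
    "https://yourdomain.com/about/",
]))
```

Using an XML library instead of string concatenation also gets you escaping for free, so URLs containing `&` won't silently produce invalid XML.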
Full-featured entry:
```xml
<url>
  <loc>https://yourdomain.com/blog/robots-and-sitemaps/</loc>
  <lastmod>2026-04-15</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>
```
The optional fields, honestly rated
| Field | Honest usefulness |
|---|---|
| `<lastmod>` | ✅ Actually useful. Tells Google when you updated content. Use it. |
| `<changefreq>` | ⚠️ Mostly ignored by Google in practice. Include it anyway for other crawlers. |
| `<priority>` | ⚠️ Google largely ignores it too. But it's a signal, however faint. |
Sitemap index files
Got a large site? A single sitemap tops out at 50,000 URLs and 50 MB uncompressed. Past that, use a sitemap index:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```
Point Google Search Console to the index file — it handles the rest.
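Chunking a big URL list into compliant files is mechanical enough to automate. A sketch (the file naming scheme and domain are placeholders):

```python
# Split URLs into 50,000-entry chunks and build an index that lists them.
def build_sitemap_index(urls, base="https://yourdomain.com", limit=50_000):
    chunks = [urls[i:i + limit] for i in range(0, len(urls), limit)]
    entries = "\n".join(
        f"  <sitemap><loc>{base}/sitemap-{n}.xml</loc></sitemap>"
        for n in range(1, len(chunks) + 1)
    )
    index = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries
        + "\n</sitemapindex>"
    )
    # Write each chunk out as its own sitemap-N.xml file alongside the index.
    return index, chunks
```

With 120,000 URLs this yields three chunk files plus one index referencing them.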
Specialized sitemaps
Don't sleep on these:
- Image sitemaps — Helps Google index your images in search
- Video sitemaps — Required for rich video results
- News sitemaps — For publishers wanting Google News inclusion
The Connection Between the Two
Here's the part most tutorials skip:
Your robots.txt should reference your sitemap.
```
User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
```
That last line? It means any crawler that reads your robots.txt — even if they weren't specifically looking for a sitemap — now knows exactly where to find all your content. It's a two-for-one.
And the inverse: don't list URLs in your sitemap that you've disallowed in robots.txt. You're sending conflicting signals — "here's a page I want indexed" + "don't crawl this page." Google will be confused. Confused Google = bad rankings.
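This conflict is easy to catch automatically. A sketch using the stdlib robots.txt parser (the rules, URLs, and agent name are illustrative); anything it returns should be pulled from your sitemap or unblocked:

```python
# Flag sitemap URLs that robots.txt would block for a given crawler.
from urllib.robotparser import RobotFileParser

def conflicting_urls(robots_txt, sitemap_urls, agent="Googlebot"):
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in sitemap_urls if not parser.can_fetch(agent, u)]

robots = "User-agent: *\nDisallow: /admin/\n"
pages = ["https://yourdomain.com/", "https://yourdomain.com/admin/panel"]
print(conflicting_urls(robots, pages))  # ['https://yourdomain.com/admin/panel']
```

Run it as part of your build and a conflicting signal never ships.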
Framework Quickstart
Most frameworks handle this with plugins or built-in support:
| Framework / Tool | Sitemap | robots.txt |
|---|---|---|
| Next.js 13+ | `app/sitemap.ts` (built-in) | `app/robots.ts` (built-in) |
| Nuxt 3 | `@nuxtjs/sitemap` module | `nuxt-simple-robots` module |
| Astro | `@astrojs/sitemap` integration | Manual or `astro-robots-txt` |
| WordPress | Yoast SEO / Rank Math | Yoast SEO / Rank Math |
| Hugo | Built-in | Manual in `static/robots.txt` |
| Gatsby | `gatsby-plugin-sitemap` | `gatsby-plugin-robots-txt` |
If you're rolling a custom backend, generate the sitemap dynamically from your database/CMS and serve it at /sitemap.xml. Most frameworks make this straightforward.
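To make that concrete, here's a bare-bones dynamic endpoint using only Python's standard library. A real site would use its framework's routing instead, and `current_pages()` is a stand-in for a database or CMS query:

```python
# Serve a freshly generated sitemap at /sitemap.xml on every request.
from http.server import BaseHTTPRequestHandler, HTTPServer

def current_pages():
    # Stand-in for your database/CMS query.
    return ["https://yourdomain.com/", "https://yourdomain.com/about/"]

class SitemapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/sitemap.xml":
            self.send_error(404)
            return
        entries = "".join(f"<url><loc>{u}</loc></url>" for u in current_pages())
        body = (
            '<?xml version="1.0" encoding="UTF-8"?>'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
            + entries + "</urlset>"
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run locally:
#   HTTPServer(("", 8000), SitemapHandler).serve_forever()
```

The point of generating on request (or on a schedule) is that new pages show up in the sitemap the moment they exist, with no manual regeneration step to forget.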
Submitting to Search Console (Don't Skip This)
Waiting for Google to find your sitemap organically can take weeks. Do this instead:
- Go to Google Search Console
- Select your property
- Left sidebar → Sitemaps
- Enter `sitemap.xml` → Submit
You'll see index status, errors, and how many pages Google's actually processed. Check it once a month. Fix errors promptly.
Common Mistakes (I've Made Most of These)
🚫 Blocking CSS/JS files in robots.txt — Google needs to render your pages. If you block your stylesheets, Googlebot sees a broken page.
🚫 Including noindex pages in your sitemap — If a page has <meta name="robots" content="noindex">, don't put it in the sitemap. Contradictory signals.
🚫 Hardcoding www vs non-www inconsistently — Your sitemap URLs should match your canonical domain. All https://domain.com or all https://www.domain.com. Not a mix.
🚫 Forgetting to update the sitemap — If you add new pages but never regenerate the sitemap, those pages wait in the dark. Automate this.
🚫 Using relative URLs in sitemap — Every <loc> must be an absolute URL with the full scheme: https://yourdomain.com/page/ not /page/.
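The last two mistakes are mechanically checkable. A small lint sketch over a sitemap's URL list (the URLs are illustrative):

```python
# Flag relative URLs and mixed hosts in a list of sitemap <loc> values.
from urllib.parse import urlparse

def lint_sitemap_urls(urls):
    problems = []
    hosts = set()
    for u in urls:
        parsed = urlparse(u)
        if not parsed.scheme or not parsed.netloc:
            problems.append(f"relative URL: {u}")
        else:
            hosts.add(parsed.netloc)
    if len(hosts) > 1:
        problems.append(f"mixed hosts: {sorted(hosts)}")
    return problems

print(lint_sitemap_urls([
    "https://yourdomain.com/",
    "/blog/post/",                         # relative: flagged
    "https://www.yourdomain.com/about/",   # different host: flagged
]))
```

Wire something like this into the same script that generates the sitemap and the whole class of mistake disappears.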
Things Most People Don't Know (The Interesting Bits)
This is the part I find genuinely fascinating — the stuff that doesn't show up in beginner SEO tutorials.
robots.txt has its own crawl budget implications.
Google has a limited "crawl budget" per site — a rough cap on how many pages it will crawl in a given period. For small sites this rarely matters. For large sites (thousands of pages), a poorly written robots.txt that doesn't block useless URLs (faceted navigation, session IDs, duplicate filtered pages) can eat your crawl budget on junk pages, leaving your important content under-crawled.
Google publicly documents its robots.txt parsing rules — and they're stricter than the original spec.
The 1994 robots.txt convention was informal and never an official standard. Google published its own formal specification, open-sourced its parser, and in 2022 those rules were standardized as RFC 9309. One non-obvious rule: if your robots.txt file returns a 5xx server error, Googlebot temporarily treats the entire site as disallowed and stops crawling until it recovers. A broken server = Google treating your site as fully blocked.
noindex in HTTP headers works too — no HTML needed.
Most people know <meta name="robots" content="noindex"> in HTML. Fewer know you can send the same instruction as an HTTP response header: X-Robots-Tag: noindex. This is the only way to noindex non-HTML files like PDFs, since they have no <head> tag. Very useful for documentation or internal files served publicly.
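On the crawler side, reading that header is a one-liner. A sketch that handles the common comma-separated form (it ignores bot-scoped variants like `googlebot: noindex`, which a full implementation would also parse):

```python
# Decide whether response headers mark a resource noindex via X-Robots-Tag.
def is_noindexed(headers):
    value = headers.get("X-Robots-Tag", "")
    directives = {d.strip().lower() for d in value.split(",")}
    return "noindex" in directives

print(is_noindexed({"X-Robots-Tag": "noindex, nofollow"}))  # True
print(is_noindexed({"Content-Type": "application/pdf"}))    # False
```

Setting the header server-side is typically one config line, e.g. `Header set X-Robots-Tag "noindex"` with Apache's mod_headers or `add_header X-Robots-Tag "noindex";` in nginx.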
Sitemaps have exposed URLs their owners never meant to publish.
This one's more trivia than actionable — but search engines have found accidentally public sitemaps on misconfigured servers that listed URLs the site owner never intended to be public. A sitemap is essentially a complete map of your site handed directly to any crawler that asks. Be intentional about what goes in it.
The robots.txt file is publicly readable by anyone — always.
It cannot be password protected (that would defeat its purpose). This means anyone can visit yourdomain.com/robots.txt and see exactly which paths you're trying to hide. Security researchers and curious people do this routinely. Don't rely on Disallow to obscure sensitive directory names — it's an announcement, not a curtain.
Google can index a page it has never crawled — if enough other sites link to it.
This means Disallow in robots.txt doesn't guarantee a page won't appear in search results. It just prevents Googlebot from reading the content. The URL can still show up as a "known but uncrawled" result. To truly remove a page from Google, you need noindex on the page itself (or use the URL removal tool in Search Console).
TL;DR
- `robots.txt` → tells crawlers what to skip
- `sitemap.xml` → tells crawlers what to index
- Always link your sitemap from robots.txt
- Submit to Google Search Console manually — don't wait
- Keep them consistent with each other
Two files. One at the root of your domain. They take 20 minutes to set up properly and can meaningfully change how search engines see your site.
Every Website Has One. Go Check.
Here's something I want you to do right now.
Open a new tab and visit: youtube.com/robots.txt
You'll see hundreds of lines — specific bots being blocked, crawl-delay rules, dozens of sitemap references for different content categories. YouTube is one of the most visited sites on the planet, and they have a meticulously maintained robots.txt.
Now try it on any website you use regularly. Just append /robots.txt to any domain.
chatgpt.com/robots.txt. github.com/robots.txt. reddit.com/robots.txt. amazon.com/robots.txt.
Every single one has it. A startup's landing page. A government website. A 1999-era forum that hasn't been updated in a decade. This is what a 30-year-old voluntary internet standard looks like when it works — so universally adopted that it's basically invisible.
robots.txt isn't just a config file. It's a piece of living web history.
Tried it on a site and found something interesting in their robots.txt? Drop it in the comments.