Sanchit Dikshit
The Two UNDERRATED Files Every Website Needs

How robots.txt and sitemap.xml quietly control your site's relationship with the internet.


While going down an SEO rabbit hole one evening, I came across something I had completely overlooked — two small, plain-text files that sit quietly at the root of nearly every website on the internet. Old or new. Big or small. Built on WordPress or hand-written in vanilla HTML.

robots.txt and sitemap.xml.

The story of these files goes back further than most people realize. In the early days of the web — mid-1990s — search engine bots had no rules. They crawled everything, followed every link, hammered servers, and indexed pages that site owners never intended to be public. It was chaos. So in 1994, a developer named Martijn Koster proposed a simple, voluntary agreement: websites would place a plain text file at their root, and crawlers would read it before doing anything. That file was robots.txt — and it became an informal standard almost overnight.

sitemap.xml came later, in the early 2000s, when the web exploded in complexity. Crawlers were getting better, but they were still missing deeply nested pages and dynamically generated URLs. Google introduced the sitemap format to let site owners tell crawlers exactly what existed — rather than waiting for them to discover it.

Two files. Decades of web history. And most of us either set them up wrong or skip them entirely.

This post is everything I wish I had found in one place.


First, a Mental Model

Think of your website as an office building.

robots.txt is the security guard at the front desk — it tells visiting bots (Googlebot, Bingbot, GPTBot, you name it) what they're allowed to access and what's off-limits.

sitemap.xml is the floor directory on the wall — it tells those same bots exactly what rooms (pages) exist, where they are, and how important each one is.

Neither file is magic. But without them, you're asking search engines to navigate a maze blindfolded.


robots.txt — The Bouncer Your Site Deserves

What it is

A plain text file that lives at the root of your domain: https://yourdomain.com/robots.txt

It follows a simple format:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

Breaking it down

Directive                 What it does
User-agent: *             Applies the rules to ALL bots
User-agent: Googlebot     Applies rules only to Google's crawler
Disallow: /path/          Tells bots: stay out of here
Allow: /path/             Explicitly permits access (useful to override a broader Disallow)
Sitemap:                  Points bots directly to your sitemap — gold.

Real-world patterns

Block everything (staging/dev environments):

User-agent: *
Disallow: /

Use this on staging.yoursite.com. You do NOT want Google indexing your half-finished pages.

Block AI scrapers specifically:

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

Yes, AI companies have their own crawlers. Yes, you can block them. Whether you should is your call.

Protect the back-end, expose the front:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Disallow: /cart/
Disallow: /checkout/
Allow: /

Classic WordPress setup. No reason for Google to index your login page.

The golden rule people forget

robots.txt is a request, not a lock.

Ethical crawlers (Google, Bing, etc.) will respect it. Malicious scrapers won't. If you have genuinely sensitive data, don't rely on robots.txt — use authentication.

Also: listing a URL in Disallow does NOT stop Google from knowing the URL exists if another page links to it. It just stops Googlebot from crawling it. Subtly different.
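You can see exactly how a well-behaved crawler interprets these rules using Python's standard-library robots.txt parser. The rules string and URLs below are illustrative, not from a real site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, matching the example earlier in the post
rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# A polite bot asks before fetching; a malicious scraper simply doesn't.
print(parser.can_fetch("*", "https://yourdomain.com/admin/settings"))  # False
print(parser.can_fetch("*", "https://yourdomain.com/blog/post"))       # True
```

Note that `can_fetch` returning False only means a compliant bot will skip the fetch — nothing enforces it.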


sitemap.xml — Your Site's CV for Search Engines

What it is

An XML file that lists every page you want indexed, along with optional metadata: when it was last modified, how often it changes, and its priority relative to other pages.

Bare minimum sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://yourdomain.com/</loc>
  </url>
  <url>
    <loc>https://yourdomain.com/about/</loc>
  </url>
  <url>
    <loc>https://yourdomain.com/blog/my-first-post/</loc>
  </url>
</urlset>

Full-featured entry:

<url>
  <loc>https://yourdomain.com/blog/robots-and-sitemaps/</loc>
  <lastmod>2026-04-15</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>

The optional fields, honestly rated

Field          Honest usefulness
<lastmod>      ✅ Actually useful. Tells Google when you updated content. Use it.
<changefreq>   ⚠️ Mostly ignored by Google in practice. Include it anyway for other crawlers.
<priority>     ⚠️ Google largely ignores it too. But it's a signal, however faint.

Sitemap index files

Got a large site? You can't cram everything into one file — the spec caps each sitemap at 50,000 URLs (and 50 MB uncompressed). Use a sitemap index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://yourdomain.com/sitemap-posts.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-pages.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://yourdomain.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>

Point Google Search Console to the index file — it handles the rest.

Specialized sitemaps

Don't sleep on these:

  • Image sitemaps — Helps Google index your images in search
  • Video sitemaps — Required for rich video results
  • News sitemaps — For publishers wanting Google News inclusion

The Connection Between the Two

Here's the part most tutorials skip:

Your robots.txt should reference your sitemap.

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml

That last line? It means any crawler that reads your robots.txt — even if they weren't specifically looking for a sitemap — now knows exactly where to find all your content. It's a two-for-one.

And the inverse: don't list URLs in your sitemap that you've disallowed in robots.txt. You're sending conflicting signals — "here's a page I want indexed" + "don't crawl this page." Google will be confused. Confused Google = bad rankings.
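This consistency check is easy to automate. A sketch using only the Python standard library — the inlined rules, sitemap, and domain are made up for illustration; in practice you'd fetch your own live files:

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt that disallows /admin/
robots_rules = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical sitemap that (wrongly) lists a disallowed URL
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourdomain.com/</loc></url>
  <url><loc>https://yourdomain.com/admin/reports/</loc></url>
</urlset>"""

parser = RobotFileParser()
parser.parse(robots_rules.splitlines())

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)

# Any URL in the sitemap that robots.txt blocks is a conflicting signal
conflicts = [
    loc.text
    for loc in root.findall(".//sm:loc", ns)
    if not parser.can_fetch("*", loc.text)
]
print(conflicts)  # the /admin/ URL is listed but disallowed
```

Run something like this in CI whenever either file changes, and the contradiction never ships.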


Framework Quickstart

Most frameworks handle this with plugins or built-in support:

Framework / Tool   Sitemap                        robots.txt
Next.js 13+        app/sitemap.ts (built-in)      app/robots.ts (built-in)
Nuxt 3             @nuxtjs/sitemap module         nuxt-simple-robots module
Astro              @astrojs/sitemap integration   Manual or astro-robots-txt
WordPress          Yoast SEO / Rank Math          Yoast SEO / Rank Math
Hugo               Built-in                       Manual in static/robots.txt
Gatsby             gatsby-plugin-sitemap          gatsby-plugin-robots-txt

If you're rolling a custom backend, generate the sitemap dynamically from your database/CMS and serve it at /sitemap.xml. Most frameworks make this straightforward.
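A minimal sketch of that dynamic generation, using only Python's standard library. The page list stands in for whatever your database or CMS would return — the field names here are invented for the example:

```python
import xml.etree.ElementTree as ET

# Stand-in for a database/CMS query result
pages = [
    {"url": "https://yourdomain.com/", "updated": "2026-04-15"},
    {"url": "https://yourdomain.com/about/", "updated": "2026-03-02"},
]

urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
for page in pages:
    url = ET.SubElement(urlset, "url")
    ET.SubElement(url, "loc").text = page["url"]        # must be an absolute URL
    ET.SubElement(url, "lastmod").text = page["updated"]

# Serve this at /sitemap.xml with Content-Type: application/xml
xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode())
```

Because it's generated from the same data source as your pages, the sitemap can never drift out of date — which also takes care of the "forgot to regenerate" mistake below.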


Submitting to Search Console (Don't Skip This)

Waiting for Google to find your sitemap organically can take weeks. Do this instead:

  1. Go to Google Search Console
  2. Select your property
  3. Left sidebar → Sitemaps
  4. Enter sitemap.xml → Submit

You'll see index status, errors, and how many pages Google's actually processed. Check it once a month. Fix errors promptly.


Common Mistakes (I've Made Most of These)

🚫 Blocking CSS/JS files in robots.txt — Google needs to render your pages. If you block your stylesheets, Googlebot sees a broken page.

🚫 Including noindex pages in your sitemap — If a page has <meta name="robots" content="noindex">, don't put it in the sitemap. Contradictory signals.

🚫 Hardcoding www vs non-www inconsistently — Your sitemap URLs should match your canonical domain. All https://domain.com or all https://www.domain.com. Not a mix.

🚫 Forgetting to update the sitemap — If you add new pages but never regenerate the sitemap, those pages wait in the dark. Automate this.

🚫 Using relative URLs in sitemap — Every <loc> must be an absolute URL with the full scheme: https://yourdomain.com/page/ not /page/.


Things Most People Don't Know (The Interesting Bits)

This is the part I find genuinely fascinating — the stuff that doesn't show up in beginner SEO tutorials.

robots.txt has its own crawl budget implications.
Google has a limited "crawl budget" per site — a rough cap on how many pages it will crawl in a given period. For small sites this rarely matters. For large sites (thousands of pages), a poorly written robots.txt that doesn't block useless URLs (faceted navigation, session IDs, duplicate filtered pages) can eat your crawl budget on junk pages, leaving your important content under-crawled.

Google publicly documents its robots.txt parsing rules — and they're stricter than the spec.
The original 1994 robots.txt spec was informal and never an official standard. Google published its own formal specification, open-sourced its parser, and the protocol was eventually standardized as RFC 9309 in 2022. One non-obvious rule: if your robots.txt file returns a 5xx server error, Googlebot treats the entire site as disallowed and stops crawling until it recovers. A broken server = Google treating your site as fully blocked.

noindex in HTTP headers works too — no HTML needed.
Most people know <meta name="robots" content="noindex"> in HTML. Fewer know you can send the same instruction as an HTTP response header: X-Robots-Tag: noindex. This is the only way to noindex non-HTML files like PDFs, since they have no <head> tag. Very useful for documentation or internal files served publicly.
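To make that concrete, here's a sketch using Python's stdlib `http.server` that attaches the header to PDF responses — the path check, port, and response body are all illustrative; in production you'd configure this in nginx/Apache or your framework instead:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path.endswith(".pdf"):
            # Header-level noindex: works for any content type, no <head> needed
            self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Type", "application/octet-stream")
        self.end_headers()
        self.wfile.write(b"%PDF-stub")

# HTTPServer(("", 8000), Handler).serve_forever()  # uncomment to run locally
```

Any file served through this handler with a `.pdf` path carries the noindex instruction, even though the response contains no HTML at all.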

Sitemaps have leaked pages their owners never meant to publish.
This one's more trivia than actionable — but search engines and security researchers have found accidentally public sitemaps on misconfigured servers that listed URLs the site owner never intended to expose. A sitemap is essentially a complete map of your site handed directly to any crawler that asks. Be intentional about what goes in it.

The robots.txt file is publicly readable by anyone — always.
It cannot be password protected (that would defeat its purpose). This means anyone can visit yourdomain.com/robots.txt and see exactly which paths you're trying to hide. Security researchers and curious people do this routinely. Don't rely on Disallow to obscure sensitive directory names — it's an announcement, not a curtain.

Google can index a page it has never crawled — if enough other sites link to it.
This means Disallow in robots.txt doesn't guarantee a page won't appear in search results. It just prevents Googlebot from reading the content. The URL can still show up as a "known but uncrawled" result. To truly remove a page from Google, you need noindex on the page itself (or use the URL removal tool in Search Console).


TL;DR

  • robots.txt → tells crawlers what to skip
  • sitemap.xml → tells crawlers what to index
  • Always link your sitemap from robots.txt
  • Submit to Google Search Console manually — don't wait
  • Keep them consistent with each other

Two files. One at the root of your domain. They take 20 minutes to set up properly and can meaningfully change how search engines see your site.


Every Website Has One. Go Check.

Here's something I want you to do right now.

Open a new tab and visit: youtube.com/robots.txt

You'll see hundreds of lines — specific bots being blocked, crawl-delay rules, dozens of sitemap references for different content categories. YouTube is one of the most visited sites on the planet, and they have a meticulously maintained robots.txt.

Now try it on any website you use regularly. Just append /robots.txt to any domain.

chatgpt.com/robots.txt. github.com/robots.txt. reddit.com/robots.txt. amazon.com/robots.txt.

Every single one has it. A startup's landing page. A government website. A 1999-era forum that hasn't been updated in a decade. This is what a 30-year-old voluntary internet standard looks like when it works — so universally adopted that it's basically invisible.

robots.txt isn't just a config file. It's a piece of living web history.


Tried it on a site and found something interesting in their robots.txt? Drop it in the comments.

