Muhammad Ikramullah Khan

Understanding robots.txt: The Beginner's Guide for Web Scrapers

I scraped an e-commerce website. Downloaded 1000 product pages. Felt proud.

The next day, my IP was banned. I couldn't even visit the website normally. I had no idea why.

Then someone showed me the robots.txt file. It clearly said, "Don't scrape our product pages." I had broken the rules without even knowing they existed.

Let me show you what robots.txt is and why it matters, so you don't make the same mistake.


What is robots.txt?

robots.txt is a simple text file that websites use to tell crawlers and scrapers (like yours!) what they can and cannot scrape.

Think of it like a sign on a store:

┌─────────────────────────┐
│   STORE RULES           │
│                         │
│  ✓ You can browse       │
│  ✓ You can take photos  │
│  ✗ Don't enter kitchen  │
│  ✗ Don't go to basement │
└─────────────────────────┘

robots.txt is that sign, but for websites.


Where to Find robots.txt

Every robots.txt file is in the same place:

https://website.com/robots.txt

Just add /robots.txt to the website's root URL (the domain).

Examples

Amazon:

https://www.amazon.com/robots.txt

Reddit:

https://www.reddit.com/robots.txt

Your target website:

https://example.com/robots.txt

Try it! Open your browser and visit any website's robots.txt right now.
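
If you want to build that URL in code from any page on a site, here's a quick sketch using only Python's standard library (the page URL below is just a placeholder):

from urllib.parse import urlparse, urljoin

page = 'https://example.com/products/item1?ref=abc'    # any page on the target site
root = '{0.scheme}://{0.netloc}'.format(urlparse(page))
print(urljoin(root, '/robots.txt'))                    # -> https://example.com/robots.txt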


What Does robots.txt Look Like?

Here's a simple example:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /products/
Crawl-delay: 5

Let me explain each line in plain English.


Understanding Each Line

Line 1: User-agent

User-agent: *

What it means:

  • User-agent = which crawlers these rules apply to
  • * = everyone (all crawlers)

Other examples:

User-agent: Googlebot

This means "these rules are for Google's crawler only."

User-agent: *

This means "these rules are for everyone."

Lines 2-3: Disallow

Disallow: /admin/
Disallow: /private/

What it means:

  • Don't scrape these pages
  • Stay away from these sections

Examples:

Disallow: /admin/

Don't scrape https://example.com/admin/anything

Disallow: /api/

Don't scrape https://example.com/api/anything

Disallow: /

Don't scrape ANYTHING on this website!

Line 4: Allow

Allow: /products/

What it means:

  • You CAN scrape these pages
  • This section is okay

Why have Allow?

Sometimes they disallow everything, then allow specific parts:

User-agent: *
Disallow: /
Allow: /products/

This says: "Don't scrape anything EXCEPT products"

Line 5: Crawl-delay

Crawl-delay: 5

What it means:

  • Wait 5 seconds between each request
  • Don't hammer the server
  • Be polite
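
Outside of Scrapy, honoring a crawl-delay is just a pause between requests. Here's a minimal sketch, assuming the requests library and placeholder URLs:

import time
import requests

CRAWL_DELAY = 5  # seconds, copied from the site's robots.txt (Crawl-delay: 5)

urls = [
    'https://example.com/products/1',  # placeholder URLs for illustration
    'https://example.com/products/2',
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(CRAWL_DELAY)  # wait before the next request, as robots.txt asks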

Real Examples

Example 1: Simple and Friendly

User-agent: *
Disallow: /admin/
Crawl-delay: 1

Translation:

  • Everyone can scrape
  • Except the /admin/ section
  • Wait 1 second between requests

What you should do:

  • Scrape whatever you want (except /admin/)
  • Add 1 second delay between requests
  • You're good to go!

Example 2: Mostly Blocked

User-agent: *
Disallow: /
Allow: /blog/

Translation:

  • Don't scrape anything
  • Except the blog section

What you should do:

  • Only scrape blog pages
  • Leave everything else alone

Example 3: Very Strict

User-agent: *
Disallow: /

Translation:

  • Don't scrape ANYTHING
  • Website doesn't want any scrapers

What you should do:

  • Respect their wishes
  • Don't scrape this website
  • Look for an official API instead

Example 4: Different Rules for Different Crawlers

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
Allow: /public/

Translation:

  • Google can scrape everything
  • Everyone else can only scrape /public/ section

What you should do:

  • Only scrape /public/ pages
  • Don't pretend to be Googlebot (that's dishonest)
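
To see how rules differ per crawler, you can ask Python's built-in robots.txt parser the same question with different user-agent names. A small sketch (example.com is a placeholder; robotparser is covered in more detail below):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

url = 'https://example.com/private/page'
print(parser.can_fetch('Googlebot', url))  # checked against the Googlebot rules
print(parser.can_fetch('*', url))          # checked against the rules for everyone else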

How to Check robots.txt Before Scraping

Step 1: Visit the URL

Open browser and go to:

https://yourwebsite.com/robots.txt

Step 2: Read It

Look for these key things:

1. Is everything disallowed?

Disallow: /

If yes, don't scrape.

2. Is your target section disallowed?

Disallow: /products/

If you want to scrape products, this says no.

3. Is there a crawl-delay?

Crawl-delay: 5

If yes, wait 5 seconds between requests.

Step 3: Respect It

If robots.txt says don't scrape, don't scrape.


Common Patterns Explained

Pattern 1: Block Admin Areas

Disallow: /admin/
Disallow: /login/
Disallow: /account/

What it means:
Don't scrape user accounts or admin panels.

Why:
These are private areas with personal data.

What you should do:
Never scrape these. You probably don't need this data anyway.

Pattern 2: Block API

Disallow: /api/

What it means:
Don't scrape the API directly.

Why:
They want you to use official API access.

What you should do:
Look for official API documentation. Register for API access.

Pattern 3: Block Search

Disallow: /search?
Disallow: /*?q=

What it means:
Don't scrape search results.

Why:
Search results are dynamic and put a heavy load on the server.

What you should do:
Don't scrape search. Scrape actual product pages instead.

Pattern 4: Wildcard Patterns

Disallow: /*.pdf$
Disallow: /*?sessionid=

What it means:

  • Don't download PDFs
  • Don't scrape pages with session IDs

Why:
PDFs are large files. Session IDs are personal/temporary.


What If There's No robots.txt?

If you visit:

https://example.com/robots.txt

And get a 404 error (not found), what does that mean?

Answer: No specific rules

But this doesn't mean "scrape everything as fast as you want!"

It means:

  • Be polite
  • Add delays
  • Don't hammer the server
  • Follow general ethical guidelines
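
A quick way to check whether a robots.txt exists at all is to look at the HTTP status code. A minimal sketch with the requests library (example.com is a placeholder):

import requests

response = requests.get('https://example.com/robots.txt', timeout=10)

if response.status_code == 404:
    print("No robots.txt - no explicit rules, but still scrape politely")
elif response.ok:
    print(response.text)  # read the rules before writing your scraper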

Using robots.txt in Your Code

Method 1: Check Manually

Before writing any scraper:

  1. Visit the robots.txt
  2. Read the rules
  3. Code accordingly

Method 2: Let Scrapy Handle It

# settings.py
ROBOTSTXT_OBEY = True

That's it! Scrapy automatically:

  • Downloads robots.txt
  • Reads the rules
  • Skips disallowed URLs

One caveat: Scrapy's robots.txt middleware doesn't apply Crawl-delay for you, so set DOWNLOAD_DELAY yourself (see the sketch below).
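
In a real project these lines live in settings.py. ROBOTSTXT_OBEY and DOWNLOAD_DELAY are standard Scrapy settings; the 1-second value below is just a polite default, not something read from robots.txt:

# settings.py (Scrapy project)
ROBOTSTXT_OBEY = True   # download robots.txt and skip disallowed URLs automatically
DOWNLOAD_DELAY = 1      # Scrapy doesn't apply Crawl-delay for you, so set a delay here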

Example:

If robots.txt says:

Disallow: /admin/

And you try to scrape /admin/, Scrapy will skip it automatically.

Method 3: Using robotparser (Python)

If you're not using Scrapy:

from urllib.robotparser import RobotFileParser

# Create parser
parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# Check if URL is allowed
url = 'https://example.com/products/item1'
if parser.can_fetch('*', url):
    print("Allowed to scrape!")
else:
    print("Not allowed!")
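
The same parser can also report a site's Crawl-delay (available since Python 3.6). A short sketch, again with example.com as a placeholder:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# crawl_delay() returns the Crawl-delay value for a user agent, or None if none is set
delay = parser.crawl_delay('*')
print(delay if delay is not None else "No crawl-delay - pick a polite default yourself")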

Real-World Example

Let's say you want to scrape a bookstore website.

Step 1: Check robots.txt

Visit:

https://bookstore.com/robots.txt

You see:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /books/
Crawl-delay: 2

Step 2: Understand the Rules

Allowed:

  • /books/ (the books section)

Not Allowed:

  • /cart/ (shopping carts)
  • /checkout/ (checkout process)
  • /account/ (user accounts)

Must Do:

  • Wait 2 seconds between requests

Step 3: Write Your Scraper

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'

    # Only scrape allowed sections
    start_urls = ['https://bookstore.com/books/']

    custom_settings = {
        'ROBOTSTXT_OBEY': True,  # Respect robots.txt
        'DOWNLOAD_DELAY': 2      # 2 second delay
    }

    def parse(self, response):
        for book in response.css('.book'):
            yield {
                'title': book.css('.title::text').get(),
                'price': book.css('.price::text').get()
            }

Step 4: What This Does

  • Only scrapes /books/ section (respects Allow)
  • Skips cart, checkout, account (respects Disallow)
  • Waits 2 seconds between pages (respects Crawl-delay)

Perfect!


Common Questions

Q: Is robots.txt legally binding?

Answer: Not technically, but:

  • It's the website's wishes
  • Ignoring it is unethical
  • Can get you banned
  • Could lead to legal issues in some cases

Bottom line: Always respect it.

Q: What if robots.txt is too restrictive?

Answer: You have options:

  1. Look for an official API
  2. Contact website owner for permission
  3. Find a different data source
  4. Accept the limitations

Don't break the rules!

Q: Can I scrape if robots.txt doesn't exist?

Answer: Technically yes, but:

  • Still be polite
  • Add delays
  • Don't overwhelm the server
  • Follow ethical guidelines

Q: What's a reasonable crawl-delay if none is specified?

Answer: At least 1 second between requests. For large scraping jobs, 2-3 seconds is better.

Q: Do all scrapers obey robots.txt?

Answer: No. Bad scrapers ignore it. Don't be like them.

Good scrapers respect robots.txt. Be a good scraper.


Why Respecting robots.txt Matters

Reason 1: It's Polite

Websites work hard to provide content. Respect their rules.

Reason 2: Prevents Bans

Websites monitor crawler behavior. If you ignore robots.txt, you're likely to get banned.

Reason 3: Reduces Server Load

Crawl-delays exist to protect servers. Respect them.

Reason 4: It's Ethical

Just like you wouldn't walk into someone's house uninvited, don't scrape where you're not welcome.

Reason 5: Legal Protection

Respecting robots.txt shows you're acting in good faith.


Red Flags to Watch For

Red Flag 1: Disallow Everything

User-agent: *
Disallow: /

What to do: Don't scrape. Look for API or contact owner.

Red Flag 2: High Crawl-Delay

Crawl-delay: 60

60 seconds between requests? They really don't want scrapers.

What to do: Respect it or find another source.

Red Flag 3: Specific Bot Blocks

User-agent: BadBot
Disallow: /

If they're blocking specific scrapers by name, they're serious about anti-scraping.


Quick Reference Guide

Before scraping ANY website:

  1. Visit robots.txt:
   https://website.com/robots.txt
  2. Check for Disallow:
   Disallow: /your-target-section/

If your section is disallowed, don't scrape it.

  3. Check for Crawl-delay:
   Crawl-delay: 5

Add this delay to your scraper.

  4. In Scrapy, enable it:
   ROBOTSTXT_OBEY = True
  5. Be extra polite: Even if allowed, add delays and be respectful.
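
If you'd rather automate this checklist, here's a minimal pre-flight helper using only Python's standard library. The target URL and the 1-second fallback delay are assumptions for illustration, not rules from any particular site:

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def preflight_check(target_url, user_agent='*'):
    """Return (allowed, delay) for target_url based on that site's robots.txt."""
    root = '{0.scheme}://{0.netloc}'.format(urlparse(target_url))
    parser = RobotFileParser()
    parser.set_url(urljoin(root, '/robots.txt'))
    parser.read()
    allowed = parser.can_fetch(user_agent, target_url)
    delay = parser.crawl_delay(user_agent) or 1  # assume 1 second if no Crawl-delay is set
    return allowed, delay

# Hypothetical example
allowed, delay = preflight_check('https://example.com/products/item1')
print(f"Allowed: {allowed}, wait {delay}s between requests")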

Example robots.txt Files

The snippets below are simplified, illustrative excerpts; real robots.txt files change over time, so always check the live file yourself.

Example 1: Reddit

User-agent: *
Crawl-delay: 2
Disallow: /r/*/submit
Disallow: /r/*/message

What it means:

  • 2 second delay
  • Can scrape posts
  • Can't scrape submit forms or messages

Example 2: Amazon

User-agent: *
Disallow: /ap/
Disallow: /gp/cart/
Allow: /product/

What it means:

  • Can scrape product pages
  • Can't scrape login or cart

Example 3: Stack Overflow

User-agent: *
Crawl-delay: 1
Disallow: /users/

What it means:

  • 1 second delay
  • Can scrape questions/answers
  • Can't scrape user profiles

Simple Decision Tree

Should I scrape this website?

Check robots.txt
    |
    ├─ Says "Disallow: /" 
    |       → Don't scrape. Look for API.
    |
    ├─ Says "Disallow: /my-section/"
    |       → Don't scrape that section. Try others.
    |
    ├─ Says nothing about my section
    |       → You can scrape! But add delays.
    |
    └─ 404 (no robots.txt)
            → You can scrape, but be extra polite.

Summary

What is robots.txt?
A file that tells you what you can and cannot scrape.

Where to find it:

https://website.com/robots.txt

Key things to look for:

  • Disallow: (what you can't scrape)
  • Allow: (what you can scrape)
  • Crawl-delay: (how long to wait)

In Scrapy:

ROBOTSTXT_OBEY = True

Remember:

  • Always check robots.txt first
  • Respect what it says
  • Add delays between requests
  • Be polite to servers
  • If in doubt, don't scrape

The golden rule:
Treat websites the way you'd want your website to be treated.


Practice Exercise

Try this right now:

  1. Pick 3 websites you use regularly
  2. Visit their robots.txt files
  3. Read what they allow/disallow
  4. Notice the patterns

Examples to try:

  • https://www.amazon.com/robots.txt
  • https://www.reddit.com/robots.txt
  • https://stackoverflow.com/robots.txt

See what real websites say about scraping!


Final Thoughts

robots.txt is not complicated. It's just a simple text file with simple rules.

But respecting it is what separates professional scrapers from amateur ones.

Good scrapers:

  • Check robots.txt
  • Respect the rules
  • Add appropriate delays
  • Don't get banned

Bad scrapers:

  • Ignore robots.txt
  • Scrape everything fast
  • Get banned quickly
  • Give scraping a bad name

Be a good scraper. Always start with robots.txt.

Happy (and ethical) scraping! 🕷️
