Muhammad Ikramullah Khan

Understanding robots.txt: The Beginner's Guide for Web Scrapers

I scraped an e-commerce website. Downloaded 1000 product pages. Felt proud.

The next day, my IP was banned. I couldn't even visit the website normally. I had no idea why.

Then someone showed me the robots.txt file. It clearly said, "Don't scrape our product pages." I had broken the rules without even knowing they existed.

Let me show you what robots.txt is and why it matters, so you don't make the same mistake.


What is robots.txt?

robots.txt is a simple text file that websites use to tell crawlers and scrapers (like yours!) what they can and cannot scrape.

Think of it like a sign on a store:

┌─────────────────────────┐
│   STORE RULES           │
│                         │
│  ✓ You can browse       │
│  ✓ You can take photos  │
│  ✗ Don't enter kitchen  │
│  ✗ Don't go to basement │
└─────────────────────────┘

robots.txt is that sign, but for websites.


Where to Find robots.txt

Every robots.txt file is in the same place:

https://website.com/robots.txt

Just add /robots.txt to the website's root URL (the domain).

Examples

Amazon:

https://www.amazon.com/robots.txt

Reddit:

https://www.reddit.com/robots.txt

Your target website:

https://example.com/robots.txt

Try it! Open your browser and visit any website's robots.txt right now.
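
If you want to build that URL in code from any page on a site, here's a quick sketch using only Python's standard library (the page URL below is just a placeholder):

from urllib.parse import urlparse, urljoin

page = 'https://example.com/products/item1?ref=abc'    # any page on the target site
root = '{0.scheme}://{0.netloc}'.format(urlparse(page))
print(urljoin(root, '/robots.txt'))                    # -> https://example.com/robots.txt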


What Does robots.txt Look Like?

Here's a simple example:

User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /products/
Crawl-delay: 5

Let me explain each line in plain English.


Understanding Each Line

Line 1: User-agent

User-agent: *

What it means:

  • User-agent = which crawlers these rules apply to
  • * = everyone (all crawlers)

Other examples:

User-agent: Googlebot

This means "these rules are for Google's crawler only."

User-agent: *

This means "these rules are for everyone."

Lines 2-3: Disallow

Disallow: /admin/
Disallow: /private/

What it means:

  • Don't scrape these pages
  • Stay away from these sections

Examples:

Disallow: /admin/

Don't scrape https://example.com/admin/anything

Disallow: /api/

Don't scrape https://example.com/api/anything

Disallow: /

Don't scrape ANYTHING on this website!

Line 4: Allow

Allow: /products/

What it means:

  • You CAN scrape these pages
  • This section is okay

Why have Allow?

Sometimes they disallow everything, then allow specific parts:

User-agent: *
Disallow: /
Allow: /products/

This says: "Don't scrape anything EXCEPT products"

Line 5: Crawl-delay

Crawl-delay: 5

What it means:

  • Wait 5 seconds between each request
  • Don't hammer the server
  • Be polite
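
Outside of Scrapy, honoring a crawl-delay is just a pause between requests. Here's a minimal sketch, assuming the requests library and placeholder URLs:

import time
import requests

CRAWL_DELAY = 5  # seconds, copied from the site's robots.txt (Crawl-delay: 5)

urls = [
    'https://example.com/products/1',  # placeholder URLs for illustration
    'https://example.com/products/2',
]

for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(CRAWL_DELAY)  # wait before the next request, as robots.txt asks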

Real Examples

Example 1: Simple and Friendly

User-agent: *
Disallow: /admin/
Crawl-delay: 1

Translation:

  • Everyone can scrape
  • Except the /admin/ section
  • Wait 1 second between requests

What you should do:

  • Scrape whatever you want (except /admin/)
  • Add 1 second delay between requests
  • You're good to go!

Example 2: Mostly Blocked

User-agent: *
Disallow: /
Allow: /blog/

Translation:

  • Don't scrape anything
  • Except the blog section

What you should do:

  • Only scrape blog pages
  • Leave everything else alone

Example 3: Very Strict

User-agent: *
Disallow: /

Translation:

  • Don't scrape ANYTHING
  • Website doesn't want any scrapers

What you should do:

  • Respect their wishes
  • Don't scrape this website
  • Look for an official API instead

Example 4: Different Rules for Different Crawlers

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
Allow: /public/

Translation:

  • Google can scrape everything
  • Everyone else can only scrape /public/ section

What you should do:

  • Only scrape /public/ pages
  • Don't pretend to be Googlebot (that's dishonest)
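
To see how rules differ per crawler, you can ask Python's built-in robots.txt parser the same question with different user-agent names. A small sketch (example.com is a placeholder; robotparser is covered in more detail below):

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

url = 'https://example.com/private/page'
print(parser.can_fetch('Googlebot', url))  # checked against the Googlebot rules
print(parser.can_fetch('*', url))          # checked against the rules for everyone else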

How to Check robots.txt Before Scraping

Step 1: Visit the URL

Open browser and go to:

https://yourwebsite.com/robots.txt

Step 2: Read It

Look for these key things:

1. Is everything disallowed?

Disallow: /

If yes, don't scrape.

2. Is your target section disallowed?

Disallow: /products/

If you want to scrape products, this says no.

3. Is there a crawl-delay?

Crawl-delay: 5

If yes, wait 5 seconds between requests.

Step 3: Respect It

If robots.txt says don't scrape, don't scrape.


Common Patterns Explained

Pattern 1: Block Admin Areas

Disallow: /admin/
Disallow: /login/
Disallow: /account/

What it means:
Don't scrape user accounts or admin panels.

Why:
These are private areas with personal data.

What you should do:
Never scrape these. You probably don't need this data anyway.

Pattern 2: Block API

Disallow: /api/

What it means:
Don't scrape the API directly.

Why:
They want you to use official API access.

What you should do:
Look for official API documentation. Register for API access.

Pattern 3: Block Search

Disallow: /search?
Disallow: /*?q=

What it means:
Don't scrape search results.

Why:
Search results are dynamic and put a heavy load on the server.

What you should do:
Don't scrape search. Scrape actual product pages instead.

Pattern 4: Wildcard Patterns

Disallow: /*.pdf$
Disallow: /*?sessionid=

What it means:

  • Don't download PDFs
  • Don't scrape pages with session IDs

Why:
PDFs are large files. Session IDs are personal/temporary.


What If There's No robots.txt?

If you visit:

https://example.com/robots.txt

And get a 404 error (not found), what does that mean?

Answer: No specific rules

But this doesn't mean "scrape everything as fast as you want!"

It means:

  • Be polite
  • Add delays
  • Don't hammer the server
  • Follow general ethical guidelines
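
A quick way to check whether a robots.txt exists at all is to look at the HTTP status code. A minimal sketch with the requests library (example.com is a placeholder):

import requests

response = requests.get('https://example.com/robots.txt', timeout=10)

if response.status_code == 404:
    print("No robots.txt - no explicit rules, but still scrape politely")
elif response.ok:
    print(response.text)  # read the rules before writing your scraper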

Using robots.txt in Your Code

Method 1: Check Manually

Before writing any scraper:

  1. Visit the robots.txt
  2. Read the rules
  3. Code accordingly

Method 2: Let Scrapy Handle It

# settings.py
ROBOTSTXT_OBEY = True

That's it! Scrapy automatically:

  • Downloads robots.txt
  • Reads the rules
  • Skips disallowed URLs

One caveat: Scrapy's robots.txt middleware doesn't apply Crawl-delay for you, so set DOWNLOAD_DELAY yourself (see the sketch below).
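
In a real project these lines live in settings.py. ROBOTSTXT_OBEY and DOWNLOAD_DELAY are standard Scrapy settings; the 1-second value below is just a polite default, not something read from robots.txt:

# settings.py (Scrapy project)
ROBOTSTXT_OBEY = True   # download robots.txt and skip disallowed URLs automatically
DOWNLOAD_DELAY = 1      # Scrapy doesn't apply Crawl-delay for you, so set a delay here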

Example:

If robots.txt says:

Disallow: /admin/

And you try to scrape /admin/, Scrapy will skip it automatically.

Method 3: Using robotparser (Python)

If you're not using Scrapy:

from urllib.robotparser import RobotFileParser

# Create parser
parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# Check if URL is allowed
url = 'https://example.com/products/item1'
if parser.can_fetch('*', url):
    print("Allowed to scrape!")
else:
    print("Not allowed!")
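
The same parser can also report a site's Crawl-delay (available since Python 3.6). A short sketch, again with example.com as a placeholder:

from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# crawl_delay() returns the Crawl-delay value for a user agent, or None if none is set
delay = parser.crawl_delay('*')
print(delay if delay is not None else "No crawl-delay - pick a polite default yourself")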

Real-World Example

Let's say you want to scrape a bookstore website.

Step 1: Check robots.txt

Visit:

https://bookstore.com/robots.txt

You see:

User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /books/
Crawl-delay: 2

Step 2: Understand the Rules

Allowed:

  • /books/ (the books section)

Not Allowed:

  • /cart/ (shopping carts)
  • /checkout/ (checkout process)
  • /account/ (user accounts)

Must Do:

  • Wait 2 seconds between requests

Step 3: Write Your Scraper

import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'

    # Only scrape allowed sections
    start_urls = ['https://bookstore.com/books/']

    custom_settings = {
        'ROBOTSTXT_OBEY': True,  # Respect robots.txt
        'DOWNLOAD_DELAY': 2      # 2 second delay
    }

    def parse(self, response):
        for book in response.css('.book'):
            yield {
                'title': book.css('.title::text').get(),
                'price': book.css('.price::text').get()
            }

Step 4: What This Does

  • Only scrapes /books/ section (respects Allow)
  • Skips cart, checkout, account (respects Disallow)
  • Waits 2 seconds between pages (respects Crawl-delay)

Perfect!


Common Questions

Q: Is robots.txt legally binding?

Answer: Not technically, but:

  • It's the website's wishes
  • Ignoring it is unethical
  • Can get you banned
  • Could lead to legal issues in some cases

Bottom line: Always respect it.

Q: What if robots.txt is too restrictive?

Answer: You have options:

  1. Look for an official API
  2. Contact website owner for permission
  3. Find a different data source
  4. Accept the limitations

Don't break the rules!

Q: Can I scrape if robots.txt doesn't exist?

Answer: Technically yes, but:

  • Still be polite
  • Add delays
  • Don't overwhelm the server
  • Follow ethical guidelines

Q: What's a reasonable crawl-delay if none is specified?

Answer: At least 1 second between requests. For large scraping jobs, 2-3 seconds is better.

Q: Do all scrapers obey robots.txt?

Answer: No. Bad scrapers ignore it. Don't be like them.

Good scrapers respect robots.txt. Be a good scraper.


Why Respecting robots.txt Matters

Reason 1: It's Polite

Websites work hard to provide content. Respect their rules.

Reason 2: Prevents Bans

Websites monitor crawler behavior. If you ignore robots.txt, you're likely to get banned.

Reason 3: Reduces Server Load

Crawl-delays exist to protect servers. Respect them.

Reason 4: It's Ethical

Just like you wouldn't walk into someone's house uninvited, don't scrape where you're not welcome.

Reason 5: Legal Protection

Respecting robots.txt shows you're acting in good faith.


Red Flags to Watch For

Red Flag 1: Disallow Everything

User-agent: *
Disallow: /

What to do: Don't scrape. Look for API or contact owner.

Red Flag 2: High Crawl-Delay

Crawl-delay: 60

60 seconds between requests? They really don't want scrapers.

What to do: Respect it or find another source.

Red Flag 3: Specific Bot Blocks

User-agent: BadBot
Disallow: /

If they're blocking specific scrapers by name, they're serious about anti-scraping.


Quick Reference Guide

Before scraping ANY website:

  1. Visit robots.txt:
   https://website.com/robots.txt
  2. Check for Disallow:
   Disallow: /your-target-section/

If your section is disallowed, don't scrape it.

  3. Check for Crawl-delay:
   Crawl-delay: 5

Add this delay to your scraper.

  4. In Scrapy, enable it:
   ROBOTSTXT_OBEY = True
  5. Be extra polite: Even if allowed, add delays and be respectful.
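
If you'd rather automate this checklist, here's a minimal pre-flight helper using only Python's standard library. The target URL and the 1-second fallback delay are assumptions for illustration, not rules from any particular site:

from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

def preflight_check(target_url, user_agent='*'):
    """Return (allowed, delay) for target_url based on that site's robots.txt."""
    root = '{0.scheme}://{0.netloc}'.format(urlparse(target_url))
    parser = RobotFileParser()
    parser.set_url(urljoin(root, '/robots.txt'))
    parser.read()
    allowed = parser.can_fetch(user_agent, target_url)
    delay = parser.crawl_delay(user_agent) or 1  # assume 1 second if no Crawl-delay is set
    return allowed, delay

# Hypothetical example
allowed, delay = preflight_check('https://example.com/products/item1')
print(f"Allowed: {allowed}, wait {delay}s between requests")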

Example robots.txt Files

The snippets below are simplified, illustrative excerpts; real robots.txt files change over time, so always check the live file yourself.

Example 1: Reddit

User-agent: *
Crawl-delay: 2
Disallow: /r/*/submit
Disallow: /r/*/message

What it means:

  • 2 second delay
  • Can scrape posts
  • Can't scrape submit forms or messages

Example 2: Amazon

User-agent: *
Disallow: /ap/
Disallow: /gp/cart/
Allow: /product/

What it means:

  • Can scrape product pages
  • Can't scrape login or cart

Example 3: Stack Overflow

User-agent: *
Crawl-delay: 1
Disallow: /users/

What it means:

  • 1 second delay
  • Can scrape questions/answers
  • Can't scrape user profiles

Simple Decision Tree

Should I scrape this website?

Check robots.txt
    |
    ├─ Says "Disallow: /" 
    |       → Don't scrape. Look for API.
    |
    ├─ Says "Disallow: /my-section/"
    |       → Don't scrape that section. Try others.
    |
    ├─ Says nothing about my section
    |       → You can scrape! But add delays.
    |
    └─ 404 (no robots.txt)
            → You can scrape, but be extra polite.

Summary

What is robots.txt?
A file that tells you what you can and cannot scrape.

Where to find it:

https://website.com/robots.txt

Key things to look for:

  • Disallow: (what you can't scrape)
  • Allow: (what you can scrape)
  • Crawl-delay: (how long to wait)

In Scrapy:

ROBOTSTXT_OBEY = True

Remember:

  • Always check robots.txt first
  • Respect what it says
  • Add delays between requests
  • Be polite to servers
  • If in doubt, don't scrape

The golden rule:
Treat websites the way you'd want your website to be treated.


Practice Exercise

Try this right now:

  1. Pick 3 websites you use regularly
  2. Visit their robots.txt files
  3. Read what they allow/disallow
  4. Notice the patterns

Examples to try:

  • https://www.amazon.com/robots.txt
  • https://www.reddit.com/robots.txt
  • https://stackoverflow.com/robots.txt

See what real websites say about scraping!


Final Thoughts

robots.txt is not complicated. It's just a simple text file with simple rules.

But respecting it is what separates professional scrapers from amateur ones.

Good scrapers:

  • Check robots.txt
  • Respect the rules
  • Add appropriate delays
  • Don't get banned

Bad scrapers:

  • Ignore robots.txt
  • Scrape everything fast
  • Get banned quickly
  • Give scraping a bad name

Be a good scraper. Always start with robots.txt.

Happy (and ethical) scraping! 🕷️
