I scraped an e-commerce website. Downloaded 1000 product pages. Felt proud.
The next day, my IP was banned. I couldn't even visit the website normally. I had no idea why.
Then someone showed me the robots.txt file. It clearly said, "Don't scrape our product pages." I had broken the rules without even knowing they existed.
Let me show you what robots.txt is and why it matters, so you don't make the same mistake.
What is robots.txt?
robots.txt is a simple text file that websites use to tell crawlers (like you!) what they can and cannot scrape.
Think of it like a sign on a store:
┌─────────────────────────┐
│ STORE RULES │
│ │
│ ✓ You can browse │
│ ✓ You can take photos │
│ ✗ Don't enter kitchen │
│ ✗ Don't go to basement │
└─────────────────────────┘
robots.txt is that sign, but for websites.
Where to Find robots.txt
Every robots.txt file is in the same place:
https://website.com/robots.txt
Just add /robots.txt to any website's main URL.
Examples
Amazon:
https://www.amazon.com/robots.txt
Reddit:
https://www.reddit.com/robots.txt
Your target website:
https://example.com/robots.txt
Try it! Open your browser and visit any website's robots.txt right now.
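If you'd rather check from code, a few lines of Python will fetch and print the file (example.com is just a placeholder here):
from urllib.request import urlopen
from urllib.error import HTTPError

# Fetch and print a site's robots.txt ('example.com' is just a placeholder)
url = 'https://example.com/robots.txt'
try:
    with urlopen(url) as response:
        print(response.read().decode('utf-8'))
except HTTPError as err:
    print(f"No robots.txt found (HTTP {err.code})")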
What Does robots.txt Look Like?
Here's a simple example:
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /products/
Crawl-delay: 5
Let me explain each line in plain English.
Understanding Each Line
Line 1: User-agent
User-agent: *
What it means:
- User-agent = which crawlers these rules apply to
- * = everyone (all crawlers)
Other examples:
User-agent: Googlebot
This means "these rules are for Google's crawler only"
User-agent: *
This means "these rules are for everyone"
Lines 2-3: Disallow
Disallow: /admin/
Disallow: /private/
What it means:
- Don't scrape these pages
- Stay away from these sections
Examples:
Disallow: /admin/
Don't scrape https://example.com/admin/anything
Disallow: /api/
Don't scrape https://example.com/api/anything
Disallow: /
Don't scrape ANYTHING on this website!
Line 4: Allow
Allow: /products/
What it means:
- You CAN scrape these pages
- This section is okay
Why have Allow?
Sometimes they disallow everything, then allow specific parts:
User-agent: *
Disallow: /
Allow: /products/
This says: "Don't scrape anything EXCEPT products"
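Curious how these rules actually match URLs? Each rule is a path prefix, and in the most common interpretation (used by Google and most modern crawlers) the most specific, i.e. longest, matching rule wins. Here's a rough sketch of that idea in Python; it ignores wildcards and other details a real parser handles:
def is_allowed(path, rules):
    """Rough sketch of robots.txt matching: longest matching prefix wins.
    rules is a list of (directive, path_prefix) tuples. Ignores wildcards."""
    best = None  # longest matching rule seen so far
    for directive, prefix in rules:
        if path.startswith(prefix):
            if best is None or len(prefix) > len(best[1]):
                best = (directive, prefix)
    if best is None:
        return True  # no rule matched: allowed by default
    return best[0] == 'Allow'

rules = [('Disallow', '/'), ('Allow', '/products/')]
print(is_allowed('/products/item1', rules))  # True
print(is_allowed('/admin/users', rules))     # False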
Line 5: Crawl-delay
Crawl-delay: 5
What it means:
- Wait 5 seconds between each request
- Don't hammer the server
- Be polite
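In plain Python, honoring a crawl-delay is just a matter of sleeping between requests. A minimal sketch (the URLs are placeholders):
import time
from urllib.request import urlopen

urls = [
    'https://example.com/products/1',
    'https://example.com/products/2',
]
for url in urls:
    with urlopen(url) as response:
        html = response.read().decode('utf-8')
    # ... parse html here ...
    time.sleep(5)  # honor "Crawl-delay: 5" before the next request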
Real Examples
Example 1: Simple and Friendly
User-agent: *
Disallow: /admin/
Crawl-delay: 1
Translation:
- Everyone can scrape
- Except the /admin/ section
- Wait 1 second between requests
What you should do:
- Scrape whatever you want (except /admin/)
- Add 1 second delay between requests
- You're good to go!
Example 2: Mostly Blocked
User-agent: *
Disallow: /
Allow: /blog/
Translation:
- Don't scrape anything
- Except the blog section
What you should do:
- Only scrape blog pages
- Leave everything else alone
Example 3: Very Strict
User-agent: *
Disallow: /
Translation:
- Don't scrape ANYTHING
- Website doesn't want any scrapers
What you should do:
- Respect their wishes
- Don't scrape this website
- Look for an official API instead
Example 4: Different Rules for Different Crawlers
User-agent: Googlebot
Allow: /
User-agent: *
Disallow: /
Allow: /public/
Translation:
- Google can scrape everything
- Everyone else can only scrape /public/ section
What you should do:
- Only scrape /public/ pages
- Don't pretend to be Googlebot (that's dishonest)
How to Check robots.txt Before Scraping
Step 1: Visit the URL
Open browser and go to:
https://yourwebsite.com/robots.txt
Step 2: Read It
Look for these key things:
1. Is everything disallowed?
Disallow: /
If yes, don't scrape.
2. Is your target section disallowed?
Disallow: /products/
If you want to scrape products, this says no.
3. Is there a crawl-delay?
Crawl-delay: 5
If yes, wait 5 seconds between requests.
Step 3: Respect It
If robots.txt says don't scrape, don't scrape.
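If you only want a quick look at the lines that matter, a tiny helper can fetch the file and print just the rule lines. This is a convenience sketch, not a full parser:
from urllib.request import urlopen

def show_rules(site):
    """Print only the robots.txt lines you usually care about."""
    with urlopen(f'{site}/robots.txt') as response:
        text = response.read().decode('utf-8', errors='replace')
    for line in text.splitlines():
        if line.strip().lower().startswith(('user-agent', 'disallow', 'allow', 'crawl-delay')):
            print(line)

show_rules('https://www.reddit.com')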
Common Patterns Explained
Pattern 1: Block Admin Areas
Disallow: /admin/
Disallow: /login/
Disallow: /account/
What it means:
Don't scrape user accounts or admin panels.
Why:
These are private areas with personal data.
What you should do:
Never scrape these. You probably don't need this data anyway.
Pattern 2: Block API
Disallow: /api/
What it means:
Don't scrape the API directly.
Why:
They want you to use official API access.
What you should do:
Look for official API documentation. Register for API access.
Pattern 3: Block Search
Disallow: /search?
Disallow: /*?q=
What it means:
Don't scrape search results.
Why:
Search results are dynamic and put a heavy load on the server.
What you should do:
Don't scrape search. Scrape actual product pages instead.
Pattern 4: Wildcard Patterns
Disallow: /*.pdf$
Disallow: /*?sessionid=
What it means:
- Don't download PDFs
- Don't scrape pages with session IDs
Why:
PDFs are large files. Session IDs are personal/temporary.
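These wildcard rules go beyond simple prefixes: * matches any run of characters and a trailing $ anchors the end of the URL. Here's a rough sketch of how such a pattern could be translated into a regular expression; real parsers have more complete logic:
import re

def wildcard_to_regex(rule):
    """Convert a robots.txt wildcard rule into a regex (simplified sketch)."""
    anchored = rule.endswith('$')
    pattern = re.escape(rule.rstrip('$')).replace(r'\*', '.*')
    return re.compile('^' + pattern + ('$' if anchored else ''))

pdf_rule = wildcard_to_regex('/*.pdf$')
print(bool(pdf_rule.match('/files/report.pdf')))      # True: blocked
print(bool(pdf_rule.match('/files/report.pdf?v=2')))  # False: doesn't end in .pdf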
What If There's No robots.txt?
If you visit:
https://example.com/robots.txt
And get a 404 error (not found), what does that mean?
Answer: No specific rules
But this doesn't mean "scrape everything as fast as you want!"
It means:
- Be polite
- Add delays
- Don't hammer the server
- Follow general ethical guidelines
Using robots.txt in Your Code
Method 1: Check Manually
Before writing any scraper:
- Visit the robots.txt
- Read the rules
- Code accordingly
Method 2: Let Scrapy Handle It
# settings.py
ROBOTSTXT_OBEY = True
That's it! Scrapy automatically:
- Downloads robots.txt
- Reads the rules
- Skips disallowed URLs
One caveat: ROBOTSTXT_OBEY does not apply Crawl-delay for you. Set DOWNLOAD_DELAY yourself to match it (see the settings sketch below).
Example:
If robots.txt says:
Disallow: /admin/
And you try to scrape /admin/, Scrapy will skip it automatically.
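A minimal settings.py sketch that pairs robots.txt obedience with a manual delay (set the delay to whatever the site's Crawl-delay asks for; the bot name is just a placeholder):
# settings.py (sketch)
BOT_NAME = 'mybot'     # placeholder project name

ROBOTSTXT_OBEY = True  # skip URLs the robots.txt disallows
DOWNLOAD_DELAY = 5     # match the site's Crawl-delay yourself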
Method 3: Using robotparser (Python)
If you're not using Scrapy:
from urllib.robotparser import RobotFileParser

# Create parser and load the site's robots.txt
parser = RobotFileParser()
parser.set_url('https://example.com/robots.txt')
parser.read()

# Check if a URL is allowed for any crawler ('*')
url = 'https://example.com/products/item1'
if parser.can_fetch('*', url):
    print("Allowed to scrape!")
else:
    print("Not allowed!")
Real-World Example
Let's say you want to scrape a bookstore website.
Step 1: Check robots.txt
Visit:
https://bookstore.com/robots.txt
You see:
User-agent: *
Disallow: /cart/
Disallow: /checkout/
Disallow: /account/
Allow: /books/
Crawl-delay: 2
Step 2: Understand the Rules
Allowed:
- /books/ (the books section)
Not Allowed:
- /cart/ (shopping carts)
- /checkout/ (checkout process)
- /account/ (user accounts)
Must Do:
- Wait 2 seconds between requests
Step 3: Write Your Scraper
import scrapy

class BookSpider(scrapy.Spider):
    name = 'books'

    # Only scrape allowed sections
    start_urls = ['https://bookstore.com/books/']

    custom_settings = {
        'ROBOTSTXT_OBEY': True,   # Respect robots.txt
        'DOWNLOAD_DELAY': 2,      # 2 second delay (matches Crawl-delay: 2)
    }

    def parse(self, response):
        for book in response.css('.book'):
            yield {
                'title': book.css('.title::text').get(),
                'price': book.css('.price::text').get(),
            }
Step 4: What This Does
- Only scrapes the /books/ section (respects Allow)
- Skips cart, checkout, account (respects Disallow)
- Waits 2 seconds between pages (DOWNLOAD_DELAY matches the Crawl-delay)
Perfect!
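To run it, save the spider to a file (the filename below is just an example) and use Scrapy's runspider command:
scrapy runspider book_spider.py -o books.json
The -o flag writes the scraped items to a JSON file.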
Common Questions
Q: Is robots.txt legally binding?
Answer: Not technically, but:
- It's the website's wishes
- Ignoring it is unethical
- Can get you banned
- Could lead to legal issues in some cases
Bottom line: Always respect it.
Q: What if robots.txt is too restrictive?
Answer: You have options:
- Look for an official API
- Contact website owner for permission
- Find a different data source
- Accept the limitations
Don't break the rules!
Q: Can I scrape if robots.txt doesn't exist?
Answer: Technically yes, but:
- Still be polite
- Add delays
- Don't overwhelm server
- Follow ethical guidelines
Q: What's a reasonable crawl-delay if none is specified?
Answer: At least 1 second between requests. For large scraping jobs, 2-3 seconds is better.
Q: Do all scrapers obey robots.txt?
Answer: No. Bad scrapers ignore it. Don't be like them.
Good scrapers respect robots.txt. Be a good scraper.
Why Respecting robots.txt Matters
Reason 1: It's Polite
Websites work hard to provide content. Respect their rules.
Reason 2: Prevents Bans
Websites monitor crawler behavior. If you ignore robots.txt, you'll get banned.
Reason 3: Reduces Server Load
Crawl-delays exist to protect servers. Respect them.
Reason 4: It's Ethical
Just like you wouldn't walk into someone's house uninvited, don't scrape where you're not welcome.
Reason 5: Legal Protection
Respecting robots.txt shows you're acting in good faith.
Red Flags to Watch For
Red Flag 1: Disallow Everything
User-agent: *
Disallow: /
What to do: Don't scrape. Look for API or contact owner.
Red Flag 2: High Crawl-Delay
Crawl-delay: 60
60 seconds between requests? They really don't want scrapers.
What to do: Respect it or find another source.
Red Flag 3: Specific Bot Blocks
User-agent: BadBot
Disallow: /
If they're blocking specific scrapers by name, they're serious about keeping scrapers out.
Quick Reference Guide
Before scraping ANY website:
- Visit robots.txt:
https://website.com/robots.txt
- Check for Disallow:
Disallow: /your-target-section/
If your section is disallowed, don't scrape it.
- Check for Crawl-delay:
Crawl-delay: 5
Add this delay to your scraper.
- In Scrapy, enable it:
ROBOTSTXT_OBEY = True
- Be extra polite: Even if allowed, add delays and be respectful.
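If you like, you can roll this checklist into one small helper with the standard library. A sketch (the user agent and default delay are assumptions you should adjust):
from urllib.robotparser import RobotFileParser

def preflight(site, path, user_agent='*', default_delay=1):
    """Check robots.txt before scraping: returns (allowed, delay_in_seconds)."""
    parser = RobotFileParser()
    parser.set_url(f'{site}/robots.txt')
    parser.read()
    allowed = parser.can_fetch(user_agent, f'{site}{path}')
    delay = parser.crawl_delay(user_agent) or default_delay
    return allowed, delay

allowed, delay = preflight('https://example.com', '/products/')
print(allowed, delay)
Call it once before you start, then use the returned delay between requests.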
Example robots.txt Files
(These are simplified for illustration; live files change over time, so always check the real one.)
Example 1: Reddit
User-agent: *
Crawl-delay: 2
Disallow: /r/*/submit
Disallow: /r/*/message
What it means:
- 2 second delay
- Can scrape posts
- Can't scrape submit forms or messages
Example 2: Amazon
User-agent: *
Disallow: /ap/
Disallow: /gp/cart/
Allow: /product/
What it means:
- Can scrape product pages
- Can't scrape login or cart
Example 3: Stack Overflow
User-agent: *
Crawl-delay: 1
Disallow: /users/
What it means:
- 1 second delay
- Can scrape questions/answers
- Can't scrape user profiles
Simple Decision Tree
Should I scrape this website?
Check robots.txt
|
├─ Says "Disallow: /"
| → Don't scrape. Look for API.
|
├─ Says "Disallow: /my-section/"
| → Don't scrape that section. Try others.
|
├─ Says nothing about my section
| → You can scrape! But add delays.
|
└─ 404 (no robots.txt)
→ You can scrape, but be extra polite.
Summary
What is robots.txt?
A file that tells you what you can and cannot scrape.
Where to find it:
https://website.com/robots.txt
Key things to look for:
- Disallow: (what you can't scrape)
- Allow: (what you can scrape)
- Crawl-delay: (how long to wait)
In Scrapy:
ROBOTSTXT_OBEY = True
Remember:
- Always check robots.txt first
- Respect what it says
- Add delays between requests
- Be polite to servers
- If in doubt, don't scrape
The golden rule:
Treat websites the way you'd want your website to be treated.
Practice Exercise
Try this right now:
- Pick 3 websites you use regularly
- Visit their robots.txt files
- Read what they allow/disallow
- Notice the patterns
Examples to try:
- https://www.amazon.com/robots.txt
- https://www.reddit.com/robots.txt
- https://www.stackoverflow.com/robots.txt
See what real websites say about scraping!
Final Thoughts
robots.txt is not complicated. It's just a simple text file with simple rules.
But respecting it is what separates professional scrapers from amateur ones.
Good scrapers:
- Check robots.txt
- Respect the rules
- Add appropriate delays
- Don't get banned
Bad scrapers:
- Ignore robots.txt
- Scrape everything fast
- Get banned quickly
- Give scraping a bad name
Be a good scraper. Always start with robots.txt.
Happy (and ethical) scraping! 🕷️