When I first started with Scrapy, it felt like magic. I wrote a spider, ran it, and data appeared. But I had no idea what was happening behind the scenes.
Then my spider broke. I couldn't debug it because I didn't understand how Scrapy actually worked. After finally learning the internal flow, everything clicked. Debugging became easy.
Let me explain how Scrapy works in the simplest way possible, so you understand what's really happening.
The Big Picture (Super Simple)
Imagine you're at a restaurant:
- You (the spider) tell the waiter (Scrapy engine) what you want
- Waiter writes it down on a list (scheduler)
- Waiter takes the order to the kitchen (downloader)
- Kitchen makes your food and gives it to waiter
- Waiter brings food to you
- You eat it and maybe order more
That's basically how Scrapy works!
Let's break it down step by step.
The Main Parts of Scrapy
Before we follow a request, understand the main parts:
1. Spider (You)
- Your code
- Decides what to scrape
- Processes the data
2. Engine (The Manager)
- Coordinates everything
- Tells everyone what to do
- Makes sure things run smoothly
3. Scheduler (The Queue)
- Keeps a list of URLs to scrape
- Decides what to scrape next
- Removes duplicates
4. Downloader (The Internet Fetcher)
- Downloads web pages
- Handles the actual HTTP requests
- Gets the HTML for you
5. Pipelines (Data Processors)
- Clean your data
- Save to database or file
- Validate items
That's it. Just 5 main parts.
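The engine, scheduler, and downloader are built into Scrapy itself; you write the spider and the pipelines and wire them together in settings.py. Here's a minimal sketch of that wiring, assuming a hypothetical project called "myproject" (the names are just placeholders):

# settings.py in a hypothetical project called "myproject"
BOT_NAME = 'myproject'

# Spiders live in myproject/spiders/, pipelines in myproject/pipelines.py.
# Register pipelines here; the number controls the order they run in.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}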
Step-by-Step: What Happens When You Run Scrapy
Let's follow one single request from start to finish.
Step 1: You Write a Spider
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}
Simple spider that scrapes a title.
Step 2: You Run the Spider
scrapy crawl myspider
Now watch what happens inside Scrapy...
The Journey of Your Request
Stage 1: Spider Creates Request
Your spider says: "I want to scrape https://example.com"
start_urls = ['https://example.com']
Scrapy converts this to a Request object:
Request(url='https://example.com', callback=self.parse)
What this means:
- URL to fetch: https://example.com
- What to do with the response: call the parse method
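You can do this conversion yourself by overriding start_requests, which is roughly what Scrapy does with start_urls behind the scenes:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # The same Request object Scrapy would build from start_urls
        yield scrapy.Request(url='https://example.com', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}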
Stage 2: Engine Takes Request
The Engine receives the request and says: "Okay, got a new request!"
Stage 3: Engine Gives Request to Scheduler
Engine tells Scheduler: "Add this to the queue"
Scheduler says: "Added! This is request #1 in the queue"
What Scheduler does:
- Stores the request
- Checks if it's a duplicate (if so, ignores it)
- Puts it in line to be processed
Stage 4: Engine Asks Scheduler for Next Request
Engine asks: "What's the next request to process?"
Scheduler says: "Here you go, https://example.com"
Stage 5: Engine Sends Request to Downloader
Engine tells Downloader: "Go get this page for me"
Stage 6: Downloader Fetches the Page
Downloader goes to the internet:
- Makes HTTP request to https://example.com
- Waits for response
- Downloads the HTML
This is where the actual internet connection happens!
Stage 7: Downloader Returns Response
Downloader says: "Got it! Here's the HTML"
Returns a Response object with:
- Status code (200, 404, etc.)
- HTML content
- Headers
- URL
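All of these are available directly on the response inside your callback:

def parse(self, response):
    print(response.status)                       # e.g. 200
    print(response.url)                          # the URL that was fetched
    print(response.headers.get('Content-Type'))  # a header value (as bytes)
    print(response.text[:100])                   # first 100 characters of the HTML
    yield {'title': response.css('h1::text').get()}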
Stage 8: Engine Sends Response to Spider
Engine tells your spider: "Here's the response you asked for"
Calls your parse method:
def parse(self, response):
    # You're here now!
    yield {'title': response.css('h1::text').get()}
Stage 9: Spider Processes Response
Your spider:
- Extracts the title
- Creates an item (dictionary)
- Yields it
yield {'title': 'Example Domain'}
If you wanted to scrape more pages, you'd also yield new requests:
yield {'title': response.css('h1::text').get()}
yield scrapy.Request('https://example.com/page2', callback=self.parse)
Stage 10: Engine Receives Items and Requests
Engine receives what you yielded:
- Items go to pipelines
- New requests go back to scheduler (Stage 3)
Stage 11: Pipelines Process Items
If you have pipelines, they process each item:
class MyPipeline:
    def process_item(self, item, spider):
        # Clean or save item
        return item
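A slightly fuller (but still minimal) sketch: a pipeline that writes every item to a JSON-lines file. The filename is just an example, and the class needs to be registered in ITEM_PIPELINES to run:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.file.write(json.dumps(dict(item)) + '\n')
        return item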
Stage 12: Item is Saved/Exported
Item gets saved to:
- A JSON file (if you used -o output.json)
- A database (if you have a database pipeline)
- Whatever your pipeline does
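If you'd rather configure the export in code than on the command line, newer Scrapy versions (2.1+) support a FEEDS setting; the filename here is just an example:

# settings.py (Scrapy 2.1+): equivalent to passing -o output.json
FEEDS = {
    'output.json': {'format': 'json'},
}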
Stage 13: Repeat Until Done
If there are more requests in the queue, go back to Stage 4.
If queue is empty, spider closes.
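If you want to see that moment yourself, you can add an optional closed() method to your spider; Scrapy calls it once when the spider shuts down:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

    def closed(self, reason):
        # Called once at shutdown; reason is 'finished' when the queue ran dry
        self.logger.info('Spider closed because: %s', reason)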
Visual Flow Diagram (Simple Version)
YOU (Spider)
|
| Creates Request
↓
ENGINE (Manager)
|
| Sends to Queue
↓
SCHEDULER (Queue)
|
| Returns Next Request
↓
ENGINE
|
| Sends to Downloader
↓
DOWNLOADER (Internet)
|
| Fetches Page
↓
ENGINE
|
| Sends Response
↓
YOU (Spider)
|
| Process & Extract Data
↓
ENGINE
|
| Sends Items
↓
PIPELINES (Save Data)
|
↓
FILE/DATABASE
Real Example: Following One Request
Let's trace a real spider step by step.
Your Spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract quotes
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get()
            }

        # Follow next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
What Happens (Step by Step)
1. You run: scrapy crawl quotes
2. Spider creates request:
Request: http://quotes.toscrape.com/
Callback: parse
3. Engine → Scheduler:
"Add this request to queue"
4. Scheduler:
Queue now has: [http://quotes.toscrape.com/]
5. Engine asks Scheduler:
"What's next?"
6. Scheduler replies:
"http://quotes.toscrape.com/"
7. Engine → Downloader:
"Go get this page"
8. Downloader:
- Connects to http://quotes.toscrape.com/
- Downloads HTML
- Returns Response
9. Engine → Spider (your parse method):
"Here's the response"
10. Your parse method runs:
def parse(self, response):
    # Finds 10 quotes on page
    for quote in response.css('.quote'):  # Loops 10 times
        yield {
            'text': 'Quote text here',
            'author': 'Author name'
        }

    # Finds next page link
    next_page = '/page/2/'
    yield response.follow(next_page, self.parse)
11. You yielded:
- 10 items (quotes)
- 1 new request (page 2)
12. Engine receives:
- 10 items → Send to pipelines
- 1 request → Send to scheduler
13. Items go to pipeline:
- Saved to output.json
14. New request goes to scheduler:
Queue now has: [http://quotes.toscrape.com/page/2/]
15. Repeat from step 5:
- Engine asks for next request
- Gets page 2
- Downloads it
- Spider processes it
- Finds 10 more quotes and page 3
- And so on...
Why Understanding This Matters
Problem 1: Spider Not Following Links
You think: "My spider doesn't work!"
Real reason: You forgot to yield the request
# WRONG (doesn't follow link)
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    # Forgot to yield!

# RIGHT
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)  # This!
Now you know: Requests must be yielded to get added to queue.
Problem 2: Items Not Saving
You think: "Items disappear!"
Real reason: You didn't yield them
# WRONG
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    # Forgot to yield!

# RIGHT
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    yield item  # This!
Now you know: Items must be yielded to go through pipelines.
Problem 3: Duplicate Requests
You think: "Some pages scraped twice!"
Real reason: Used dont_filter=True
# Creates duplicates
yield scrapy.Request(url, dont_filter=True)
# Scheduler filters duplicates automatically
yield scrapy.Request(url)
Now you know: Scheduler automatically removes duplicates (unless you tell it not to).
Common Questions Answered
Q: Why can't I just use the requests library?
Answer: You can! But Scrapy handles:
- Queue management (what to scrape next)
- Duplicate filtering (don't scrape same URL twice)
- Concurrent requests (scrape multiple pages at once)
- Retry logic (if download fails)
- Rate limiting (don't overwhelm server)
You'd have to code all this yourself with requests.
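For comparison, here's a rough sketch of just the queueing, duplicate filtering, retrying, and rate limiting you'd have to hand-roll with requests (no concurrency, and the parsing part is left out):

import time
from collections import deque

import requests

start_urls = ['http://quotes.toscrape.com/']
queue = deque(start_urls)   # the "scheduler"
seen = set(start_urls)      # the duplicate filter
max_retries = 3

while queue:
    url = queue.popleft()
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            break
        except requests.RequestException:
            time.sleep(1)   # crude retry delay
    else:
        continue  # give up on this URL after max_retries failures

    # ... parse response.text, extract items, and push any new URLs
    # onto the queue yourself (checking `seen` first) ...
    time.sleep(0.5)  # crude rate limiting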
Q: How does Scrapy scrape multiple pages at once?
Answer: The downloader can fetch many pages simultaneously.
Settings:
CONCURRENT_REQUESTS = 16 # 16 pages at the same time
While one page is downloading, Scrapy can be working on 15 others at the same time!
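A few related settings you'll often tune together (the values are just examples):

# settings.py
CONCURRENT_REQUESTS = 16            # total simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit per website
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site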
Q: How does Scrapy know not to scrape the same URL twice?
Answer: The scheduler keeps track:
- You yield Request(url='https://example.com/page1')
- Scheduler adds to queue and remembers it
- Later, you yield same URL again
- Scheduler says "Already seen this one!" and ignores it
Q: What if I WANT to scrape the same URL twice?
Answer: Tell scheduler not to filter:
yield scrapy.Request(url, dont_filter=True)
Now scheduler won't check for duplicates.
Q: Why does my spider stop?
Answer: Spider stops when:
- Queue is empty (no more requests)
- You close it manually
- Too many errors occur (if you've set CLOSESPIDER_ERRORCOUNT)
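Closing it manually looks like this: raise CloseSpider from inside a callback when some condition you care about is met (the condition here is just an example):

import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        if not response.css('h1'):
            # Nothing left to scrape on this page, shut the spider down
            raise CloseSpider('no more content')
        yield {'title': response.css('h1::text').get()}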
The Complete Flow (All Together)
Let's see everything at once:
1. START
- Spider creates initial requests from start_urls
2. SCHEDULING
- Engine sends requests to scheduler
- Scheduler queues them
- Scheduler removes duplicates
3. DOWNLOADING
- Engine asks scheduler for next request
- Sends request to downloader
- Downloader fetches page from internet
4. PROCESSING
- Downloader returns response to engine
- Engine sends response to spider
- Spider's callback method runs
5. YIELDING
- Spider yields items (go to pipelines)
- Spider yields new requests (go to scheduler)
6. SAVING
- Items go through pipelines
- Pipelines clean, validate, save items
7. REPEAT
- If queue has requests, go to step 3
- If queue empty, spider closes
DONE!
Watching It Happen (Live)
Want to see this in action? Run with verbose logging:
scrapy crawl myspider --loglevel=DEBUG
You'll see:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
[scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com>
{'title': 'Example Domain'}
This shows:
- Engine crawled a URL
- Status code 200 (success)
- Scraper extracted an item
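If you don't want to pass the flag every time, you can set the default level in settings.py instead:

# settings.py
LOG_LEVEL = 'DEBUG'   # or 'INFO' for quieter runs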
Simple Mental Model
Think of Scrapy like a factory:
You (Spider): The worker who knows what to do
Engine: The manager who coordinates
Scheduler: The to-do list
Downloader: The person who fetches materials
Pipelines: Quality control and packaging
Flow:
- You tell manager what you need
- Manager adds to to-do list
- Manager asks person to fetch it
- Person brings it to you
- You process it and create products
- Products go to quality control
- Products get packaged and shipped
Same concept!
Practice Exercise
Try this and watch the flow:
import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Print to see when this runs
        print(f"Processing: {response.url}")

        # Extract one quote
        quote = response.css('.quote').get()
        if quote:
            print("Found a quote!")
            yield {'text': 'quote text'}

        # Don't follow more pages (keep it simple)
Run it:
scrapy crawl simple
Watch the output. You'll see:
- Spider opens
- Engine starts
- Request sent
- Response received
- Your print statements
- Item scraped
- Spider closes
This shows you the flow in real-time!
Summary
Main parts:
- Spider (your code)
- Engine (manager)
- Scheduler (queue)
- Downloader (fetches pages)
- Pipelines (process items)
Flow:
- Spider creates requests
- Engine sends to scheduler
- Scheduler queues them
- Downloader fetches pages
- Spider processes responses
- Spider yields items and new requests
- Items go to pipelines
- New requests go to scheduler
- Repeat until queue empty
Remember:
- You must YIELD items (or they disappear)
- You must YIELD requests (or links not followed)
- Scheduler removes duplicates automatically
- Everything is coordinated by the engine
That's it!
You don't need to memorize every detail. Just understand:
- Requests go in a queue
- Downloader fetches them
- Your spider processes them
- Items get saved
This mental model will help you debug problems and understand what's happening when you run your spiders.
What to Do Next
1. Run a spider with DEBUG logging:
scrapy crawl myspider --loglevel=DEBUG
Watch the flow in action.
2. Add print statements:
def parse(self, response):
    print("I'm processing:", response.url)
    yield {'data': 'something'}
    print("I yielded an item!")
See when your code actually runs.
3. Experiment:
- Try yielding requests
- Try NOT yielding requests
- See what happens!
The best way to understand is to experiment and watch what happens.
Happy scraping! 🕷️