When I first started with Scrapy, it felt like magic. I wrote a spider, ran it, and data appeared. But I had no idea what was happening behind the scenes.
Then my spider broke. I couldn't debug it because I didn't understand how Scrapy actually worked. After finally learning the internal flow, everything clicked. Debugging became easy.
Let me explain how Scrapy works in the simplest way possible, so you understand what's really happening.
The Big Picture (Super Simple)
Imagine you're at a restaurant:
- You (the spider) tell the waiter (Scrapy engine) what you want
- Waiter writes it down on a list (scheduler)
- Waiter takes the order to the kitchen (downloader)
- Kitchen makes your food and gives it to waiter
- Waiter brings food to you
- You eat it and maybe order more
That's basically how Scrapy works!
Let's break it down step by step.
The Main Parts of Scrapy
Before we follow a request, understand the main parts:
1. Spider (You)
- Your code
- Decides what to scrape
- Processes the data
2. Engine (The Manager)
- Coordinates everything
- Tells everyone what to do
- Makes sure things run smoothly
3. Scheduler (The Queue)
- Keeps a list of URLs to scrape
- Decides what to scrape next
- Removes duplicates
4. Downloader (The Internet Fetcher)
- Downloads web pages
- Handles the actual HTTP requests
- Gets the HTML for you
5. Pipelines (Data Processors)
- Clean your data
- Save to database or file
- Validate items
That's it. Just 5 main parts.
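The engine, scheduler, and downloader are built into Scrapy itself; you write the spider and the pipelines and wire them together in settings.py. Here's a minimal sketch of that wiring, assuming a hypothetical project called "myproject" (the names are just placeholders):

# settings.py in a hypothetical project called "myproject"
BOT_NAME = 'myproject'

# Spiders live in myproject/spiders/, pipelines in myproject/pipelines.py.
# Register pipelines here; the number controls the order they run in.
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,
}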
Step-by-Step: What Happens When You Run Scrapy
Let's follow one single request from start to finish.
Step 1: You Write a Spider
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}
Simple spider that scrapes a title.
Step 2: You Run the Spider
scrapy crawl myspider
Now watch what happens inside Scrapy...
The Journey of Your Request
Stage 1: Spider Creates Request
Your spider says: "I want to scrape https://example.com"
start_urls = ['https://example.com']
Scrapy converts this to a Request object:
Request(url='https://example.com', callback=self.parse)
What this means:
- URL to fetch: https://example.com
- What to do with the response: call the parse method
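You can do this conversion yourself by overriding start_requests, which is roughly what Scrapy does with start_urls behind the scenes:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # The same Request object Scrapy would build from start_urls
        yield scrapy.Request(url='https://example.com', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}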
Stage 2: Engine Takes Request
The Engine receives the request and says: "Okay, got a new request!"
Stage 3: Engine Gives Request to Scheduler
Engine tells Scheduler: "Add this to the queue"
Scheduler says: "Added! This is request #1 in the queue"
What Scheduler does:
- Stores the request
- Checks if it's a duplicate (if so, ignores it)
- Puts it in line to be processed
Stage 4: Engine Asks Scheduler for Next Request
Engine asks: "What's the next request to process?"
Scheduler says: "Here you go, https://example.com"
Stage 5: Engine Sends Request to Downloader
Engine tells Downloader: "Go get this page for me"
Stage 6: Downloader Fetches the Page
Downloader goes to the internet:
- Makes HTTP request to https://example.com
- Waits for response
- Downloads the HTML
This is where the actual internet connection happens!
Stage 7: Downloader Returns Response
Downloader says: "Got it! Here's the HTML"
Returns a Response object with:
- Status code (200, 404, etc.)
- HTML content
- Headers
- URL
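All of these are available directly on the response inside your callback:

def parse(self, response):
    print(response.status)                       # e.g. 200
    print(response.url)                          # the URL that was fetched
    print(response.headers.get('Content-Type'))  # a header value (as bytes)
    print(response.text[:100])                   # first 100 characters of the HTML
    yield {'title': response.css('h1::text').get()}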
Stage 8: Engine Sends Response to Spider
Engine tells your spider: "Here's the response you asked for"
Calls your parse method:
def parse(self, response):
    # You're here now!
    yield {'title': response.css('h1::text').get()}
Stage 9: Spider Processes Response
Your spider:
- Extracts the title
- Creates an item (dictionary)
- Yields it
yield {'title': 'Example Domain'}
If you wanted to scrape more pages, you'd also yield new requests:
yield {'title': response.css('h1::text').get()}
yield scrapy.Request('https://example.com/page2', callback=self.parse)
Stage 10: Engine Receives Items and Requests
Engine receives what you yielded:
- Items go to pipelines
- New requests go back to scheduler (Stage 3)
Stage 11: Pipelines Process Items
If you have pipelines, they process each item:
class MyPipeline:
    def process_item(self, item, spider):
        # Clean or save item
        return item
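A slightly fuller (but still minimal) sketch: a pipeline that writes every item to a JSON-lines file. The filename is just an example, and the class needs to be registered in ITEM_PIPELINES to run:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.file.write(json.dumps(dict(item)) + '\n')
        return item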
Stage 12: Item is Saved/Exported
Item gets saved to:
- A JSON file (if you used -o output.json)
- A database (if you have a database pipeline)
- Whatever your pipeline does
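If you'd rather configure the export in code than on the command line, newer Scrapy versions (2.1+) support a FEEDS setting; the filename here is just an example:

# settings.py (Scrapy 2.1+): equivalent to passing -o output.json
FEEDS = {
    'output.json': {'format': 'json'},
}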
Stage 13: Repeat Until Done
If there are more requests in the queue, go back to Stage 4.
If queue is empty, spider closes.
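If you want to see that moment yourself, you can add an optional closed() method to your spider; Scrapy calls it once when the spider shuts down:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

    def closed(self, reason):
        # Called once at shutdown; reason is 'finished' when the queue ran dry
        self.logger.info('Spider closed because: %s', reason)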
Visual Flow Diagram (Simple Version)
YOU (Spider)
|
| Creates Request
↓
ENGINE (Manager)
|
| Sends to Queue
↓
SCHEDULER (Queue)
|
| Returns Next Request
↓
ENGINE
|
| Sends to Downloader
↓
DOWNLOADER (Internet)
|
| Fetches Page
↓
ENGINE
|
| Sends Response
↓
YOU (Spider)
|
| Process & Extract Data
↓
ENGINE
|
| Sends Items
↓
PIPELINES (Save Data)
|
↓
FILE/DATABASE
Real Example: Following One Request
Let's trace a real spider step by step.
Your Spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract quotes
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get()
            }

        # Follow next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
What Happens (Step by Step)
1. You run: scrapy crawl quotes
2. Spider creates request:
Request: http://quotes.toscrape.com/
Callback: parse
3. Engine → Scheduler:
"Add this request to queue"
4. Scheduler:
Queue now has: [http://quotes.toscrape.com/]
5. Engine asks Scheduler:
"What's next?"
6. Scheduler replies:
"http://quotes.toscrape.com/"
7. Engine → Downloader:
"Go get this page"
8. Downloader:
- Connects to http://quotes.toscrape.com/
- Downloads HTML
- Returns Response
9. Engine → Spider (your parse method):
"Here's the response"
10. Your parse method runs:
def parse(self, response):
    # Finds 10 quotes on page
    for quote in response.css('.quote'):  # Loops 10 times
        yield {
            'text': 'Quote text here',
            'author': 'Author name'
        }

    # Finds next page link
    next_page = '/page/2/'
    yield response.follow(next_page, self.parse)
11. You yielded:
- 10 items (quotes)
- 1 new request (page 2)
12. Engine receives:
- 10 items → Send to pipelines
- 1 request → Send to scheduler
13. Items go to pipeline:
- Saved to output.json
14. New request goes to scheduler:
Queue now has: [http://quotes.toscrape.com/page/2/]
15. Repeat from step 5:
- Engine asks for next request
- Gets page 2
- Downloads it
- Spider processes it
- Finds 10 more quotes and page 3
- And so on...
Why Understanding This Matters
Problem 1: Spider Not Following Links
You think: "My spider doesn't work!"
Real reason: You forgot to yield the request
# WRONG (doesn't follow link)
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    # Forgot to yield!

# RIGHT
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)  # This!
Now you know: Requests must be yielded to get added to queue.
Problem 2: Items Not Saving
You think: "Items disappear!"
Real reason: You didn't yield them
# WRONG
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    # Forgot to yield!

# RIGHT
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    yield item  # This!
Now you know: Items must be yielded to go through pipelines.
Problem 3: Duplicate Requests
You think: "Some pages scraped twice!"
Real reason: Used dont_filter=True
# Creates duplicates
yield scrapy.Request(url, dont_filter=True)
# Scheduler filters duplicates automatically
yield scrapy.Request(url)
Now you know: Scheduler automatically removes duplicates (unless you tell it not to).
Common Questions Answered
Q: Why can't I just use the requests library?
Answer: You can! But Scrapy handles:
- Queue management (what to scrape next)
- Duplicate filtering (don't scrape same URL twice)
- Concurrent requests (scrape multiple pages at once)
- Retry logic (if download fails)
- Rate limiting (don't overwhelm server)
You'd have to code all this yourself with requests.
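For comparison, here's a rough sketch of just the queueing, duplicate filtering, retrying, and rate limiting you'd have to hand-roll with requests (no concurrency, and the parsing part is left out):

import time
from collections import deque

import requests

start_urls = ['http://quotes.toscrape.com/']
queue = deque(start_urls)   # the "scheduler"
seen = set(start_urls)      # the duplicate filter
max_retries = 3

while queue:
    url = queue.popleft()
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            break
        except requests.RequestException:
            time.sleep(1)   # crude retry delay
    else:
        continue  # give up on this URL after max_retries failures

    # ... parse response.text, extract items, and push any new URLs
    # onto the queue yourself (checking `seen` first) ...
    time.sleep(0.5)  # crude rate limiting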
Q: How does Scrapy scrape multiple pages at once?
Answer: The downloader can fetch many pages simultaneously.
Settings:
CONCURRENT_REQUESTS = 16 # 16 pages at the same time
While one page is downloading, Scrapy can be working on 15 others at the same time!
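A few related settings you'll often tune together (the values are just examples):

# settings.py
CONCURRENT_REQUESTS = 16            # total simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # limit per website
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site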
Q: How does Scrapy know not to scrape the same URL twice?
Answer: The scheduler keeps track:
- You yield Request(url='https://example.com/page1')
- Scheduler adds to queue and remembers it
- Later, you yield same URL again
- Scheduler says "Already seen this one!" and ignores it
Q: What if I WANT to scrape the same URL twice?
Answer: Tell scheduler not to filter:
yield scrapy.Request(url, dont_filter=True)
Now scheduler won't check for duplicates.
Q: Why does my spider stop?
Answer: Spider stops when:
- Queue is empty (no more requests)
- You close it manually
- Too many errors occur (if you've set CLOSESPIDER_ERRORCOUNT)
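Closing it manually looks like this: raise CloseSpider from inside a callback when some condition you care about is met (the condition here is just an example):

import scrapy
from scrapy.exceptions import CloseSpider

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        if not response.css('h1'):
            # Nothing left to scrape on this page, shut the spider down
            raise CloseSpider('no more content')
        yield {'title': response.css('h1::text').get()}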
The Complete Flow (All Together)
Let's see everything at once:
1. START
- Spider creates initial requests from start_urls
2. SCHEDULING
- Engine sends requests to scheduler
- Scheduler queues them
- Scheduler removes duplicates
3. DOWNLOADING
- Engine asks scheduler for next request
- Sends request to downloader
- Downloader fetches page from internet
4. PROCESSING
- Downloader returns response to engine
- Engine sends response to spider
- Spider's callback method runs
5. YIELDING
- Spider yields items (go to pipelines)
- Spider yields new requests (go to scheduler)
6. SAVING
- Items go through pipelines
- Pipelines clean, validate, save items
7. REPEAT
- If queue has requests, go to step 3
- If queue empty, spider closes
DONE!
Watching It Happen (Live)
Want to see this in action? Run with verbose logging:
scrapy crawl myspider --loglevel=DEBUG
You'll see:
[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
[scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com>
{'title': 'Example Domain'}
This shows:
- Engine crawled a URL
- Status code 200 (success)
- Scraper extracted an item
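If you don't want to pass the flag every time, you can set the default level in settings.py instead:

# settings.py
LOG_LEVEL = 'DEBUG'   # or 'INFO' for quieter runs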
Simple Mental Model
Think of Scrapy like a factory:
You (Spider): The worker who knows what to do
Engine: The manager who coordinates
Scheduler: The to-do list
Downloader: The person who fetches materials
Pipelines: Quality control and packaging
Flow:
- You tell manager what you need
- Manager adds to to-do list
- Manager asks person to fetch it
- Person brings it to you
- You process it and create products
- Products go to quality control
- Products get packaged and shipped
Same concept!
Practice Exercise
Try this and watch the flow:
import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Print to see when this runs
        print(f"Processing: {response.url}")

        # Extract one quote
        quote = response.css('.quote').get()
        if quote:
            print("Found a quote!")
            yield {'text': 'quote text'}

        # Don't follow more pages (keep it simple)
Run it:
scrapy crawl simple
Watch the output. You'll see:
- Spider opens
- Engine starts
- Request sent
- Response received
- Your print statements
- Item scraped
- Spider closes
This shows you the flow in real-time!
Summary
Main parts:
- Spider (your code)
- Engine (manager)
- Scheduler (queue)
- Downloader (fetches pages)
- Pipelines (process items)
Flow:
- Spider creates requests
- Engine sends to scheduler
- Scheduler queues them
- Downloader fetches pages
- Spider processes responses
- Spider yields items and new requests
- Items go to pipelines
- New requests go to scheduler
- Repeat until queue empty
Remember:
- You must YIELD items (or they disappear)
- You must YIELD requests (or links not followed)
- Scheduler removes duplicates automatically
- Everything is coordinated by the engine
That's it!
You don't need to memorize every detail. Just understand:
- Requests go in a queue
- Downloader fetches them
- Your spider processes them
- Items get saved
This mental model will help you debug problems and understand what's happening when you run your spiders.
What to Do Next
1. Run a spider with DEBUG logging:
scrapy crawl myspider --loglevel=DEBUG
Watch the flow in action.
2. Add print statements:
def parse(self, response):
    print("I'm processing:", response.url)
    yield {'data': 'something'}
    print("I yielded an item!")
See when your code actually runs.
3. Experiment:
- Try yielding requests
- Try NOT yielding requests
- See what happens!
The best way to understand is to experiment and watch what happens.
Happy scraping! 🕷️