Muhammad Ikramullah Khan
How Scrapy Actually Works: A Beginner's Guide

When I first started with Scrapy, it felt like magic. I wrote a spider, ran it, and data appeared. But I had no idea what was happening behind the scenes.

Then my spider broke, and I couldn't debug it because I didn't understand how Scrapy actually worked. Once I finally learned the internal flow, everything clicked and debugging became easy.

Let me explain how Scrapy works in the simplest way possible, so you understand what's really happening.


The Big Picture (Super Simple)

Imagine you're at a restaurant:

  1. You (the spider) tell the waiter (Scrapy engine) what you want
  2. Waiter writes it down on a list (scheduler)
  3. Waiter takes the order to the kitchen (downloader)
  4. Kitchen makes your food and gives it to waiter
  5. Waiter brings food to you
  6. You eat it and maybe order more

That's basically how Scrapy works!

Let's break it down step by step.


The Main Parts of Scrapy

Before we follow a request through the system, let's get to know the main parts:

1. Spider (You)

  • Your code
  • Decides what to scrape
  • Processes the data

2. Engine (The Manager)

  • Coordinates everything
  • Tells everyone what to do
  • Makes sure things run smoothly

3. Scheduler (The Queue)

  • Keeps a list of URLs to scrape
  • Decides what to scrape next
  • Removes duplicates

4. Downloader (The Internet Fetcher)

  • Downloads web pages
  • Handles the actual HTTP requests
  • Gets the HTML for you

5. Pipelines (Data Processors)

  • Clean your data
  • Save to database or file
  • Validate items

That's it. Just 5 main parts.
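
The spider and the pipelines are the only pieces you write yourself; the engine, scheduler, and downloader ship with Scrapy. Pipelines get switched on in settings.py. A minimal sketch, assuming a project named myproject with a MyPipeline class in pipelines.py (both names are just placeholders):

# settings.py (hypothetical project called "myproject")
ITEM_PIPELINES = {
    'myproject.pipelines.MyPipeline': 300,  # lower number = runs earlier
}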


Step-by-Step: What Happens When You Run Scrapy

Let's follow one single request from start to finish.

Step 1: You Write a Spider

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}

Simple spider that scrapes a title.

Step 2: You Run the Spider

scrapy crawl myspider

Now watch what happens inside Scrapy...


The Journey of Your Request

Stage 1: Spider Creates Request

Your spider says: "I want to scrape https://example.com"

start_urls = ['https://example.com']

Scrapy converts this to a Request object:

Request(url='https://example.com', callback=self.parse)

What this means: go fetch https://example.com, and when the response comes back, hand it to my parse method.
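
If you'd like to see that Request spelled out, you can skip start_urls and build it yourself in start_requests. A sketch of the same spider, written the explicit way:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        # Build the initial Request by hand instead of relying on start_urls
        yield scrapy.Request(url='https://example.com', callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('h1::text').get()}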

Stage 2: Engine Takes Request

The Engine receives the request and says: "Okay, got a new request!"

Stage 3: Engine Gives Request to Scheduler

Engine tells Scheduler: "Add this to the queue"

Scheduler says: "Added! This is request #1 in the queue"

What Scheduler does:

  • Stores the request
  • Checks if it's a duplicate (if so, ignores it)
  • Puts it in line to be processed
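
To make those three jobs concrete, here's a toy model of the idea. This is not Scrapy's real scheduler (the real one fingerprints whole Request objects, not bare URLs), just a queue plus a "seen" set:

from collections import deque

class ToyScheduler:
    def __init__(self):
        self.queue = deque()   # requests waiting to be downloaded
        self.seen = set()      # everything already accepted

    def enqueue_request(self, url):
        if url in self.seen:
            return False       # duplicate: dropped silently
        self.seen.add(url)
        self.queue.append(url)
        return True

    def next_request(self):
        return self.queue.popleft() if self.queue else None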

Stage 4: Engine Asks Scheduler for Next Request

Engine asks: "What's the next request to process?"

Scheduler says: "Here you go, https://example.com"

Stage 5: Engine Sends Request to Downloader

Engine tells Downloader: "Go get this page for me"

Stage 6: Downloader Fetches the Page

Downloader goes to the internet, sends the HTTP request, and waits for the server to return the HTML.

This is where the actual internet connection happens!

Stage 7: Downloader Returns Response

Downloader says: "Got it! Here's the HTML"

Returns a Response object with:

  • Status code (200, 404, etc.)
  • HTML content
  • Headers
  • URL
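
All of these are attributes you can read inside your callback. A quick sketch (note that header lookups return bytes in Scrapy):

def parse(self, response):
    print(response.status)                       # e.g. 200
    print(response.url)                          # final URL, after any redirects
    print(response.headers.get('Content-Type'))  # e.g. b'text/html; charset=utf-8'
    print(response.text[:100])                   # first 100 characters of the HTML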

Stage 8: Engine Sends Response to Spider

Engine tells your spider: "Here's the response you asked for"

Calls your parse method:

def parse(self, response):
    # You're here now!
    yield {'title': response.css('h1::text').get()}

Stage 9: Spider Processes Response

Your spider:

  • Extracts the title
  • Creates an item (dictionary)
  • Yields it

yield {'title': 'Example Domain'}

If you wanted to scrape more pages, you'd also yield new requests:

yield {'title': response.css('h1::text').get()}
yield scrapy.Request('https://example.com/page2', callback=self.parse)

Stage 10: Engine Receives Items and Requests

Engine receives what you yielded:

  • Items go to pipelines
  • New requests go back to scheduler (Stage 3)

Stage 11: Pipelines Process Items

If you have pipelines, they process each item:

class MyPipeline:
    def process_item(self, item, spider):
        # Clean or save item
        return item
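
A slightly more realistic pipeline might write every item to a JSON Lines file using the open_spider/close_spider hooks. This is a sketch (the items.jl filename is arbitrary); enable it in ITEM_PIPELINES like any other pipeline:

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # Runs once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Runs once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        # Runs for every item the spider yields
        self.file.write(json.dumps(dict(item)) + '\n')
        return item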

Stage 12: Item is Saved/Exported

Item gets saved to:

  • JSON file (if you used -o output.json)
  • Database (if you have database pipeline)
  • Whatever your pipeline does
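
The simplest route is the built-in feed export, which needs no pipeline at all:

scrapy crawl myspider -o output.json

Swap the extension for .csv or .jl and Scrapy picks the matching format.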

Stage 13: Repeat Until Done

If there are more requests in the queue, go back to Stage 4.

If queue is empty, spider closes.


Visual Flow Diagram (Simple Version)

YOU (Spider)
    |
    | Creates Request
    ↓
ENGINE (Manager)
    |
    | Sends to Queue
    ↓
SCHEDULER (Queue)
    |
    | Returns Next Request
    ↓
ENGINE
    |
    | Sends to Downloader
    ↓
DOWNLOADER (Internet)
    |
    | Fetches Page
    ↓
ENGINE
    |
    | Sends Response
    ↓
YOU (Spider)
    |
    | Process & Extract Data
    ↓
ENGINE
    |
    | Sends Items
    ↓
PIPELINES (Save Data)
    |
    ↓
FILE/DATABASE

Real Example: Following One Request

Let's trace a real spider step by step.

Your Spider

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Extract quotes
        for quote in response.css('.quote'):
            yield {
                'text': quote.css('.text::text').get(),
                'author': quote.css('.author::text').get()
            }

        # Follow next page
        next_page = response.css('.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

What Happens (Step by Step)

1. You run: scrapy crawl quotes

2. Spider creates request:

Request: http://quotes.toscrape.com/
Callback: parse

3. Engine → Scheduler:
"Add this request to queue"

4. Scheduler:
Queue now has: [http://quotes.toscrape.com/]

5. Engine asks Scheduler:
"What's next?"

6. Scheduler replies:
"http://quotes.toscrape.com/"

7. Engine → Downloader:
"Go get this page"

8. Downloader:
Fetches http://quotes.toscrape.com/ and hands the HTML back to the engine

9. Engine → Spider (your parse method):
"Here's the response"

10. Your parse method runs:

def parse(self, response):
    # Finds 10 quotes on page
    for quote in response.css('.quote'):  # Loops 10 times
        yield {
            'text': 'Quote text here',
            'author': 'Author name'
        }

    # Finds next page link
    next_page = '/page/2/'
    yield response.follow(next_page, self.parse)

11. You yielded:

  • 10 items (quotes)
  • 1 new request (page 2)

12. Engine receives:

  • 10 items → Send to pipelines
  • 1 request → Send to scheduler

13. Items go to pipeline:

  • Saved to output.json

14. New request goes to scheduler:
Queue now has: [http://quotes.toscrape.com/page/2/]

15. Repeat from step 5:

  • Engine asks for next request
  • Gets page 2
  • Downloads it
  • Spider processes it
  • Finds 10 more quotes and page 3
  • And so on...

Why Understanding This Matters

Problem 1: Spider Not Following Links

You think: "My spider doesn't work!"

Real reason: You forgot to yield the request

# WRONG (doesn't follow link)
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    # Forgot to yield!

# RIGHT
def parse(self, response):
    next_page = response.css('.next::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)  # This!

Now you know: Requests must be yielded to get added to queue.

Problem 2: Items Not Saving

You think: "Items disappear!"

Real reason: You didn't yield them

# WRONG
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    # Forgot to yield!

# RIGHT
def parse(self, response):
    item = {'title': response.css('h1::text').get()}
    yield item  # This!

Now you know: Items must be yielded to go through pipelines.

Problem 3: Duplicate Requests

You think: "Some pages scraped twice!"

Real reason: Used dont_filter=True

# Creates duplicates
yield scrapy.Request(url, dont_filter=True)

# Scheduler filters duplicates automatically
yield scrapy.Request(url)

Now you know: Scheduler automatically removes duplicates (unless you tell it not to).


Common Questions Answered

Q: Why can't I just use requests library?

Answer: You can! But Scrapy handles:

  • Queue management (what to scrape next)
  • Duplicate filtering (don't scrape same URL twice)
  • Concurrent requests (scrape multiple pages at once)
  • Retry logic (if download fails)
  • Rate limiting (don't overwhelm server)

You'd have to code all this yourself with requests.
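
For contrast, here's roughly the bookkeeping you'd end up writing by hand with requests. It's only a sketch (no retries, no concurrency, no politeness delays), but it shows how much Scrapy quietly does for you:

import requests
from collections import deque

queue = deque(['http://quotes.toscrape.com/'])   # your own scheduler
seen = set()                                     # your own duplicate filter

while queue:
    url = queue.popleft()
    if url in seen:
        continue
    seen.add(url)
    resp = requests.get(url, timeout=10)         # one page at a time, no retry logic
    print(resp.status_code, url)
    # ...parse resp.text and append any new URLs to `queue` by hand...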

Q: How does Scrapy scrape multiple pages at once?

Answer: The downloader can fetch many pages simultaneously.

Settings:

CONCURRENT_REQUESTS = 16  # 16 pages at the same time

While one page is downloading, Scrapy can already be fetching up to 15 others!
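
A few related knobs live in settings.py. The values below are just examples to show the shape, not recommendations:

# settings.py
CONCURRENT_REQUESTS = 16            # total requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # cap for any single domain
DOWNLOAD_DELAY = 0.5                # seconds to wait between requests to the same site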

Q: How does Scrapy know not to scrape the same URL twice?

Answer: The scheduler keeps track:

  1. You yield Request(url='https://example.com/page1')
  2. Scheduler adds to queue and remembers it
  3. Later, you yield same URL again
  4. Scheduler says "Already seen this one!" and drops it

Q: What if I WANT to scrape the same URL twice?

Answer: Tell scheduler not to filter:

yield scrapy.Request(url, dont_filter=True)

Now scheduler won't check for duplicates.

Q: Why does my spider stop?

Answer: Spider stops when:

  • Queue is empty (no more requests)
  • You close it manually
  • Too many errors occur (the CLOSESPIDER_ERRORCOUNT limit is reached, if you've set one; see the settings sketch below)
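
The error-count limit comes from the built-in closespider extension, which can also stop the spider on page count, item count, or elapsed time. Example values only:

# settings.py -- optional stop conditions from the closespider extension
CLOSESPIDER_PAGECOUNT = 100     # stop after 100 responses
CLOSESPIDER_ITEMCOUNT = 500     # stop after 500 items
CLOSESPIDER_TIMEOUT = 3600      # stop after one hour
CLOSESPIDER_ERRORCOUNT = 10     # stop after 10 errors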

The Complete Flow (All Together)

Let's see everything at once:

1. START

  • Spider creates initial requests from start_urls

2. SCHEDULING

  • Engine sends requests to scheduler
  • Scheduler queues them
  • Scheduler removes duplicates

3. DOWNLOADING

  • Engine asks scheduler for next request
  • Sends request to downloader
  • Downloader fetches page from internet

4. PROCESSING

  • Downloader returns response to engine
  • Engine sends response to spider
  • Spider's callback method runs

5. YIELDING

  • Spider yields items (go to pipelines)
  • Spider yields new requests (go to scheduler)

6. SAVING

  • Items go through pipelines
  • Pipelines clean, validate, save items

7. REPEAT

  • If queue has requests, go to step 3
  • If queue empty, spider closes

DONE!


Watching It Happen (Live)

Want to see this in action? Run with verbose logging:

scrapy crawl myspider --loglevel=DEBUG

You'll see:

[scrapy.core.engine] DEBUG: Crawled (200) <GET https://example.com>
[scrapy.core.scraper] DEBUG: Scraped from <200 https://example.com>
{'title': 'Example Domain'}

This shows:

  • Engine crawled a URL
  • Status code 200 (success)
  • Scraper extracted an item

Simple Mental Model

Think of Scrapy like a factory:

You (Spider): The worker who knows what to do

Engine: The manager who coordinates

Scheduler: The to-do list

Downloader: The person who fetches materials

Pipelines: Quality control and packaging

Flow:

  1. You tell manager what you need
  2. Manager adds to to-do list
  3. Manager asks person to fetch it
  4. Person brings it to you
  5. You process it and create products
  6. Products go to quality control
  7. Products get packaged and shipped

Same concept!


Practice Exercise

Try this and watch the flow:

import scrapy

class SimpleSpider(scrapy.Spider):
    name = 'simple'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Print to see when this runs
        print(f"Processing: {response.url}")

        # Extract one quote
        quote = response.css('.quote').get()
        if quote:
            print("Found a quote!")
            yield {'text': 'quote text'}

        # Don't follow more pages (keep it simple)

Run it:

scrapy crawl simple

Watch the output. You'll see:

  1. Spider opens
  2. Engine starts
  3. Request sent
  4. Response received
  5. Your print statements
  6. Item scraped
  7. Spider closes

This shows you the flow in real-time!


Summary

Main parts:

  • Spider (your code)
  • Engine (manager)
  • Scheduler (queue)
  • Downloader (fetches pages)
  • Pipelines (process items)

Flow:

  1. Spider creates requests
  2. Engine sends to scheduler
  3. Scheduler queues them
  4. Downloader fetches pages
  5. Spider processes responses
  6. Spider yields items and new requests
  7. Items go to pipelines
  8. New requests go to scheduler
  9. Repeat until queue empty

Remember:

  • You must YIELD items (or they disappear)
  • You must YIELD requests (or links won't be followed)
  • Scheduler removes duplicates automatically
  • Everything is coordinated by the engine

That's it!

You don't need to memorize every detail. Just understand:

  • Requests go in a queue
  • Downloader fetches them
  • Your spider processes them
  • Items get saved

This mental model will help you debug problems and understand what's happening when you run your spiders.


What to Do Next

1. Run a spider with DEBUG logging:

scrapy crawl myspider --loglevel=DEBUG

Watch the flow in action.

2. Add print statements:

def parse(self, response):
    print("I'm processing:", response.url)
    yield {'data': 'something'}
    print("I yielded an item!")

See when your code actually runs.
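
Once you're past quick experiments, the spider's built-in logger is nicer than print, because its messages show up alongside the rest of the Scrapy log:

def parse(self, response):
    self.logger.info("Processing %s", response.url)  # lands in the normal Scrapy log output
    yield {'data': 'something'}
    self.logger.info("Yielded an item")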

3. Experiment:

  • Try yielding requests
  • Try NOT yielding requests
  • See what happens!

The best way to understand is to experiment and watch what happens.

Happy scraping! 🕷️
