When I first ran scrapy startproject, I stared at all those folders and files thinking, "What is all this stuff?"
If you're feeling the same way right now, don't worry. You're exactly where I was when I started.
Let me walk you through everything. We'll create a project together, look at each file, and I'll explain what matters and what you can ignore (at least for now). By the end of this, Scrapy's structure will make complete sense.
Sound good? Let's dive in.
Creating Your First Scrapy Project
First, let's create a project so we have something real to look at.
Open your terminal and type:
scrapy startproject bookstore
You can replace "bookstore" with whatever name you want. I'm using bookstore because we're going to scrape book data.
Scrapy creates this structure:
bookstore/
    scrapy.cfg
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Yeah, I know. It looks like a lot.
But here's the truth: you'll only work with 2 or 3 of these files regularly. The rest? You can ignore them until you actually need them.
Let me show you what each file does.
The Project Structure Explained
The Outer bookstore/ Folder
This is just a container for your project. Nothing special. It holds everything else.
scrapy.cfg (You Can Ignore This)
What it is: A configuration file for deploying your scraper to servers.
Do you need it? Not unless you're deploying to Scrapyd or similar services.
My advice: Don't even open this file yet. Just skip it.
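If you do peek inside anyway, there's not much to see. For a project named bookstore, the generated file looks roughly like this (the [deploy] section only matters once you start using Scrapyd):

[settings]
default = bookstore.settings

[deploy]
#url = http://localhost:6800/
project = bookstore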
The Inner bookstore/ Folder
This is your actual Python package. All your code lives here.
Notice it has the same name as your project? That can be confusing at first:
- Outer folder = just a container
- Inner folder = where the real work happens
__init__.py (Leave It Alone)
What it is: An empty file that tells Python "this folder is a package."
Do you need to touch it? Nope. Just let it be.
settings.py (Important!)
What it is: Your control panel. This is where you configure how Scrapy behaves.
When you open it, you'll see a bunch of commented-out settings. Don't let that intimidate you.
Here's what you'll actually change:
# Make your scraper look like a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Be polite - add delays between requests
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
# Respect robots.txt (be a good internet citizen)
ROBOTSTXT_OBEY = True
That's it. Everything else can stay default for now.
Why this matters: Without a proper USER_AGENT, sites know you're a bot. Without delays, you'll get blocked. It's that simple.
spiders/ Folder (This is Where You'll Live)
What it is: This folder holds all your spider files. Each spider is a separate Python file.
Right now it's empty (except for that __init__.py file). Let's create your first spider:
cd bookstore
scrapy genspider books books.toscrape.com
This creates a new file: bookstore/spiders/books.py
Here's what Scrapy generated:
import scrapy
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
Let me break this down:
name = "books"
This is your spider's ID. You'll use it to run your spider: scrapy crawl books
allowed_domains
Scrapy will only follow URLs on these domains; requests to anything else get filtered out. It's a safety feature that keeps your spider from wandering off across the rest of the web.
start_urls
Where your spider begins. Scrapy automatically visits these URLs.
def parse(self, response)
This is where you write code to extract data. The response variable contains the HTML from the page.
Here's a real working example:
def parse(self, response):
    for book in response.css('article.product_pod'):
        yield {
            'title': book.css('h3 a::attr(title)').get(),
            'price': book.css('.price_color::text').get(),
        }

    # Follow the next page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
This scrapes book titles and prices, then follows pagination links automatically.
items.py (Optional, You Can Skip This)
What it is: Where you define the structure of your data.
Think of it like creating a template. You're saying "every book I scrape will have these fields."
import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
Do you actually need this? No, not when you're starting out.
You can just return dictionaries from your spider (like {'title': 'Harry Potter', 'price': '$10'}). That works perfectly fine.
Items become useful when:
- Your project gets bigger
- You want type checking
- You're working with a team
My advice: Skip this file for now. Use simple dictionaries. Come back to Items later when you're comfortable with Scrapy basics.
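That said, if you're curious how Items get used, here's a rough sketch of a parse method that fills in the BookItem from above instead of yielding a dictionary (it assumes your project is named bookstore, so the class lives in bookstore/items.py):

from bookstore.items import BookItem

def parse(self, response):
    for book in response.css('article.product_pod'):
        # Items are filled in just like dictionaries
        item = BookItem()
        item['title'] = book.css('h3 a::attr(title)').get()
        item['price'] = book.css('.price_color::text').get()
        yield item

The upside: if you ever assign a field you didn't declare (say, item['pirce']), Scrapy raises an error instead of silently saving bad data.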
pipelines.py (Use When You Need It)
What it is: Where you process, clean, and store your scraped data.
Think of it as an assembly line. Your spider scrapes raw data, and pipelines clean it up and save it somewhere.
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Clean the price
        item['price'] = item['price'].replace('$', '').strip()
        item['price'] = float(item['price'])
        return item
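One gotcha: Scrapy ignores your pipelines until you enable them in settings.py. Assuming the project is named bookstore, that looks like this (the number controls the order when you have several pipelines; lower runs first):

ITEM_PIPELINES = {
    'bookstore.pipelines.CleanDataPipeline': 300,
}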
Do you need this? Not immediately.
Scrapy can save your data to JSON or CSV without any pipelines. Just run:
scrapy crawl books -o books.json
Use pipelines when you need to:
- Clean or transform data
- Save to a database
- Remove duplicates
- Validate data
My advice: Start without pipelines. Add them when your data needs processing.
middlewares.py (Advanced Stuff)
What it is: Code that intercepts requests before they're sent and responses before they reach your spider.
Middlewares let you:
- Rotate user agents
- Use proxies
- Add custom headers
- Handle retries differently
Do you need this? Probably not for your first 10 projects.
This is advanced territory. You'll know when you need it (usually when sites start blocking you and you need to get creative).
My advice: Ignore this file completely for now.
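That said, if you're curious what one looks like, here's a minimal sketch of a downloader middleware that picks a random user agent for each request. The USER_AGENTS list is just an illustration; swap in whatever strings you like:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request goes out
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request as usual

Like pipelines, a middleware only runs if you enable it, via the DOWNLOADER_MIDDLEWARES setting in settings.py. Again: file this under "later."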
The Commands You'll Actually Use
Now that you understand the structure, here are the commands you'll run every day:
Create a New Project
scrapy startproject project_name
Example:
scrapy startproject amazon_scraper
Create a New Spider
cd project_name
scrapy genspider spider_name domain.com
Example:
cd amazon_scraper
scrapy genspider products amazon.com
List All Your Spiders
scrapy list
Shows all spiders in your project.
Run a Spider
scrapy crawl spider_name
Example:
scrapy crawl books
Save Output to a File
scrapy crawl spider_name -o filename.json
Examples:
scrapy crawl books -o books.json
scrapy crawl books -o books.csv
scrapy crawl books -o books.xml
The file appears in your project root directory.
Test Your Selectors (Super Useful!)
scrapy shell "https://example.com"
This opens an interactive shell where you can test CSS selectors:
>>> response.css('h1::text').get()
'Welcome to Example'
>>> response.css('.price::text').getall()
['$10.99', '$12.50', '$8.99']
This is incredibly useful. Test your selectors here before putting them in your spider. It'll save you hours of debugging.
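One more shell trick: you can load a different page without restarting the shell by calling fetch, and the response object updates to point at the new page:

>>> fetch("https://books.toscrape.com/catalogue/page-2.html")
>>> response.url
'https://books.toscrape.com/catalogue/page-2.html'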
View a Page (As Scrapy Sees It)
scrapy view "https://example.com"
Opens the page in your browser showing exactly what Scrapy downloaded. Great for debugging.
A Real Example: Building Your First Spider
Let me show you the complete workflow from start to finish.
Step 1: Create the Project
scrapy startproject quotes
cd quotes
Step 2: Update Settings
Open quotes/settings.py and change these lines:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
Step 3: Create a Spider
scrapy genspider quote_spider quotes.toscrape.com
Step 4: Test Your Selectors First
scrapy shell "https://quotes.toscrape.com"
Try these:
>>> response.css('div.quote span.text::text').get()
>>> response.css('small.author::text').get()
>>> response.css('div.tags a.tag::text').getall()
Perfect! These selectors work.
Step 5: Write Your Spider
Open quotes/spiders/quote_spider.py and replace the code:
import scrapy
class QuoteSpiderSpider(scrapy.Spider):
    name = "quote_spider"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Step 6: Run It
scrapy crawl quote_spider -o quotes.json
Step 7: Check Your Data
Open quotes.json and boom! You've got all the quotes.
Congratulations! You just built a complete web scraper.
What You Actually Need to Remember
Here's the reality: you don't need to understand every file right now.
Focus on these:
- spiders/ folder (where your scraping code lives)
- settings.py (set USER_AGENT and DOWNLOAD_DELAY)
- scrapy crawl command (how you run spiders)
- scrapy shell command (for testing selectors)
You can ignore:
- scrapy.cfg (deployment stuff)
- items.py (use dictionaries instead)
- pipelines.py (until you need data processing)
- middlewares.py (advanced features)
Start simple. Get a spider working. Add complexity later when you actually need it.
Common Questions Beginners Ask
Q: Do I need to use items.py?
Nope. Dictionaries work perfectly fine. Use Items when your project gets bigger or you want more structure.
Q: When do I need pipelines?
When you need to clean data, save to databases, or remove duplicates. Otherwise, -o output.json works great.
Q: Can I have multiple spiders in one project?
Yes! That's actually the point. Put all related spiders in one project. Each spider gets its own file in the spiders/ folder.
Q: How do I run a specific spider?
Use the name attribute you set inside your spider class: scrapy crawl spider_name
Q: My spider isn't finding anything. What's wrong?
99% of the time, your CSS selectors are wrong. Use scrapy shell to test them first.
Q: How do I know what CSS selector to use?
Open the page in your browser, right-click on what you want to scrape, and select "Inspect Element." Look at the HTML structure.
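For example, if Inspect Element shows you something like <p class="price_color">£51.77</p>, the class attribute is what you target in your selector:

>>> response.css('p.price_color::text').get()
'£51.77'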
The Visual Overview
Here's what your project looks like with everything in place:
bookstore/                      # Project container
│
├── scrapy.cfg                  # Deployment config (ignore)
│
└── bookstore/                  # Your Python package
    │
    ├── __init__.py             # Package marker (ignore)
    │
    ├── items.py                # Data structures (optional)
    │
    ├── middlewares.py          # Advanced features (later)
    │
    ├── pipelines.py            # Data processing (when needed)
    │
    ├── settings.py             # CONFIGURE THIS
    │                           #   Set: USER_AGENT, DOWNLOAD_DELAY
    │
    └── spiders/                # YOUR CODE LIVES HERE
        │
        ├── __init__.py         # Package marker (ignore)
        │
        ├── books.py            # Your book spider
        ├── reviews.py          # Maybe a reviews spider
        └── authors.py          # Maybe an authors spider
Your Workflow (Reality Check)
Here's what you'll actually do every time you start a new scraping project:
- Create project: scrapy startproject myproject
- Update settings.py (USER_AGENT, DOWNLOAD_DELAY)
- Create spider: scrapy genspider myspider example.com
- Test selectors: scrapy shell "https://example.com"
- Write your spider code
- Run it: scrapy crawl myspider -o data.json
- Check your data
That's it. That's the whole workflow.
You'll spend 90% of your time writing and debugging spider code. Everything else is just setup.
One More Thing
Don't try to learn everything at once. I made that mistake when I started.
I tried to understand Items, Pipelines, Middlewares, and all the settings before I'd even built a working spider. It was overwhelming and unnecessary.
Instead:
- Build a simple spider that works
- Get comfortable with CSS selectors
- Run a few successful scrapes
- Then (and only then) start adding complexity
Master the basics first. Everything else can wait.
Final Thoughts
Scrapy's project structure looks intimidating, but you're really only working with two things:
- Your spider code (in the spiders/ folder)
- Basic settings (in settings.py)
Everything else? You'll learn it when you need it.
Start with a simple spider. Get it working. Feel that success. Then build from there.
That's how everyone learns Scrapy. That's how I learned it. And that's how you'll learn it too.
Now go build something!
Got questions? Drop them in the comments. We're all learning together, and there's no such thing as a stupid question.
Happy scraping! 🕷️