When I first ran scrapy startproject, I stared at all those folders and files thinking, "What is all this stuff?"
If you're feeling the same way right now, don't worry. You're exactly where I was when I started.
Let me walk you through everything. We'll create a project together, look at each file, and I'll explain what matters and what you can ignore (at least for now). By the end of this, Scrapy's structure will make complete sense.
Sound good? Let's dive in.
Creating Your First Scrapy Project
First, let's create a project so we have something real to look at.
Open your terminal and type:
scrapy startproject bookstore
You can replace "bookstore" with whatever name you want. I'm using bookstore because we're going to scrape book data.
Scrapy creates this structure:
bookstore/
    scrapy.cfg
    bookstore/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
Yeah, I know. It looks like a lot.
But here's the truth: you'll only work with 2 or 3 of these files regularly. The rest? You can ignore them until you actually need them.
Let me show you what each file does.
The Project Structure Explained
The Outer bookstore/ Folder
This is just a container for your project. Nothing special. It holds everything else.
scrapy.cfg (You Can Ignore This)
What it is: A configuration file for deploying your scraper to servers.
Do you need it? Not unless you're deploying to Scrapyd or similar services.
My advice: Don't even open this file yet. Just skip it.
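If you do peek inside anyway, there's not much to see. For a project named bookstore, the generated file looks roughly like this (the [deploy] section only matters once you start using Scrapyd):

[settings]
default = bookstore.settings

[deploy]
#url = http://localhost:6800/
project = bookstore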
The Inner bookstore/ Folder
This is your actual Python package. All your code lives here.
Notice it has the same name as your project? That can be confusing at first:
- Outer folder = just a container
- Inner folder = where the real work happens
__init__.py (Leave It Alone)
What it is: An empty file that tells Python "this folder is a package."
Do you need to touch it? Nope. Just let it be.
settings.py (Important!)
What it is: Your control panel. This is where you configure how Scrapy behaves.
When you open it, you'll see a bunch of commented-out settings. Don't let that intimidate you.
Here's what you'll actually change:
# Make your scraper look like a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
# Be polite - add delays between requests
DOWNLOAD_DELAY = 2
RANDOMIZE_DOWNLOAD_DELAY = True
# Respect robots.txt (be a good internet citizen)
ROBOTSTXT_OBEY = True
That's it. Everything else can stay default for now.
Why this matters: Without a proper USER_AGENT, sites know you're a bot. Without delays, you'll get blocked. It's that simple.
spiders/ Folder (This is Where You'll Live)
What it is: This folder holds all your spider files. Each spider is a separate Python file.
Right now it's empty (except for that __init__.py file). Let's create your first spider:
cd bookstore
scrapy genspider books books.toscrape.com
This creates a new file: bookstore/spiders/books.py
Here's what Scrapy generated:
import scrapy
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        pass
Let me break this down:
name = "books"
This is your spider's ID. You'll use it to run your spider: scrapy crawl books
allowed_domains
Scrapy will only follow URLs on these domains; requests to anything else get filtered out. It's a safety feature that keeps your spider from wandering off across the rest of the web.
start_urls
Where your spider begins. Scrapy automatically visits these URLs.
def parse(self, response)
This is where you write code to extract data. The response variable contains the HTML from the page.
Here's a real working example:
def parse(self, response):
    for book in response.css('article.product_pod'):
        yield {
            'title': book.css('h3 a::attr(title)').get(),
            'price': book.css('.price_color::text').get(),
        }

    # Follow the next page link
    next_page = response.css('li.next a::attr(href)').get()
    if next_page:
        yield response.follow(next_page, self.parse)
This scrapes book titles and prices, then follows pagination links automatically.
items.py (Optional, You Can Skip This)
What it is: Where you define the structure of your data.
Think of it like creating a template. You're saying "every book I scrape will have these fields."
import scrapy
class BookItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    rating = scrapy.Field()
Do you actually need this? No, not when you're starting out.
You can just return dictionaries from your spider (like {'title': 'Harry Potter', 'price': '$10'}). That works perfectly fine.
Items become useful when:
- Your project gets bigger
- You want type checking
- You're working with a team
My advice: Skip this file for now. Use simple dictionaries. Come back to Items later when you're comfortable with Scrapy basics.
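That said, if you're curious how Items get used, here's a rough sketch of a parse method that fills in the BookItem from above instead of yielding a dictionary (it assumes your project is named bookstore, so the class lives in bookstore/items.py):

from bookstore.items import BookItem

def parse(self, response):
    for book in response.css('article.product_pod'):
        # Items are filled in just like dictionaries
        item = BookItem()
        item['title'] = book.css('h3 a::attr(title)').get()
        item['price'] = book.css('.price_color::text').get()
        yield item

The upside: if you ever assign a field you didn't declare (say, item['pirce']), Scrapy raises an error instead of silently saving bad data.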
pipelines.py (Use When You Need It)
What it is: Where you process, clean, and store your scraped data.
Think of it as an assembly line. Your spider scrapes raw data, and pipelines clean it up and save it somewhere.
class CleanDataPipeline:
    def process_item(self, item, spider):
        # Clean the price
        item['price'] = item['price'].replace('$', '').strip()
        item['price'] = float(item['price'])
        return item
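One gotcha: Scrapy ignores your pipelines until you enable them in settings.py. Assuming the project is named bookstore, that looks like this (the number controls the order when you have several pipelines; lower runs first):

ITEM_PIPELINES = {
    'bookstore.pipelines.CleanDataPipeline': 300,
}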
Do you need this? Not immediately.
Scrapy can save your data to JSON or CSV without any pipelines. Just run:
scrapy crawl books -o books.json
Use pipelines when you need to:
- Clean or transform data
- Save to a database
- Remove duplicates
- Validate data
My advice: Start without pipelines. Add them when your data needs processing.
middlewares.py (Advanced Stuff)
What it is: Code that intercepts requests before they're sent and responses before they reach your spider.
Middlewares let you:
- Rotate user agents
- Use proxies
- Add custom headers
- Handle retries differently
Do you need this? Probably not for your first 10 projects.
This is advanced territory. You'll know when you need it (usually when sites start blocking you and you need to get creative).
My advice: Ignore this file completely for now.
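That said, if you're curious what one looks like, here's a minimal sketch of a downloader middleware that picks a random user agent for each request. The USER_AGENTS list is just an illustration; swap in whatever strings you like:

import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # Overwrite the User-Agent header before the request goes out
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None  # None tells Scrapy to keep processing the request as usual

Like pipelines, a middleware only runs if you enable it, via the DOWNLOADER_MIDDLEWARES setting in settings.py. Again: file this under "later."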
The Commands You'll Actually Use
Now that you understand the structure, here are the commands you'll run every day:
Create a New Project
scrapy startproject project_name
Example:
scrapy startproject amazon_scraper
Create a New Spider
cd project_name
scrapy genspider spider_name domain.com
Example:
cd amazon_scraper
scrapy genspider products amazon.com
List All Your Spiders
scrapy list
Shows all spiders in your project.
Run a Spider
scrapy crawl spider_name
Example:
scrapy crawl books
Save Output to a File
scrapy crawl spider_name -o filename.json
Examples:
scrapy crawl books -o books.json
scrapy crawl books -o books.csv
scrapy crawl books -o books.xml
The file appears in your project root directory.
Test Your Selectors (Super Useful!)
scrapy shell "https://example.com"
This opens an interactive shell where you can test CSS selectors:
>>> response.css('h1::text').get()
'Welcome to Example'
>>> response.css('.price::text').getall()
['$10.99', '$12.50', '$8.99']
This is incredibly useful. Test your selectors here before putting them in your spider. It'll save you hours of debugging.
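One more shell trick: you can load a different page without restarting the shell by calling fetch, and the response object updates to point at the new page:

>>> fetch("https://books.toscrape.com/catalogue/page-2.html")
>>> response.url
'https://books.toscrape.com/catalogue/page-2.html'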
View a Page (As Scrapy Sees It)
scrapy view "https://example.com"
Opens the page in your browser showing exactly what Scrapy downloaded. Great for debugging.
A Real Example: Building Your First Spider
Let me show you the complete workflow from start to finish.
Step 1: Create the Project
scrapy startproject quotes
cd quotes
Step 2: Update Settings
Open quotes/settings.py and change these lines:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
DOWNLOAD_DELAY = 1
ROBOTSTXT_OBEY = True
Step 3: Create a Spider
scrapy genspider quote_spider quotes.toscrape.com
Step 4: Test Your Selectors First
scrapy shell "https://quotes.toscrape.com"
Try these:
>>> response.css('div.quote span.text::text').get()
>>> response.css('small.author::text').get()
>>> response.css('div.tags a.tag::text').getall()
Perfect! These selectors work.
Step 5: Write Your Spider
Open quotes/spiders/quote_spider.py and replace the code:
import scrapy
class QuoteSpiderSpider(scrapy.Spider):
    name = "quote_spider"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Step 6: Run It
scrapy crawl quote_spider -o quotes.json
Step 7: Check Your Data
Open quotes.json and boom! You've got all the quotes.
Congratulations! You just built a complete web scraper.
What You Actually Need to Remember
Here's the reality: you don't need to understand every file right now.
Focus on these:
- spiders/ folder (where your scraping code lives)
- settings.py (set USER_AGENT and DOWNLOAD_DELAY)
- scrapy crawl command (how you run spiders)
- scrapy shell command (for testing selectors)
You can ignore:
- scrapy.cfg (deployment stuff)
- items.py (use dictionaries instead)
- pipelines.py (until you need data processing)
- middlewares.py (advanced features)
Start simple. Get a spider working. Add complexity later when you actually need it.
Common Questions Beginners Ask
Q: Do I need to use items.py?
Nope. Dictionaries work perfectly fine. Use Items when your project gets bigger or you want more structure.
Q: When do I need pipelines?
When you need to clean data, save to databases, or remove duplicates. Otherwise, -o output.json works great.
Q: Can I have multiple spiders in one project?
Yes! That's actually the point. Put all related spiders in one project. Each spider gets its own file in the spiders/ folder.
Q: How do I run a specific spider?
Use the name attribute you set inside your spider class: scrapy crawl spider_name
Q: My spider isn't finding anything. What's wrong?
99% of the time, your CSS selectors are wrong. Use scrapy shell to test them first.
Q: How do I know what CSS selector to use?
Open the page in your browser, right-click on what you want to scrape, and select "Inspect Element." Look at the HTML structure.
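For example, if Inspect Element shows you something like <p class="price_color">£51.77</p>, the class attribute is what you target in your selector:

>>> response.css('p.price_color::text').get()
'£51.77'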
The Visual Overview
Here's what your project looks like with everything in place:
bookstore/                      # Project container
│
├── scrapy.cfg                  # Deployment config (ignore)
│
└── bookstore/                  # Your Python package
    │
    ├── __init__.py             # Package marker (ignore)
    │
    ├── items.py                # Data structures (optional)
    │
    ├── middlewares.py          # Advanced features (later)
    │
    ├── pipelines.py            # Data processing (when needed)
    │
    ├── settings.py             # CONFIGURE THIS
    │                           #   Set: USER_AGENT, DOWNLOAD_DELAY
    │
    └── spiders/                # YOUR CODE LIVES HERE
        │
        ├── __init__.py         # Package marker (ignore)
        │
        ├── books.py            # Your book spider
        ├── reviews.py          # Maybe a reviews spider
        └── authors.py          # Maybe an authors spider
Your Workflow (Reality Check)
Here's what you'll actually do every time you start a new scraping project:
- Create project: scrapy startproject myproject
- Update settings.py (USER_AGENT, DOWNLOAD_DELAY)
- Create spider: scrapy genspider myspider example.com
- Test selectors: scrapy shell "https://example.com"
- Write your spider code
- Run it: scrapy crawl myspider -o data.json
- Check your data
That's it. That's the whole workflow.
You'll spend 90% of your time writing and debugging spider code. Everything else is just setup.
One More Thing
Don't try to learn everything at once. I made that mistake when I started.
I tried to understand Items, Pipelines, Middlewares, and all the settings before I'd even built a working spider. It was overwhelming and unnecessary.
Instead:
- Build a simple spider that works
- Get comfortable with CSS selectors
- Run a few successful scrapes
- Then (and only then) start adding complexity
Master the basics first. Everything else can wait.
Final Thoughts
Scrapy's project structure looks intimidating, but you're really only working with two things:
- Your spider code (in the spiders/ folder)
- Basic settings (in settings.py)
Everything else? You'll learn it when you need it.
Start with a simple spider. Get it working. Feel that success. Then build from there.
That's how everyone learns Scrapy. That's how I learned it. And that's how you'll learn it too.
Now go build something!
Got questions? Drop them in the comments. We're all learning together, and there's no such thing as a stupid question.
Happy scraping! 🕷️