I scraped 100,000 products and saved them to a JSON file. The file was 500MB. I wanted to find products under $50. I had to load the entire 500MB file into memory just to search it.
My computer crashed.
Then I learned to save data to a database. I could search instantly. Filter by price. Sort by date. Update products. All without loading everything into memory.
Let me show you databases in the simplest way possible.
What is a Database? (Super Simple Explanation)
A database is like a super organized filing cabinet for your data.
JSON file (what you know):
One big file
All data mixed together
Hard to search
Hard to update
Slow with lots of data
Database (what you need):
Organized into tables
Easy to search
Easy to update
Fast even with millions of records
Think of it like this:
JSON file = Throwing all papers in one big box
- Need to dump out entire box to find anything
- Messy
- Slow
Database = Organized filing cabinet with labels
- Open specific drawer
- Find exactly what you need
- Fast
- Organized
Two Types of Databases
Type 1: SQL Databases (Relational)
Examples: MySQL, PostgreSQL, SQLite
Think of it like: Excel spreadsheet with multiple sheets
Structure:
Products Table:
┌────┬──────────┬─────────┬───────┐
│ ID │ Name     │ Price   │ Stock │
├────┼──────────┼─────────┼───────┤
│ 1  │ Laptop   │ 999.99  │ 5     │
│ 2  │ Mouse    │ 29.99   │ 50    │
│ 3  │ Keyboard │ 79.99   │ 20    │
└────┴──────────┴─────────┴───────┘
Reviews Table:
┌────┬────────────┬──────────┬────────┐
│ ID │ Product_ID │ Rating   │ Text   │
├────┼────────────┼──────────┼────────┤
│ 1  │ 1          │ 5        │ Great! │
│ 2  │ 1          │ 4        │ Good   │
│ 3  │ 2          │ 5        │ Love   │
└────┴────────────┴──────────┴────────┘
Good for:
- Structured data
- Relationships between data
- Complex queries
- Data that fits in tables
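To make "relationships between data" concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names just mirror the example tables above): reviews point back to products through a product_id column, and a JOIN stitches them back together.

import sqlite3

conn = sqlite3.connect('example.db')
cur = conn.cursor()

# Two tables, linked by product_id (mirrors the tables above)
cur.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER)")
cur.execute("CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, product_id INTEGER REFERENCES products(id), rating INTEGER, text TEXT)")

cur.execute("INSERT INTO products (name, price, stock) VALUES (?, ?, ?)", ('Laptop', 999.99, 5))
laptop_id = cur.lastrowid
cur.execute("INSERT INTO reviews (product_id, rating, text) VALUES (?, ?, ?)", (laptop_id, 5, 'Great!'))
conn.commit()

# JOIN: pull each product together with its reviews
for name, rating, text in cur.execute("""
    SELECT products.name, reviews.rating, reviews.text
    FROM products
    JOIN reviews ON products.id = reviews.product_id
"""):
    print(name, rating, text)

conn.close()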
Type 2: NoSQL Databases (Non-Relational)
Examples: MongoDB, Redis, Elasticsearch
Think of it like: Pile of documents, each can be different
Structure:
Document 1:
{
  "name": "Laptop",
  "price": 999.99,
  "specs": {
    "ram": "16GB",
    "cpu": "i7"
  }
}
Document 2:
{
  "name": "Mouse",
  "price": 29.99,
  "color": "black",
  "wireless": true
}
Good for:
- Flexible data
- Data that changes structure
- Very large datasets
- Fast writes
When to Use SQL vs NoSQL
Use SQL When:
1. Data has clear structure
Every product has:
- Name
- Price
- Stock
- Category
(Same fields every time)
2. Need relationships
Products → Reviews
Products → Categories
Orders → Products
(Data connects to other data)
3. Need complex queries
-- Find products under $50 in Electronics category
-- with average rating above 4 stars
SELECT products.name
FROM products
JOIN reviews ON products.id = reviews.product_id
WHERE products.price < 50
AND products.category = 'Electronics'
GROUP BY products.id
HAVING AVG(reviews.rating) > 4
4. Data integrity matters
Money transactions
Inventory counts
User accounts
(Must be accurate!)
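One reason SQL databases get trusted with money and inventory is transactions: a group of changes either all commits or all rolls back. A tiny sketch with Python's built-in sqlite3 (the inventory table and the simulated failure are made up for illustration):

import sqlite3

conn = sqlite3.connect('shop.db')
conn.execute("CREATE TABLE IF NOT EXISTS inventory (product TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT OR IGNORE INTO inventory VALUES ('Laptop', 5)")
conn.commit()

try:
    # Everything inside 'with conn:' is one transaction:
    # it commits on success, or rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE inventory SET stock = stock - 1 WHERE product = 'Laptop'")
        raise RuntimeError("payment failed")  # simulate a failure mid-transaction
except RuntimeError:
    pass

# Stock is still 5: the partial update was rolled back
print(conn.execute("SELECT stock FROM inventory WHERE product = 'Laptop'").fetchone())
conn.close()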
Use NoSQL When:
1. Flexible data structure
Some products have 5 fields
Some have 50 fields
Each different
(Structure varies)
2. Very large scale
Billions of records
Terabytes of data
Need horizontal scaling
3. Fast writes needed
Logging
Real-time data
High-speed inserts
4. Simple queries
Get product by ID
List recent items
Basic filters
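For comparison with the SQL query above, the "simple queries" NoSQL databases are built for look like this in MongoDB, sketched with pymongo (the database, collection, and field names are placeholders):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
products = client['scrapy_data']['products']

# Get one product by a field value
laptop = products.find_one({'name': 'Laptop'})

# List the 10 most recently inserted documents (sorted by _id)
recent = products.find().sort('_id', -1).limit(10)

# Basic filter: products under $50
cheap = products.find({'price': {'$lt': 50}})
for product in cheap:
    print(product)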
Database Options for Scrapy
Option 1: SQLite (Easiest!)
What it is: File-based SQL database
Pros:
- No installation needed
- Just one file
- Perfect for learning
- Built into Python
Cons:
- Struggles with concurrent writers (e.g., several spiders at once)
- Slower with very large datasets
- No network access (local file only)
Cost: Free
When to use:
- Learning
- Small projects (< 100,000 records)
- Single spider
- Local development
Option 2: PostgreSQL (Most Popular)
What it is: Full-featured SQL database
Pros:
- Powerful
- Fast
- Handles large data
- Good documentation
- Free and open source
Cons:
- Requires installation
- Setup needed
- More complex
Cost: Free (self-hosted)
When to use:
- Serious projects
- Need reliability
- Structured data
- Multiple spiders
Option 3: MySQL
What it is: Popular SQL database
Pros:
- Very popular
- Lots of tutorials
- Good performance
- Easy to find hosting
Cons:
- Setup required
- License considerations
Cost: Free (community edition)
When to use:
- Similar to PostgreSQL
- Already know MySQL
- Existing MySQL infrastructure
Option 4: MongoDB (NoSQL)
What it is: Document database
Pros:
- Flexible schema
- Fast for simple queries
- Easy to scale
- Good for varied data
Cons:
- Not good for complex queries
- Requires learning new concepts
Cost: Free (self-hosted)
When to use:
- Data structure varies
- Very large scale
- Fast writes needed
- Simple queries only
Option 5: Elasticsearch (NoSQL)
What it is: Search and analytics database
Pros:
- Amazing search capabilities
- Full-text search
- Real-time analytics
- Great for logs
Cons:
- Complex setup
- Memory intensive
- Overkill for simple use
Cost: Free (basic)
When to use:
- Need search features
- Text-heavy data
- Analytics needed
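Elasticsearch is the only option that doesn't get its own pipeline later in this post, so here is a rough sketch of indexing and searching with the official elasticsearch Python client (8.x API; the index name and fields are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Index (store) a scraped document
es.index(index='products', document={'name': 'Wireless Mouse', 'price': 29.99})

# Full-text search across the indexed documents
response = es.search(index='products', query={'match': {'name': 'mouse'}})
for hit in response['hits']['hits']:
    print(hit['_source'])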
Using SQLite with Scrapy (Easiest Start)
Let's start with the simplest option: SQLite.
Step 1: Install Library
# SQLite is built-in, but we need this helper
pip install dataset
Step 2: Create Pipeline
# pipelines.py
import dataset

class SQLitePipeline:
    def __init__(self):
        # Connect to database (creates file if doesn't exist)
        self.db = dataset.connect('sqlite:///scraped_data.db')

    def process_item(self, item, spider):
        # Save to 'products' table
        table = self.db['products']
        table.insert(dict(item))
        spider.logger.info(f"Saved: {item.get('name')}")
        return item
Step 3: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 300,
}
Step 4: Run Spider
scrapy crawl myspider
Data is automatically saved to the scraped_data.db file!
Step 5: Query Data
# query.py
import dataset

db = dataset.connect('sqlite:///scraped_data.db')
table = db['products']

# Get all products
for product in table.all():
    print(product['name'], product['price'])

# Find products under $50
cheap_products = table.find(price={'<': 50})
for product in cheap_products:
    print(product)

# Count total products
print(f"Total: {len(table)}")
That's it! Super simple.
Using PostgreSQL with Scrapy (Production)
For serious projects, use PostgreSQL.
Step 1: Install PostgreSQL
On Ubuntu:
sudo apt update
sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
On Mac:
brew install postgresql
brew services start postgresql
On Windows:
Download from: https://www.postgresql.org/download/
Step 2: Create Database
# Access PostgreSQL (shell)
sudo -u postgres psql

-- Create database (run inside psql)
CREATE DATABASE scrapy_data;

-- Create user
CREATE USER scrapy_user WITH PASSWORD 'your_password';

-- Grant permissions
GRANT ALL PRIVILEGES ON DATABASE scrapy_data TO scrapy_user;

-- Exit psql
\q
Step 3: Install Python Library
pip install psycopg2-binary
Step 4: Create Pipeline
# pipelines.py
import psycopg2

class PostgresPipeline:
    def __init__(self):
        # Database connection details
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()

        # Create table if it doesn't exist
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            # Insert data
            self.cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = CURRENT_TIMESTAMP
            """, (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))
            self.connection.commit()
            spider.logger.info(f"Saved to PostgreSQL: {item.get('name')}")
        except Exception as e:
            spider.logger.error(f"Error saving to PostgreSQL: {e}")
            self.connection.rollback()
        return item

    def close_spider(self, spider):
        # Close connection when spider closes
        self.cursor.close()
        self.connection.close()
Step 5: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 300,
}
Step 6: Query Data
# query.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)
cursor = conn.cursor()

# Get products under $50
cursor.execute("SELECT name, price FROM products WHERE price < 50")
for row in cursor.fetchall():
    print(f"{row[0]}: ${row[1]}")

# Count total
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

cursor.close()
conn.close()
Using MongoDB with Scrapy (NoSQL)
For flexible data, use MongoDB.
Step 1: Install MongoDB
On Ubuntu (note: recent Ubuntu releases no longer ship a mongodb package; install mongodb-org from MongoDB's official repository instead):
sudo apt install mongodb
sudo systemctl start mongodb
On Mac:
brew tap mongodb/brew
brew install mongodb-community
brew services start mongodb-community
Or use Docker (easiest):
docker run -d -p 27017:27017 mongo
Step 2: Install Python Library
pip install pymongo
Step 3: Create Pipeline
# pipelines.py
import pymongo

class MongoPipeline:
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient('localhost', 27017)
        # Use database
        self.db = self.client['scrapy_data']
        # Use collection (like a table)
        self.collection = self.db['products']

    def process_item(self, item, spider):
        # Insert document
        self.collection.insert_one(dict(item))
        spider.logger.info(f"Saved to MongoDB: {item.get('name')}")
        return item

    def close_spider(self, spider):
        self.client.close()
Step 4: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
Step 5: Query Data
# query.py
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['scrapy_data']
collection = db['products']

# Get all products
for product in collection.find():
    print(product['name'], product['price'])

# Find products under $50
cheap = collection.find({'price': {'$lt': 50}})
for product in cheap:
    print(product)

# Count
print(f"Total: {collection.count_documents({})}")
Handling Updates (Important!)
What if you scrape the same product twice?
Option 1: Skip Duplicates
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    # Check if this URL is already in the collection
    existing = self.collection.find_one({'url': item['url']})
    if existing:
        spider.logger.info(f"Skipping duplicate: {item['url']}")
        raise DropItem("Duplicate")
    # Save if new
    self.collection.insert_one(dict(item))
    return item
Option 2: Update if Exists
def process_item(self, item, spider):
    # Update or insert (upsert)
    self.collection.update_one(
        {'url': item['url']},   # Find by URL
        {'$set': dict(item)},   # Update with new data
        upsert=True             # Insert if doesn't exist
    )
    return item
return item
PostgreSQL version:
self.cursor.execute("""
    INSERT INTO products (name, price, url)
    VALUES (%s, %s, %s)
    ON CONFLICT (url) DO UPDATE
    SET name = EXCLUDED.name,
        price = EXCLUDED.price
""", (item['name'], item['price'], item['url']))
Complete Real Example
Let's build a complete product scraper with PostgreSQL.
Spider
# spiders/products.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': float(product.css('.price::text').get().replace('$', '')),
                'url': response.urljoin(product.css('a::attr(href)').get()),
                'category': product.css('.category::text').get(),
                'in_stock': product.css('.stock::text').get() == 'In Stock'
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Pipeline
# pipelines.py
import psycopg2

class ProductPipeline:
    def __init__(self):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()
        self.create_table()

    def create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT NOT NULL,
                price DECIMAL(10, 2),
                url TEXT UNIQUE NOT NULL,
                category TEXT,
                in_stock BOOLEAN,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""
                INSERT INTO products (name, price, url, category, in_stock)
                VALUES (%s, %s, %s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    category = EXCLUDED.category,
                    in_stock = EXCLUDED.in_stock,
                    updated_at = CURRENT_TIMESTAMP
            """, (
                item['name'],
                item['price'],
                item['url'],
                item['category'],
                item['in_stock']
            ))
            self.connection.commit()
            spider.logger.info(f"✓ Saved: {item['name']}")
        except Exception as e:
            spider.logger.error(f"✗ Error: {e}")
            self.connection.rollback()
        return item

    def close_spider(self, spider):
        # Get stats
        self.cursor.execute("SELECT COUNT(*) FROM products")
        total = self.cursor.fetchone()[0]
        spider.logger.info(f"Total products in database: {total}")
        self.cursor.close()
        self.connection.close()
Settings
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ProductPipeline': 300,
}
# Other settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
Query Script
# analyze.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)
cursor = conn.cursor()

# Statistics
print("Product Statistics")
print("=" * 50)

# Total products
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

# In stock
cursor.execute("SELECT COUNT(*) FROM products WHERE in_stock = true")
print(f"In stock: {cursor.fetchone()[0]}")

# Average price
cursor.execute("SELECT AVG(price) FROM products")
print(f"Average price: ${cursor.fetchone()[0]:.2f}")

# Products by category
cursor.execute("""
    SELECT category, COUNT(*), AVG(price)
    FROM products
    GROUP BY category
    ORDER BY COUNT(*) DESC
""")
print("\nBy Category:")
for row in cursor.fetchall():
    print(f"  {row[0]}: {row[1]} products, avg ${row[2]:.2f}")

# Cheapest products
print("\nCheapest 5 Products:")
cursor.execute("SELECT name, price FROM products ORDER BY price LIMIT 5")
for row in cursor.fetchall():
    print(f"  {row[0]}: ${row[1]}")

cursor.close()
conn.close()
Connection Pooling (Important!)
For multiple spiders, use connection pooling:
# pipelines.py
from psycopg2 import pool

class PooledPostgresPipeline:
    connection_pool = None

    @classmethod
    def from_crawler(cls, crawler):
        # Create connection pool once for all spiders
        if cls.connection_pool is None:
            cls.connection_pool = pool.SimpleConnectionPool(
                1,   # Min connections
                10,  # Max connections
                host='localhost',
                database='scrapy_data',
                user='scrapy_user',
                password='your_password'
            )
        return cls()

    def process_item(self, item, spider):
        # Get connection from pool
        connection = self.connection_pool.getconn()
        cursor = connection.cursor()
        try:
            cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
            """, (item['name'], item['price'], item['url']))
            connection.commit()
        except Exception as e:
            connection.rollback()
            spider.logger.error(f"Error: {e}")
        finally:
            cursor.close()
            # Return connection to pool
            self.connection_pool.putconn(connection)
        return item
Common Mistakes
Mistake 1: Not Closing Connections
# BAD: Connection never closes
def __init__(self):
    self.connection = psycopg2.connect(...)
# No close_spider method!

# GOOD: Always close
def close_spider(self, spider):
    self.connection.close()
Mistake 2: Not Handling Duplicates
-- BAD: Crashes on duplicate URL
INSERT INTO products (url) VALUES (...)

-- GOOD: Handle duplicates
INSERT INTO products (url) VALUES (...)
ON CONFLICT (url) DO UPDATE ...
Mistake 3: Not Using Transactions
# BAD: No commit, data lost!
cursor.execute("INSERT ...")
# Forgot self.connection.commit()!
# GOOD: Always commit
cursor.execute("INSERT ...")
self.connection.commit()
Mistake 4: Storing Everything as Text
-- BAD: Wrong data types
price TEXT
in_stock TEXT
date TEXT

-- GOOD: Correct types
price DECIMAL(10, 2)
in_stock BOOLEAN
date TIMESTAMP
When to Use What?
Use SQLite When:
- Learning databases
- Small projects (< 100,000 records)
- Single spider
- Local development
- Quick prototypes
Use PostgreSQL When:
- Production projects
- Need reliability
- Multiple spiders
- Complex queries
- Data integrity important
Use MySQL When:
- Already know MySQL
- Existing MySQL servers
- Similar to PostgreSQL use cases
Use MongoDB When:
- Data structure varies widely
- Very flexible schema needed
- Simple queries only
- Large scale (millions+ records)
Use Elasticsearch When:
- Need full-text search
- Analytics required
- Log storage
- Real-time search
Quick Decision Guide
How many records?
< 10,000 → SQLite (simplest)
< 1,000,000 → PostgreSQL (best)
> 1,000,000 → PostgreSQL or MongoDB
Data structure?
Always same fields → SQL (PostgreSQL)
Different fields → NoSQL (MongoDB)
Need relationships?
Products → Reviews → Users → SQL
Just documents → NoSQL
Query complexity?
Complex (JOINs, GROUP BY) → SQL
Simple (find by ID) → NoSQL
Summary
What is a database?
Organized storage for your scraped data.
Two types:
- SQL (tables, structured, relationships)
- NoSQL (documents, flexible, simple)
Best options:
- Learning: SQLite
- Production: PostgreSQL
- Flexible data: MongoDB
Basic pipeline:
class DatabasePipeline:
    def __init__(self):
        ...  # Connect to database

    def process_item(self, item, spider):
        ...  # Save item
        return item

    def close_spider(self, spider):
        ...  # Close connection
Remember:
- SQLite for learning
- PostgreSQL for production
- Handle duplicates (upsert)
- Always close connections
- Use correct data types
- Commit transactions
When to use databases:
- More than 1,000 records
- Need to search/filter data
- Updating existing records
- Data has relationships
- Long-term storage
When NOT to use:
- Less than 100 records (JSON is fine)
- One-time scrape
- Just learning Scrapy basics
- Data goes elsewhere immediately
Start simple with SQLite. Upgrade to PostgreSQL when you need it!
Happy scraping! 🕷️