Muhammad Ikramullah Khan

Databases with Scrapy: The Beginner's Guide

I scraped 100,000 products and saved them to a JSON file. The file was 500MB. I wanted to find products under $50. I had to load the entire 500MB file into memory just to search it.

My computer crashed.

Then I learned to save data to a database. I could search instantly. Filter by price. Sort by date. Update products. All without loading everything into memory.

Let me show you databases in the simplest way possible.


What is a Database? (Super Simple Explanation)

A database is like a super organized filing cabinet for your data.

JSON file (what you know):

One big file
All data mixed together
Hard to search
Hard to update
Slow with lots of data

Database (what you need):

Organized into tables
Easy to search
Easy to update
Fast even with millions of records

Think of it like this:

JSON file = Throwing all papers in one big box

  • Need to dump out entire box to find anything
  • Messy
  • Slow

Database = Organized filing cabinet with labels

  • Open specific drawer
  • Find exactly what you need
  • Fast
  • Organized

Two Types of Databases

Type 1: SQL Databases (Relational)

Examples: MySQL, PostgreSQL, SQLite

Think of it like an Excel spreadsheet with multiple sheets.

Structure:

Products Table:
┌────┬──────────┬─────────┬───────┐
│ ID │ Name     │ Price   │ Stock │
├────┼──────────┼─────────┼───────┤
│ 1  │ Laptop   │ 999.99  │ 5     │
│ 2  │ Mouse    │ 29.99   │ 50    │
│ 3  │ Keyboard │ 79.99   │ 20    │
└────┴──────────┴─────────┴───────┘

Reviews Table:
┌────┬────────────┬──────────┬────────┐
│ ID │ Product_ID │ Rating   │ Text   │
├────┼────────────┼──────────┼────────┤
│ 1  │ 1          │ 5        │ Great! │
│ 2  │ 1          │ 4        │ Good   │
│ 3  │ 2          │ 5        │ Love   │
└────┴────────────┴──────────┴────────┘

Good for:

  • Structured data
  • Relationships between data
  • Complex queries
  • Data that fits in tables
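
For example, the two tables above could be created with SQL like this (a sketch in PostgreSQL syntax; the exact column types are an assumption):

CREATE TABLE products (
    id    SERIAL PRIMARY KEY,
    name  TEXT NOT NULL,
    price DECIMAL(10, 2),
    stock INTEGER
);

CREATE TABLE reviews (
    id         SERIAL PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),  -- the link back to products
    rating     INTEGER,
    text       TEXT
);

The REFERENCES line is what makes the relationship between reviews and products explicit, and it's what the JOIN query further down relies on.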

Type 2: NoSQL Databases (Non-Relational)

Examples: MongoDB, Redis, Elasticsearch

Think of it like a pile of documents, where each one can have a different shape.

Structure:

Document 1:
{
  "name": "Laptop",
  "price": 999.99,
  "specs": {
    "ram": "16GB",
    "cpu": "i7"
  }
}

Document 2:
{
  "name": "Mouse",
  "price": 29.99,
  "color": "black",
  "wireless": true
}

Good for:

  • Flexible data
  • Data that changes structure
  • Very large datasets
  • Fast writes

When to Use SQL vs NoSQL

Use SQL When:

1. Data has clear structure

Every product has:
- Name
- Price
- Stock
- Category
(Same fields every time)

2. Need relationships

Products → Reviews
Products → Categories
Orders → Products
(Data connects to other data)

3. Need complex queries

-- Find products under $50 in Electronics category
-- with average rating above 4 stars
SELECT products.name 
FROM products 
JOIN reviews ON products.id = reviews.product_id
WHERE products.price < 50 
AND products.category = 'Electronics'
GROUP BY products.id
HAVING AVG(reviews.rating) > 4

4. Data integrity matters

Money transactions
Inventory counts
User accounts
(Must be accurate!)
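
SQL databases protect this kind of data with transactions: a group of changes either all succeed or all get rolled back. A minimal sketch with psycopg2 (the orders table and connection details are illustrative assumptions):

import psycopg2

conn = psycopg2.connect(host='localhost', database='scrapy_data',
                        user='scrapy_user', password='your_password')
try:
    with conn:  # commits on success, rolls back automatically on error
        with conn.cursor() as cur:
            cur.execute("UPDATE products SET stock = stock - 1 WHERE id = %s", (1,))
            cur.execute("INSERT INTO orders (product_id) VALUES (%s)", (1,))
finally:
    conn.close()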

Use NoSQL When:

1. Flexible data structure

Some products have 5 fields
Some have 50 fields
Each different
(Structure varies)

2. Very large scale

Billions of records
Terabytes of data
Need horizontal scaling

3. Fast writes needed

Logging
Real-time data
High-speed inserts

4. Simple queries

Get product by ID
List recent items
Basic filters

Database Options for Scrapy

Option 1: SQLite (Easiest!)

What it is: File-based SQL database

Pros:

  • No installation needed
  • Just one file
  • Perfect for learning
  • Built into Python

Cons:

  • Not built for concurrent writes (multiple spiders at once)
  • Slower with very large datasets
  • No network access (local file only)

Cost: Free

When to use:

  • Learning
  • Small projects (< 100,000 records)
  • Single spider
  • Local development

Option 2: PostgreSQL (Most Popular)

What it is: Full-featured SQL database

Pros:

  • Powerful
  • Fast
  • Handles large data
  • Good documentation
  • Free and open source

Cons:

  • Requires installation
  • Setup needed
  • More complex

Cost: Free (self-hosted)

When to use:

  • Serious projects
  • Need reliability
  • Structured data
  • Multiple spiders

Option 3: MySQL

What it is: Popular SQL database

Pros:

  • Very popular
  • Lots of tutorials
  • Good performance
  • Easy to find hosting

Cons:

  • Setup required
  • License considerations

Cost: Free (community edition)

When to use:

  • Similar to PostgreSQL
  • Already know MySQL
  • Existing MySQL infrastructure

Option 4: MongoDB (NoSQL)

What it is: Document database

Pros:

  • Flexible schema
  • Fast for simple queries
  • Easy to scale
  • Good for varied data

Cons:

  • Not good for complex queries
  • Requires learning new concepts

Cost: Free (self-hosted)

When to use:

  • Data structure varies
  • Very large scale
  • Fast writes needed
  • Simple queries only

Option 5: Elasticsearch (NoSQL)

What it is: Search and analytics database

Pros:

  • Amazing search capabilities
  • Full-text search
  • Real-time analytics
  • Great for logs

Cons:

  • Complex setup
  • Memory intensive
  • Overkill for simple use

Cost: Free (basic)

When to use:

  • Need search features
  • Text-heavy data
  • Analytics needed
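
Elasticsearch doesn't get a walkthrough below, but for orientation, a minimal pipeline sketch using the official elasticsearch Python client (version 8.x) against a local node could look like this (the index name and connection URL are assumptions):

# pipelines.py
from elasticsearch import Elasticsearch

class ElasticsearchPipeline:
    def __init__(self):
        # Connect to a local Elasticsearch node
        self.es = Elasticsearch("http://localhost:9200")

    def process_item(self, item, spider):
        # Index the item as a JSON document; the index is created on first use
        self.es.index(index="products", document=dict(item))
        return item

    def close_spider(self, spider):
        self.es.close()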

Using SQLite with Scrapy (Easiest Start)

Let's start with the simplest option: SQLite.

Step 1: Install Library

# SQLite itself is built into Python; the dataset library just makes it easier to use
pip install dataset

Step 2: Create Pipeline

# pipelines.py
import dataset

class SQLitePipeline:
    def __init__(self):
        # Connect to the database (creates the file if it doesn't exist)
        self.db = dataset.connect('sqlite:///scraped_data.db')

    def process_item(self, item, spider):
        # Save to 'products' table
        table = self.db['products']
        table.insert(dict(item))

        spider.logger.info(f"Saved: {item.get('name')}")
        return item

Step 3: Enable Pipeline

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 300,
}

Step 4: Run Spider

scrapy crawl myspider

The data is automatically saved to the scraped_data.db file!

Step 5: Query Data

# query.py
import dataset

db = dataset.connect('sqlite:///scraped_data.db')
table = db['products']

# Get all products
for product in table.all():
    print(product['name'], product['price'])

# Find products under $50
cheap_products = table.find(price={'<': 50})
for product in cheap_products:
    print(product)

# Count total products
print(f"Total: {len(table)}")

That's it! Super simple.
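
Prefer zero dependencies? The same pipeline can be written with Python's built-in sqlite3 module instead of dataset. A minimal sketch (the table and column names are assumptions; adjust them to match your items):

# pipelines.py
import sqlite3

class SQLitePipeline:
    def __init__(self):
        # Connect (creates the file if it doesn't exist) and make sure the table exists
        self.connection = sqlite3.connect('scraped_data.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                name TEXT,
                price REAL,
                url TEXT UNIQUE
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        # INSERT OR REPLACE acts as a simple upsert keyed on the UNIQUE url column
        self.cursor.execute(
            "INSERT OR REPLACE INTO products (name, price, url) VALUES (?, ?, ?)",
            (item.get('name'), item.get('price'), item.get('url'))
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.connection.close()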


Using PostgreSQL with Scrapy (Production)

For serious projects, use PostgreSQL.

Step 1: Install PostgreSQL

On Ubuntu:

sudo apt update
sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql

On Mac:

brew install postgresql
brew services start postgresql

On Windows:
Download from: https://www.postgresql.org/download/

Step 2: Create Database

# Access PostgreSQL from the shell
sudo -u postgres psql

-- Inside psql: create the database
CREATE DATABASE scrapy_data;

-- Create a user
CREATE USER scrapy_user WITH PASSWORD 'your_password';

-- Grant permissions
GRANT ALL PRIVILEGES ON DATABASE scrapy_data TO scrapy_user;

-- Exit psql
\q

Step 3: Install Python Library

pip install psycopg2-binary

Step 4: Create Pipeline

# pipelines.py
import psycopg2

class PostgresPipeline:
    def __init__(self):
        # Database connection details
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()

        # Create the table if it doesn't exist
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            # Insert data
            self.cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = CURRENT_TIMESTAMP
            """, (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))

            self.connection.commit()
            spider.logger.info(f"Saved to PostgreSQL: {item.get('name')}")

        except Exception as e:
            spider.logger.error(f"Error saving to PostgreSQL: {e}")
            self.connection.rollback()

        return item

    def close_spider(self, spider):
        # Close connection when spider closes
        self.cursor.close()
        self.connection.close()

Step 5: Enable Pipeline

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 300,
}
Enter fullscreen mode Exit fullscreen mode

Step 6: Query Data

# query.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)

cursor = conn.cursor()

# Get products under $50
cursor.execute("SELECT name, price FROM products WHERE price < 50")
for row in cursor.fetchall():
    print(f"{row[0]}: ${row[1]}")

# Count total
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

cursor.close()
conn.close()
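
MySQL works almost exactly the same way. For reference, a minimal sketch of the equivalent pipeline using the pymysql driver (install it with pip install pymysql); the database name, credentials, and the products table with a UNIQUE url column are assumptions mirroring the PostgreSQL example above:

# pipelines.py
import pymysql

class MySQLPipeline:
    def __init__(self):
        self.connection = pymysql.connect(
            host='localhost',
            user='scrapy_user',
            password='your_password',
            database='scrapy_data'
        )
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        # ON DUPLICATE KEY UPDATE is MySQL's upsert; it relies on the UNIQUE index on url
        self.cursor.execute("""
            INSERT INTO products (name, price, url)
            VALUES (%s, %s, %s)
            ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)
        """, (item.get('name'), item.get('price'), item.get('url')))
        self.connection.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()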

Using MongoDB with Scrapy (NoSQL)

For flexible data, use MongoDB.

Step 1: Install MongoDB

On Ubuntu (older releases; if apt can't find the mongodb package on newer versions, follow MongoDB's official install guide or use the Docker option below):

sudo apt install mongodb
sudo systemctl start mongodb

On Mac:

brew tap mongodb/brew
brew install mongodb-community
brew services start mongodb-community

Or use Docker (easiest):

docker run -d -p 27017:27017 mongo

Step 2: Install Python Library

pip install pymongo

Step 3: Create Pipeline

# pipelines.py
import pymongo

class MongoPipeline:
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient('localhost', 27017)

        # Use database
        self.db = self.client['scrapy_data']

        # Use collection (like a table)
        self.collection = self.db['products']

    def process_item(self, item, spider):
        # Insert document
        self.collection.insert_one(dict(item))
        spider.logger.info(f"Saved to MongoDB: {item.get('name')}")
        return item

    def close_spider(self, spider):
        self.client.close()

Step 4: Enable Pipeline

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}

Step 5: Query Data

# query.py
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['scrapy_data']
collection = db['products']

# Get all products
for product in collection.find():
    print(product['name'], product['price'])

# Find products under $50
cheap = collection.find({'price': {'$lt': 50}})
for product in cheap:
    print(product)

# Count
print(f"Total: {collection.count_documents({})}")

Handling Updates (Important!)

What if you scrape the same product twice?

Option 1: Skip Duplicates

from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    # Check if exists
    existing = self.collection.find_one({'url': item['url']})

    if existing:
        spider.logger.info(f"Skipping duplicate: {item['url']}")
        raise DropItem("Duplicate")

    # Save if new
    self.collection.insert_one(dict(item))
    return item

Option 2: Update if Exists

def process_item(self, item, spider):
    # Update or insert (upsert)
    self.collection.update_one(
        {'url': item['url']},  # Find by URL
        {'$set': dict(item)},  # Update with new data
        upsert=True  # Insert if doesn't exist
    )
    return item

PostgreSQL version:

self.cursor.execute("""
    INSERT INTO products (name, price, url)
    VALUES (%s, %s, %s)
    ON CONFLICT (url) DO UPDATE
    SET name = EXCLUDED.name,
        price = EXCLUDED.price
""", (item['name'], item['price'], item['url']))
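
And if you are using the SQLite pipeline with the dataset library from earlier, it has a built-in upsert helper that does the same thing (assuming url is the column you treat as unique):

def process_item(self, item, spider):
    # Update the row whose url matches, or insert a new one
    self.db['products'].upsert(dict(item), ['url'])
    return item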

Complete Real Example

Let's build a complete product scraper with PostgreSQL.

Spider

# spiders/products.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': float(product.css('.price::text').get().replace('$', '')),
                'url': response.urljoin(product.css('a::attr(href)').get()),
                'category': product.css('.category::text').get(),
                'in_stock': product.css('.stock::text').get() == 'In Stock'
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

Pipeline

# pipelines.py
import psycopg2
from datetime import datetime

class ProductPipeline:
    def __init__(self):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()
        self.create_table()

    def create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT NOT NULL,
                price DECIMAL(10, 2),
                url TEXT UNIQUE NOT NULL,
                category TEXT,
                in_stock BOOLEAN,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""
                INSERT INTO products (name, price, url, category, in_stock)
                VALUES (%s, %s, %s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    category = EXCLUDED.category,
                    in_stock = EXCLUDED.in_stock,
                    updated_at = CURRENT_TIMESTAMP
            """, (
                item['name'],
                item['price'],
                item['url'],
                item['category'],
                item['in_stock']
            ))

            self.connection.commit()
            spider.logger.info(f"✓ Saved: {item['name']}")

        except Exception as e:
            spider.logger.error(f"✗ Error: {e}")
            self.connection.rollback()

        return item

    def close_spider(self, spider):
        # Get stats
        self.cursor.execute("SELECT COUNT(*) FROM products")
        total = self.cursor.fetchone()[0]
        spider.logger.info(f"Total products in database: {total}")

        self.cursor.close()
        self.connection.close()

Settings

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ProductPipeline': 300,
}

# Other settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1

Query Script

# analyze.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)

cursor = conn.cursor()

# Statistics
print("Product Statistics")
print("="*50)

# Total products
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

# In stock
cursor.execute("SELECT COUNT(*) FROM products WHERE in_stock = true")
print(f"In stock: {cursor.fetchone()[0]}")

# Average price
cursor.execute("SELECT AVG(price) FROM products")
print(f"Average price: ${cursor.fetchone()[0]:.2f}")

# Products by category
cursor.execute("""
    SELECT category, COUNT(*), AVG(price)
    FROM products
    GROUP BY category
    ORDER BY COUNT(*) DESC
""")

print("\nBy Category:")
for row in cursor.fetchall():
    print(f"  {row[0]}: {row[1]} products, avg ${row[2]:.2f}")

# Cheapest products
print("\nCheapest 5 Products:")
cursor.execute("SELECT name, price FROM products ORDER BY price LIMIT 5")
for row in cursor.fetchall():
    print(f"  {row[0]}: ${row[1]}")

cursor.close()
conn.close()

Connection Pooling (Important!)

For multiple spiders, use connection pooling:

# pipelines.py
from psycopg2 import pool

class PooledPostgresPipeline:
    connection_pool = None

    @classmethod
    def from_crawler(cls, crawler):
        # Create connection pool once for all spiders
        if cls.connection_pool is None:
            cls.connection_pool = pool.SimpleConnectionPool(
                1,  # Min connections
                10,  # Max connections
                host='localhost',
                database='scrapy_data',
                user='scrapy_user',
                password='your_password'
            )
        return cls()

    def process_item(self, item, spider):
        # Get connection from pool
        connection = self.connection_pool.getconn()
        cursor = connection.cursor()

        try:
            cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
            """, (item['name'], item['price'], item['url']))

            connection.commit()

        except Exception as e:
            connection.rollback()
            spider.logger.error(f"Error: {e}")

        finally:
            cursor.close()
            # Return connection to pool
            self.connection_pool.putconn(connection)

        return item
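
One thing the sketch above leaves out is releasing the pool. If you run a single spider per process, you can close every pooled connection when it finishes; with several spiders sharing the pool, you would want to do this only once the last one is done. A minimal version for the single-spider case:

    def close_spider(self, spider):
        # Close all connections held by the shared pool
        if self.connection_pool is not None:
            self.connection_pool.closeall()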

Common Mistakes

Mistake 1: Not Closing Connections

# BAD: Connection never closes
def __init__(self):
    self.connection = psycopg2.connect(...)
# No close_spider method!

# GOOD: Always close
def close_spider(self, spider):
    self.connection.close()

Mistake 2: Not Handling Duplicates

# BAD: Crashes on duplicate URL
INSERT INTO products (url) VALUES (...)

# GOOD: Handle duplicates
INSERT INTO products (url) VALUES (...)
ON CONFLICT (url) DO UPDATE ...

Mistake 3: Not Using Transactions

# BAD: No commit, data lost!
cursor.execute("INSERT ...")
# Forgot self.connection.commit()!

# GOOD: Always commit
cursor.execute("INSERT ...")
self.connection.commit()

Mistake 4: Storing Everything as Text

# BAD: Wrong data types
price TEXT
in_stock TEXT
date TEXT

# GOOD: Correct types
price DECIMAL(10, 2)
in_stock BOOLEAN
date TIMESTAMP

When to Use What?

Use SQLite When:

  • Learning databases
  • Small projects (< 100,000 records)
  • Single spider
  • Local development
  • Quick prototypes

Use PostgreSQL When:

  • Production projects
  • Need reliability
  • Multiple spiders
  • Complex queries
  • Data integrity important

Use MySQL When:

  • Already know MySQL
  • Existing MySQL servers
  • Similar to PostgreSQL use cases

Use MongoDB When:

  • Data structure varies widely
  • Very flexible schema needed
  • Simple queries only
  • Large scale (millions+ records)

Use Elasticsearch When:

  • Need full-text search
  • Analytics required
  • Log storage
  • Real-time search

Quick Decision Guide

How many records?

< 100,000   → SQLite (simplest)
< 1,000,000 → PostgreSQL (best)
> 1,000,000 → PostgreSQL or MongoDB

Data structure?

Always same fields  → SQL (PostgreSQL)
Different fields    → NoSQL (MongoDB)

Need relationships?

Products → Reviews → Users → SQL
Just documents             → NoSQL

Query complexity?

Complex (JOINs, GROUP BY) → SQL
Simple (find by ID)       → NoSQL

Summary

What is a database?
Organized storage for your scraped data.

Two types:

  • SQL (tables, structured, relationships)
  • NoSQL (documents, flexible, simple)

Best options:

  • Learning: SQLite
  • Production: PostgreSQL
  • Flexible data: MongoDB

Basic pipeline:

class DatabasePipeline:
    def __init__(self):
        pass  # connect to the database here

    def process_item(self, item, spider):
        # Save item
        return item

    def close_spider(self, spider):
        pass  # close the connection here

Remember:

  • SQLite for learning
  • PostgreSQL for production
  • Handle duplicates (upsert)
  • Always close connections
  • Use correct data types
  • Commit transactions

When to use databases:

  • More than 1,000 records
  • Need to search/filter data
  • Updating existing records
  • Data has relationships
  • Long-term storage

When NOT to use:

  • Less than 100 records (JSON is fine)
  • One-time scrape
  • Just learning Scrapy basics
  • Data goes elsewhere immediately

Start simple with SQLite. Upgrade to PostgreSQL when you need it!

Happy scraping! 🕷️
