I scraped 100,000 products and saved them to a JSON file. The file was 500MB. I wanted to find products under $50. I had to load the entire 500MB file into memory just to search it.
My computer crashed.
Then I learned to save data to a database. I could search instantly. Filter by price. Sort by date. Update products. All without loading everything into memory.
Let me show you databases in the simplest way possible.
What is a Database? (Super Simple Explanation)
A database is like a super organized filing cabinet for your data.
JSON file (what you know):
One big file
All data mixed together
Hard to search
Hard to update
Slow with lots of data
Database (what you need):
Organized into tables
Easy to search
Easy to update
Fast even with millions of records
Think of it like this:
JSON file = Throwing all papers in one big box
- Need to dump out entire box to find anything
- Messy
- Slow
Database = Organized filing cabinet with labels
- Open specific drawer
- Find exactly what you need
- Fast
- Organized
Two Types of Databases
Type 1: SQL Databases (Relational)
Examples: MySQL, PostgreSQL, SQLite
Think of it like: Excel spreadsheet with multiple sheets
Structure:
Products Table:
┌────┬──────────┬─────────┬───────┐
│ ID │ Name     │ Price   │ Stock │
├────┼──────────┼─────────┼───────┤
│ 1  │ Laptop   │ 999.99  │ 5     │
│ 2  │ Mouse    │ 29.99   │ 50    │
│ 3  │ Keyboard │ 79.99   │ 20    │
└────┴──────────┴─────────┴───────┘
Reviews Table:
┌────┬────────────┬──────────┬────────┐
│ ID │ Product_ID │ Rating   │ Text   │
├────┼────────────┼──────────┼────────┤
│ 1  │ 1          │ 5        │ Great! │
│ 2  │ 1          │ 4        │ Good   │
│ 3  │ 2          │ 5        │ Love   │
└────┴────────────┴──────────┴────────┘
Good for:
- Structured data
- Relationships between data
- Complex queries
- Data that fits in tables
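To make "relationships between data" concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and column names just mirror the example tables above): reviews point back to products through a product_id column, and a JOIN stitches them back together.

import sqlite3

conn = sqlite3.connect('example.db')
cur = conn.cursor()

# Two tables, linked by product_id (mirrors the tables above)
cur.execute("CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY, name TEXT, price REAL, stock INTEGER)")
cur.execute("CREATE TABLE IF NOT EXISTS reviews (id INTEGER PRIMARY KEY, product_id INTEGER REFERENCES products(id), rating INTEGER, text TEXT)")

cur.execute("INSERT INTO products (name, price, stock) VALUES (?, ?, ?)", ('Laptop', 999.99, 5))
laptop_id = cur.lastrowid
cur.execute("INSERT INTO reviews (product_id, rating, text) VALUES (?, ?, ?)", (laptop_id, 5, 'Great!'))
conn.commit()

# JOIN: pull each product together with its reviews
for name, rating, text in cur.execute("""
    SELECT products.name, reviews.rating, reviews.text
    FROM products
    JOIN reviews ON products.id = reviews.product_id
"""):
    print(name, rating, text)

conn.close()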
Type 2: NoSQL Databases (Non-Relational)
Examples: MongoDB, Redis, Elasticsearch
Think of it like: Pile of documents, each can be different
Structure:
Document 1:
{
  "name": "Laptop",
  "price": 999.99,
  "specs": {
    "ram": "16GB",
    "cpu": "i7"
  }
}
Document 2:
{
  "name": "Mouse",
  "price": 29.99,
  "color": "black",
  "wireless": true
}
Good for:
- Flexible data
- Data that changes structure
- Very large datasets
- Fast writes
When to Use SQL vs NoSQL
Use SQL When:
1. Data has clear structure
Every product has:
- Name
- Price
- Stock
- Category
(Same fields every time)
2. Need relationships
Products → Reviews
Products → Categories
Orders → Products
(Data connects to other data)
3. Need complex queries
-- Find products under $50 in Electronics category
-- with average rating above 4 stars
SELECT products.name
FROM products
JOIN reviews ON products.id = reviews.product_id
WHERE products.price < 50
AND products.category = 'Electronics'
GROUP BY products.id
HAVING AVG(reviews.rating) > 4
4. Data integrity matters
Money transactions
Inventory counts
User accounts
(Must be accurate!)
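One reason SQL databases get trusted with money and inventory is transactions: a group of changes either all commits or all rolls back. A tiny sketch with Python's built-in sqlite3 (the inventory table and the simulated failure are made up for illustration):

import sqlite3

conn = sqlite3.connect('shop.db')
conn.execute("CREATE TABLE IF NOT EXISTS inventory (product TEXT PRIMARY KEY, stock INTEGER)")
conn.execute("INSERT OR IGNORE INTO inventory VALUES ('Laptop', 5)")
conn.commit()

try:
    # Everything inside 'with conn:' is one transaction:
    # it commits on success, or rolls back if an exception escapes.
    with conn:
        conn.execute("UPDATE inventory SET stock = stock - 1 WHERE product = 'Laptop'")
        raise RuntimeError("payment failed")  # simulate a failure mid-transaction
except RuntimeError:
    pass

# Stock is still 5: the partial update was rolled back
print(conn.execute("SELECT stock FROM inventory WHERE product = 'Laptop'").fetchone())
conn.close()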
Use NoSQL When:
1. Flexible data structure
Some products have 5 fields
Some have 50 fields
Each different
(Structure varies)
2. Very large scale
Billions of records
Terabytes of data
Need horizontal scaling
3. Fast writes needed
Logging
Real-time data
High-speed inserts
4. Simple queries
Get product by ID
List recent items
Basic filters
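For comparison with the SQL query above, the "simple queries" NoSQL databases are built for look like this in MongoDB, sketched with pymongo (the database, collection, and field names are placeholders):

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
products = client['scrapy_data']['products']

# Get one product by a field value
laptop = products.find_one({'name': 'Laptop'})

# List the 10 most recently inserted documents (sorted by _id)
recent = products.find().sort('_id', -1).limit(10)

# Basic filter: products under $50
cheap = products.find({'price': {'$lt': 50}})
for product in cheap:
    print(product)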
Database Options for Scrapy
Option 1: SQLite (Easiest!)
What it is: File-based SQL database
Pros:
- No installation needed
- Just one file
- Perfect for learning
- Built into Python
Cons:
- Struggles with concurrent writers (e.g., several spiders at once)
- Slower with very large datasets
- No network access (local file only)
Cost: Free
When to use:
- Learning
- Small projects (< 100,000 records)
- Single spider
- Local development
Option 2: PostgreSQL (Most Popular)
What it is: Full-featured SQL database
Pros:
- Powerful
- Fast
- Handles large data
- Good documentation
- Free and open source
Cons:
- Requires installation
- Setup needed
- More complex
Cost: Free (self-hosted)
When to use:
- Serious projects
- Need reliability
- Structured data
- Multiple spiders
Option 3: MySQL
What it is: Popular SQL database
Pros:
- Very popular
- Lots of tutorials
- Good performance
- Easy to find hosting
Cons:
- Setup required
- License considerations
Cost: Free (community edition)
When to use:
- Similar to PostgreSQL
- Already know MySQL
- Existing MySQL infrastructure
Option 4: MongoDB (NoSQL)
What it is: Document database
Pros:
- Flexible schema
- Fast for simple queries
- Easy to scale
- Good for varied data
Cons:
- Not good for complex queries
- Requires learning new concepts
Cost: Free (self-hosted)
When to use:
- Data structure varies
- Very large scale
- Fast writes needed
- Simple queries only
Option 5: Elasticsearch (NoSQL)
What it is: Search and analytics database
Pros:
- Amazing search capabilities
- Full-text search
- Real-time analytics
- Great for logs
Cons:
- Complex setup
- Memory intensive
- Overkill for simple use
Cost: Free (basic)
When to use:
- Need search features
- Text-heavy data
- Analytics needed
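Elasticsearch is the only option that doesn't get its own pipeline later in this post, so here is a rough sketch of indexing and searching with the official elasticsearch Python client (8.x API; the index name and fields are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Index (store) a scraped document
es.index(index='products', document={'name': 'Wireless Mouse', 'price': 29.99})

# Full-text search across the indexed documents
response = es.search(index='products', query={'match': {'name': 'mouse'}})
for hit in response['hits']['hits']:
    print(hit['_source'])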
Using SQLite with Scrapy (Easiest Start)
Let's start with the simplest option: SQLite.
Step 1: Install Library
# SQLite is built-in, but we need this helper
pip install dataset
Step 2: Create Pipeline
# pipelines.py
import dataset

class SQLitePipeline:
    def __init__(self):
        # Connect to database (creates file if doesn't exist)
        self.db = dataset.connect('sqlite:///scraped_data.db')

    def process_item(self, item, spider):
        # Save to 'products' table
        table = self.db['products']
        table.insert(dict(item))
        spider.logger.info(f"Saved: {item.get('name')}")
        return item
Step 3: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.SQLitePipeline': 300,
}
Step 4: Run Spider
scrapy crawl myspider
Data is automatically saved to the scraped_data.db file!
Step 5: Query Data
# query.py
import dataset

db = dataset.connect('sqlite:///scraped_data.db')
table = db['products']

# Get all products
for product in table.all():
    print(product['name'], product['price'])

# Find products under $50
cheap_products = table.find(price={'<': 50})
for product in cheap_products:
    print(product)

# Count total products
print(f"Total: {len(table)}")
That's it! Super simple.
Using PostgreSQL with Scrapy (Production)
For serious projects, use PostgreSQL.
Step 1: Install PostgreSQL
On Ubuntu:
sudo apt update
sudo apt install postgresql postgresql-contrib
sudo systemctl start postgresql
On Mac:
brew install postgresql
brew services start postgresql
On Windows:
Download from: https://www.postgresql.org/download/
Step 2: Create Database
# Access PostgreSQL (shell)
sudo -u postgres psql

-- Create database (run inside psql)
CREATE DATABASE scrapy_data;

-- Create user
CREATE USER scrapy_user WITH PASSWORD 'your_password';

-- Grant permissions
GRANT ALL PRIVILEGES ON DATABASE scrapy_data TO scrapy_user;

-- Exit psql
\q
Step 3: Install Python Library
pip install psycopg2-binary
Step 4: Create Pipeline
# pipelines.py
import psycopg2

class PostgresPipeline:
    def __init__(self):
        # Database connection details
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()

        # Create table if it doesn't exist
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT,
                price DECIMAL,
                url TEXT UNIQUE,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            # Insert data
            self.cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    scraped_at = CURRENT_TIMESTAMP
            """, (
                item.get('name'),
                item.get('price'),
                item.get('url')
            ))
            self.connection.commit()
            spider.logger.info(f"Saved to PostgreSQL: {item.get('name')}")
        except Exception as e:
            spider.logger.error(f"Error saving to PostgreSQL: {e}")
            self.connection.rollback()
        return item

    def close_spider(self, spider):
        # Close connection when spider closes
        self.cursor.close()
        self.connection.close()
Step 5: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 300,
}
Step 6: Query Data
# query.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)
cursor = conn.cursor()

# Get products under $50
cursor.execute("SELECT name, price FROM products WHERE price < 50")
for row in cursor.fetchall():
    print(f"{row[0]}: ${row[1]}")

# Count total
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

cursor.close()
conn.close()
Using MongoDB with Scrapy (NoSQL)
For flexible data, use MongoDB.
Step 1: Install MongoDB
On Ubuntu (note: recent Ubuntu releases no longer ship a mongodb package; install mongodb-org from MongoDB's official repository instead):
sudo apt install mongodb
sudo systemctl start mongodb
On Mac:
brew tap mongodb/brew
brew install mongodb-community
brew services start mongodb-community
Or use Docker (easiest):
docker run -d -p 27017:27017 mongo
Step 2: Install Python Library
pip install pymongo
Step 3: Create Pipeline
# pipelines.py
import pymongo

class MongoPipeline:
    def __init__(self):
        # Connect to MongoDB
        self.client = pymongo.MongoClient('localhost', 27017)
        # Use database
        self.db = self.client['scrapy_data']
        # Use collection (like a table)
        self.collection = self.db['products']

    def process_item(self, item, spider):
        # Insert document
        self.collection.insert_one(dict(item))
        spider.logger.info(f"Saved to MongoDB: {item.get('name')}")
        return item

    def close_spider(self, spider):
        self.client.close()
Step 4: Enable Pipeline
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}
Step 5: Query Data
# query.py
import pymongo

client = pymongo.MongoClient('localhost', 27017)
db = client['scrapy_data']
collection = db['products']

# Get all products
for product in collection.find():
    print(product['name'], product['price'])

# Find products under $50
cheap = collection.find({'price': {'$lt': 50}})
for product in cheap:
    print(product)

# Count
print(f"Total: {collection.count_documents({})}")
Handling Updates (Important!)
What if you scrape the same product twice?
Option 1: Skip Duplicates
from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    # Check if this URL is already in the collection
    existing = self.collection.find_one({'url': item['url']})
    if existing:
        spider.logger.info(f"Skipping duplicate: {item['url']}")
        raise DropItem("Duplicate")
    # Save if new
    self.collection.insert_one(dict(item))
    return item
Option 2: Update if Exists
def process_item(self, item, spider):
    # Update or insert (upsert)
    self.collection.update_one(
        {'url': item['url']},   # Find by URL
        {'$set': dict(item)},   # Update with new data
        upsert=True             # Insert if doesn't exist
    )
    return item
return item
PostgreSQL version:
self.cursor.execute("""
    INSERT INTO products (name, price, url)
    VALUES (%s, %s, %s)
    ON CONFLICT (url) DO UPDATE
    SET name = EXCLUDED.name,
        price = EXCLUDED.price
""", (item['name'], item['price'], item['url']))
Complete Real Example
Let's build a complete product scraper with PostgreSQL.
Spider
# spiders/products.py
import scrapy

class ProductSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('.name::text').get(),
                'price': float(product.css('.price::text').get().replace('$', '')),
                'url': response.urljoin(product.css('a::attr(href)').get()),
                'category': product.css('.category::text').get(),
                'in_stock': product.css('.stock::text').get() == 'In Stock'
            }

        # Pagination
        next_page = response.css('.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
Pipeline
# pipelines.py
import psycopg2

class ProductPipeline:
    def __init__(self):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scrapy_data',
            user='scrapy_user',
            password='your_password'
        )
        self.cursor = self.connection.cursor()
        self.create_table()

    def create_table(self):
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id SERIAL PRIMARY KEY,
                name TEXT NOT NULL,
                price DECIMAL(10, 2),
                url TEXT UNIQUE NOT NULL,
                category TEXT,
                in_stock BOOLEAN,
                created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.connection.commit()

    def process_item(self, item, spider):
        try:
            self.cursor.execute("""
                INSERT INTO products (name, price, url, category, in_stock)
                VALUES (%s, %s, %s, %s, %s)
                ON CONFLICT (url) DO UPDATE
                SET name = EXCLUDED.name,
                    price = EXCLUDED.price,
                    category = EXCLUDED.category,
                    in_stock = EXCLUDED.in_stock,
                    updated_at = CURRENT_TIMESTAMP
            """, (
                item['name'],
                item['price'],
                item['url'],
                item['category'],
                item['in_stock']
            ))
            self.connection.commit()
            spider.logger.info(f"✓ Saved: {item['name']}")
        except Exception as e:
            spider.logger.error(f"✗ Error: {e}")
            self.connection.rollback()
        return item

    def close_spider(self, spider):
        # Get stats
        self.cursor.execute("SELECT COUNT(*) FROM products")
        total = self.cursor.fetchone()[0]
        spider.logger.info(f"Total products in database: {total}")
        self.cursor.close()
        self.connection.close()
Settings
# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.ProductPipeline': 300,
}
# Other settings
ROBOTSTXT_OBEY = True
DOWNLOAD_DELAY = 1
Query Script
# analyze.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    database='scrapy_data',
    user='scrapy_user',
    password='your_password'
)
cursor = conn.cursor()

# Statistics
print("Product Statistics")
print("=" * 50)

# Total products
cursor.execute("SELECT COUNT(*) FROM products")
print(f"Total products: {cursor.fetchone()[0]}")

# In stock
cursor.execute("SELECT COUNT(*) FROM products WHERE in_stock = true")
print(f"In stock: {cursor.fetchone()[0]}")

# Average price
cursor.execute("SELECT AVG(price) FROM products")
print(f"Average price: ${cursor.fetchone()[0]:.2f}")

# Products by category
cursor.execute("""
    SELECT category, COUNT(*), AVG(price)
    FROM products
    GROUP BY category
    ORDER BY COUNT(*) DESC
""")
print("\nBy Category:")
for row in cursor.fetchall():
    print(f"  {row[0]}: {row[1]} products, avg ${row[2]:.2f}")

# Cheapest products
print("\nCheapest 5 Products:")
cursor.execute("SELECT name, price FROM products ORDER BY price LIMIT 5")
for row in cursor.fetchall():
    print(f"  {row[0]}: ${row[1]}")

cursor.close()
conn.close()
Connection Pooling (Important!)
For multiple spiders, use connection pooling:
# pipelines.py
from psycopg2 import pool

class PooledPostgresPipeline:
    connection_pool = None

    @classmethod
    def from_crawler(cls, crawler):
        # Create connection pool once for all spiders
        if cls.connection_pool is None:
            cls.connection_pool = pool.SimpleConnectionPool(
                1,   # Min connections
                10,  # Max connections
                host='localhost',
                database='scrapy_data',
                user='scrapy_user',
                password='your_password'
            )
        return cls()

    def process_item(self, item, spider):
        # Get connection from pool
        connection = self.connection_pool.getconn()
        cursor = connection.cursor()
        try:
            cursor.execute("""
                INSERT INTO products (name, price, url)
                VALUES (%s, %s, %s)
            """, (item['name'], item['price'], item['url']))
            connection.commit()
        except Exception as e:
            connection.rollback()
            spider.logger.error(f"Error: {e}")
        finally:
            cursor.close()
            # Return connection to pool
            self.connection_pool.putconn(connection)
        return item
Common Mistakes
Mistake 1: Not Closing Connections
# BAD: Connection never closes
def __init__(self):
    self.connection = psycopg2.connect(...)
# No close_spider method!

# GOOD: Always close
def close_spider(self, spider):
    self.connection.close()
Mistake 2: Not Handling Duplicates
-- BAD: Crashes on duplicate URL
INSERT INTO products (url) VALUES (...)

-- GOOD: Handle duplicates
INSERT INTO products (url) VALUES (...)
ON CONFLICT (url) DO UPDATE ...
Mistake 3: Not Using Transactions
# BAD: No commit, data lost!
cursor.execute("INSERT ...")
# Forgot self.connection.commit()!
# GOOD: Always commit
cursor.execute("INSERT ...")
self.connection.commit()
Mistake 4: Storing Everything as Text
-- BAD: Wrong data types
price TEXT
in_stock TEXT
date TEXT

-- GOOD: Correct types
price DECIMAL(10, 2)
in_stock BOOLEAN
date TIMESTAMP
When to Use What?
Use SQLite When:
- Learning databases
- Small projects (< 100,000 records)
- Single spider
- Local development
- Quick prototypes
Use PostgreSQL When:
- Production projects
- Need reliability
- Multiple spiders
- Complex queries
- Data integrity important
Use MySQL When:
- Already know MySQL
- Existing MySQL servers
- Similar to PostgreSQL use cases
Use MongoDB When:
- Data structure varies widely
- Very flexible schema needed
- Simple queries only
- Large scale (millions+ records)
Use Elasticsearch When:
- Need full-text search
- Analytics required
- Log storage
- Real-time search
Quick Decision Guide
How many records?
< 10,000 → SQLite (simplest)
< 1,000,000 → PostgreSQL (best)
> 1,000,000 → PostgreSQL or MongoDB
Data structure?
Always same fields → SQL (PostgreSQL)
Different fields → NoSQL (MongoDB)
Need relationships?
Products → Reviews → Users → SQL
Just documents → NoSQL
Query complexity?
Complex (JOINs, GROUP BY) → SQL
Simple (find by ID) → NoSQL
Summary
What is a database?
Organized storage for your scraped data.
Two types:
- SQL (tables, structured, relationships)
- NoSQL (documents, flexible, simple)
Best options:
- Learning: SQLite
- Production: PostgreSQL
- Flexible data: MongoDB
Basic pipeline:
class DatabasePipeline:
    def __init__(self):
        ...  # Connect to database

    def process_item(self, item, spider):
        ...  # Save item
        return item

    def close_spider(self, spider):
        ...  # Close connection
Remember:
- SQLite for learning
- PostgreSQL for production
- Handle duplicates (upsert)
- Always close connections
- Use correct data types
- Commit transactions
When to use databases:
- More than 1,000 records
- Need to search/filter data
- Updating existing records
- Data has relationships
- Long-term storage
When NOT to use:
- Less than 100 records (JSON is fine)
- One-time scrape
- Just learning Scrapy basics
- Data goes elsewhere immediately
Start simple with SQLite. Upgrade to PostgreSQL when you need it!
Happy scraping! 🕷️