Muhammad Ikramullah Khan

PostgreSQL for Scrapers: From CSV Hell to Proper Data Storage

You've been scraping for 3 months. You have 127 CSV files in a folder named "scraped_data".

products_march.csv, products_march_v2.csv, products_march_final.csv, products_march_final_REAL.csv, products_april_01.csv, products_april_02.csv...

You need to find when a specific product's price changed. You open 50 files by hand. You grep through them. You write Python scripts to loop through CSVs. It takes 2 hours to answer one question.

Then your boss asks: "Which products had the biggest price drops this month?" You spend an entire afternoon writing more scripts. Pandas DataFrames everywhere. Memory errors on large files. Duplicate products across files. Inconsistent column names.

You realize something. CSV files are not a database. They're just text files with commas. The moment you need to query data, compare records, or track changes over time, CSV breaks down completely.

Time for a real database.

PostgreSQL solves this. One database. All your scraped data in tables. Query anything in seconds. Track price changes automatically. No more file sprawl. No more custom scripts for every question.

Let me show you how.


When CSV Files Stop Working

Let's be honest about when you need a database.

Scenario 1: Finding Historical Data

CSV approach:

# You have 90 CSV files, one per day
# Question: What was the price of product X on March 15?

import pandas as pd
import glob

files = glob.glob('products_*.csv')
for file in files:
    if 'march_15' in file:
        df = pd.read_csv(file)
        product = df[df['sku'] == 'ABC123']
        if not product.empty:
            print(product['price'].values[0])
            break

Takes 5 minutes to write. Fails if filename doesn't match pattern.

PostgreSQL approach:

SELECT price 
FROM products 
WHERE sku = 'ABC123' 
  AND scraped_date = '2024-03-15';

Takes 5 seconds to write. Always works.

Scenario 2: Tracking Price Changes

CSV approach:

# Compare today's prices with last week
import pandas as pd

today = pd.read_csv('products_2024_03_22.csv')
last_week = pd.read_csv('products_2024_03_15.csv')

# Merge on SKU
merged = today.merge(
    last_week, 
    on='sku', 
    suffixes=('_today', '_last_week')
)

# Find changes
changes = merged[merged['price_today'] != merged['price_last_week']]

Works for two files. Breaks when you need to check 90 days of data.

PostgreSQL approach:

SELECT 
    sku,
    price,
    LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as previous_price,
    price - LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as price_change
FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '90 days';

Handles any date range. One query.

Scenario 3: Deduplication

CSV approach:

# Combine 30 CSV files and remove duplicates
import pandas as pd
import glob

dfs = []
for file in glob.glob('products_*.csv'):
    df = pd.read_csv(file)
    dfs.append(df)

combined = pd.concat(dfs, ignore_index=True)
unique = combined.drop_duplicates(subset=['sku'], keep='last')

Uses gigabytes of memory. Slow. And you have to rerun it every time new data arrives.

PostgreSQL approach:

-- Duplicates are prevented by the UNIQUE constraint on sku (and upserts)
-- Or query the latest row per product anytime:
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;

Instant. No memory issues.

Signs You Need a Database

You know you need PostgreSQL when:

  • You have more than 10 CSV files of the same data
  • You need to query data by date, product, or any field
  • You're tracking changes over time
  • You're comparing data across multiple scrapes
  • You're running out of memory loading CSVs
  • You spend more time managing files than analyzing data
  • You need multiple people to access the data
  • You're building reports or dashboards

If you checked 3 or more, stop reading and install PostgreSQL now.


Installing PostgreSQL (The Easy Way)

Skip the complicated installation guides. Use Docker.

Why Docker?

  • Works on Windows, Mac, Linux identically
  • No system-wide installation
  • Easy to delete and start fresh
  • Isolated from your main system
  • Includes all tools you need

Install Docker

Mac:

# Download Docker Desktop from docker.com
# Or use homebrew:
brew install --cask docker

Windows:

# Download Docker Desktop from docker.com
# Make sure WSL2 is enabled

Linux:

# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose

# Start Docker
sudo systemctl start docker
sudo systemctl enable docker

Start PostgreSQL Container

# Create a directory for data persistence
mkdir -p ~/postgres_data

# Run PostgreSQL container
docker run -d \
  --name postgres_scraping \
  -e POSTGRES_USER=scraper \
  -e POSTGRES_PASSWORD=scraping123 \
  -e POSTGRES_DB=scraped_data \
  -p 5432:5432 \
  -v ~/postgres_data:/var/lib/postgresql/data \
  postgres:15

# Check it's running
docker ps

PostgreSQL is now running on localhost:5432.

Connection details:

  • Host: localhost
  • Port: 5432
  • Database: scraped_data
  • User: scraper
  • Password: scraping123
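
Before going further, it's worth a quick sanity check that Python can reach the container. A minimal sketch, assuming the credentials above and the psycopg2-binary package (installed later in this guide with pip install psycopg2-binary):

# connection_check.py
import psycopg2

conn = psycopg2.connect(
    host='localhost',
    port=5432,
    database='scraped_data',
    user='scraper',
    password='scraping123',
)
with conn.cursor() as cur:
    cur.execute('SELECT version();')
    print(cur.fetchone()[0])  # e.g. "PostgreSQL 15.x ..."
conn.close()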

Install psql (PostgreSQL Client)

# Mac
brew install postgresql

# Ubuntu/Debian
sudo apt-get install postgresql-client

# Windows
# Download from postgresql.org or use Docker

Connect to PostgreSQL

# Using psql
psql -h localhost -U scraper -d scraped_data

# You'll see:
# Password for user scraper: (enter: scraping123)
# scraped_data=#

You're in! PostgreSQL is ready.


Your First Table: Products

Let's create a table for scraped products.

Understanding Table Structure

A table has:

  • Columns: Fields (name, price, rating)
  • Rows: Individual records (one product)
  • Data types: What kind of data (text, numbers, dates)
  • Constraints: Rules (not null, unique, primary key)

Creating the Products Table

-- Connect first with: psql -h localhost -U scraper -d scraped_data

-- Create products table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(100) UNIQUE NOT NULL,
    name TEXT NOT NULL,
    brand VARCHAR(200),
    category VARCHAR(200),
    price DECIMAL(10, 2),
    original_price DECIMAL(10, 2),
    rating DECIMAL(3, 2),
    review_count INTEGER,
    in_stock BOOLEAN,
    url TEXT,
    scraped_date DATE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Add indexes for faster queries
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_scraped_date ON products(scraped_date);
CREATE INDEX idx_products_category ON products(category);

What each column means:

  • id: Auto-incrementing unique identifier
  • sku: Product SKU (stock keeping unit), unique
  • name: Product name (can be long, use TEXT)
  • brand: Brand name (limited length, VARCHAR)
  • price: Current price (DECIMAL for exact money)
  • scraped_date: When you scraped this (for tracking changes)
  • created_at: When row was added to database

Column types:

  • SERIAL: Auto-incrementing integer (1, 2, 3...)
  • VARCHAR(n): Text with max length
  • TEXT: Text with no length limit
  • DECIMAL(10,2): Numbers with decimals (10 total digits, 2 after decimal)
  • INTEGER: Whole numbers
  • BOOLEAN: True/false
  • DATE: Date without time (2024-03-15)
  • TIMESTAMP: Date with time (2024-03-15 14:30:00)
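
A quick illustration of why DECIMAL (not a float column) is the right choice for prices: binary floats pick up tiny rounding errors, which matters the moment you start comparing prices across scrapes. A small Python demonstration:

from decimal import Decimal

print(0.10 + 0.20)                        # 0.30000000000000004
print(0.10 + 0.20 == 0.30)                # False
print(Decimal('0.10') + Decimal('0.20'))  # 0.30 -- exact, like DECIMAL in PostgreSQL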

Inserting Your First Product

INSERT INTO products (
    sku, name, brand, category, price, original_price, 
    rating, review_count, in_stock, url, scraped_date
) VALUES (
    'ABC123',
    'Laptop Pro 15',
    'TechBrand',
    'Electronics',
    1299.99,
    1499.99,
    4.5,
    235,
    true,
    'https://example.com/products/laptop-pro-15',
    '2024-03-15'
);

Querying Your Data

-- Get all products
SELECT * FROM products;

-- Get specific columns
SELECT name, price, rating FROM products;

-- Filter by category
SELECT * FROM products WHERE category = 'Electronics';

-- Find expensive products
SELECT * FROM products WHERE price > 1000 ORDER BY price DESC;

-- Count products by category
SELECT category, COUNT(*) as product_count 
FROM products 
GROUP BY category;

You just stored and queried data. No CSV files. No Pandas scripts.


Loading CSV Data Into PostgreSQL

You have existing CSV files. Let's migrate them.

Method 1: Using COPY Command (Fastest)

-- Assuming CSV has matching columns
COPY products (sku, name, brand, category, price, rating, scraped_date)
FROM '/path/to/products.csv'
WITH (FORMAT csv, HEADER true);

Advantages:

  • Extremely fast (COPY is PostgreSQL's fastest bulk-load path)
  • Built-in to PostgreSQL

Disadvantages:

  • The file must be readable by the PostgreSQL server process (psql's \copy and psycopg2's copy_expert stream a local file instead; see the sketch below)
  • Column names must match exactly
  • Less flexible error handling
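
If your CSV lives on your laptop rather than on the database server, you don't have to give up COPY's speed: psql's \copy runs the same command client-side, and from Python, psycopg2's copy_expert streams a local file over the connection. A minimal sketch, assuming the CSV columns appear in exactly this order:

import psycopg2

conn = psycopg2.connect(
    host='localhost', database='scraped_data',
    user='scraper', password='scraping123'
)

copy_sql = """
    COPY products (sku, name, brand, category, price, rating, scraped_date)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

# Stream the local file through the connection; the file never has to
# exist on the PostgreSQL server.
with conn, conn.cursor() as cur, open('products.csv') as f:
    cur.copy_expert(copy_sql, f)

conn.close()
print('CSV loaded')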

Method 2: Using Python (Most Flexible)

import pandas as pd
from sqlalchemy import create_engine

# Read CSV
df = pd.read_csv('products.csv')

# Clean data (using our Pandas cleaning from previous blog!)
df['price'] = df['price'].apply(clean_price)
df['scraped_date'] = pd.to_datetime(df['scraped_date'])

# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

# Load data
df.to_sql(
    'products',
    engine,
    if_exists='append',  # 'replace' to overwrite, 'append' to add
    index=False
)

print(f'Loaded {len(df)} products into PostgreSQL')

This approach:

  • Lets you clean data first
  • Handles type conversions
  • Better error messages
  • Can transform data before loading
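
One optional refinement: if you ever let to_sql create the table itself instead of using the CREATE TABLE above, the dtype argument pins the SQL types that Pandas would otherwise guess (it can map price to DOUBLE PRECISION, for example). When appending to an existing table, the table's own types win, so this is a sketch for that table-creation case only; the column names are taken from the schema above:

import pandas as pd
from sqlalchemy import create_engine, types

engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
df = pd.read_csv('products.csv')

df.to_sql(
    'products',
    engine,
    if_exists='append',
    index=False,
    dtype={
        'price': types.Numeric(10, 2),
        'rating': types.Numeric(3, 2),
        'scraped_date': types.Date(),
    },
)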

Method 3: Batch Loading Multiple CSVs

import pandas as pd
from sqlalchemy import create_engine
import glob

# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

# Find all CSV files
csv_files = glob.glob('scraped_data/*.csv')

total_loaded = 0

for csv_file in csv_files:
    print(f'Loading {csv_file}...')

    # Read and clean
    df = pd.read_csv(csv_file)
    df['price'] = pd.to_numeric(df['price'], errors='coerce')
    df['scraped_date'] = pd.to_datetime(df['scraped_date'])

    # Remove nulls in critical fields
    df = df.dropna(subset=['sku', 'name', 'price'])

    # Load to database
    df.to_sql('products', engine, if_exists='append', index=False)

    total_loaded += len(df)
    print(f'Loaded {len(df)} products')

print(f'\nTotal: {total_loaded} products loaded into PostgreSQL')

All your CSVs are now in one database. Query anything instantly.
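
To confirm the migration, a quick row count and date range straight from Python (same connection string as above):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

summary = pd.read_sql(
    'SELECT COUNT(*) AS total_rows, MIN(scraped_date) AS first_scrape, '
    'MAX(scraped_date) AS last_scrape FROM products',
    engine
)
print(summary)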


Scrapy Pipeline to PostgreSQL

Skip the CSV step entirely. Save directly from Scrapy to PostgreSQL.

Installing Required Libraries

pip install psycopg2-binary sqlalchemy

Creating the Pipeline

# pipelines.py

import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime

class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None
        self.items_buffer = []
        self.buffer_size = 100  # Insert every 100 items

    def open_spider(self, spider):
        """Connect to PostgreSQL when spider starts"""
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123',
            port=5432
        )
        self.cursor = self.connection.cursor()
        spider.logger.info('Connected to PostgreSQL')

    def close_spider(self, spider):
        """Insert remaining items and close connection"""
        if self.items_buffer:
            self._insert_items()

        self.cursor.close()
        self.connection.close()
        spider.logger.info('Disconnected from PostgreSQL')

    def process_item(self, item, spider):
        """Buffer items and insert in batches"""
        self.items_buffer.append(item)

        if len(self.items_buffer) >= self.buffer_size:
            self._insert_items()

        return item

    def _insert_items(self):
        """Insert buffered items into database"""
        if not self.items_buffer:
            return

        # Prepare data for insertion
        values = [
            (
                item.get('sku'),
                item.get('name'),
                item.get('brand'),
                item.get('category'),
                item.get('price'),
                item.get('original_price'),
                item.get('rating'),
                item.get('review_count'),
                item.get('in_stock', False),
                item.get('url'),
                datetime.now().date()
            )
            for item in self.items_buffer
        ]

        # Insert with ON CONFLICT (upsert)
        query = """
            INSERT INTO products (
                sku, name, brand, category, price, original_price,
                rating, review_count, in_stock, url, scraped_date
            ) VALUES %s
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                original_price = EXCLUDED.original_price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                in_stock = EXCLUDED.in_stock,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """

        execute_values(self.cursor, query, values)
        self.connection.commit()

        print(f'Inserted {len(self.items_buffer)} products')

        # Clear buffer
        self.items_buffer = []

Enable in settings.py

# settings.py

ITEM_PIPELINES = {
    'myproject.pipelines.PostgresPipeline': 300,
}

How It Works

  1. Buffer items: Collect 100 items before inserting (faster than one-by-one)
  2. Batch insert: Insert all 100 at once
  3. Upsert logic: If SKU exists, update it; if not, insert new row
  4. Automatic dates: scraped_date set to today automatically

Benefits:

  • No CSV files created
  • Data available immediately in database
  • Handles duplicates automatically
  • Much faster than individual inserts
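
One refinement worth considering: instead of hard-coding credentials in the pipeline, Scrapy lets a pipeline read them from settings.py through the from_crawler hook. A sketch; the POSTGRES_* setting names are my own choice, so add matching keys to your settings.py:

# pipelines.py (variant that reads connection details from settings.py)
import psycopg2

class PostgresPipeline:
    def __init__(self, host, db, user, password):
        self.host = host
        self.db = db
        self.user = user
        self.password = password

    @classmethod
    def from_crawler(cls, crawler):
        # Called by Scrapy; pulls values from settings.py
        s = crawler.settings
        return cls(
            host=s.get('POSTGRES_HOST', 'localhost'),
            db=s.get('POSTGRES_DB', 'scraped_data'),
            user=s.get('POSTGRES_USER', 'scraper'),
            password=s.get('POSTGRES_PASSWORD', 'scraping123'),
        )

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host=self.host, database=self.db,
            user=self.user, password=self.password
        )
        self.cursor = self.connection.cursor()

    # process_item, _insert_items and close_spider stay as shown above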

Tracking Price Changes Over Time

The real power of databases: historical tracking.

Creating a Price History Table

CREATE TABLE price_history (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    scraped_date DATE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(sku, scraped_date)  -- One price per product per day
);

CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);

Modified Pipeline for Price Tracking

# pipelines.py

import psycopg2
from datetime import datetime

class PriceTrackingPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert into products table (current state)
        self.cursor.execute("""
            INSERT INTO products (sku, name, price, scraped_date)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                price = EXCLUDED.price,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'],
            item['name'],
            item['price'],
            datetime.now().date()
        ))

        # Insert into price_history table (historical tracking)
        self.cursor.execute("""
            INSERT INTO price_history (sku, price, scraped_date)
            VALUES (%s, %s, %s)
            ON CONFLICT (sku, scraped_date) DO NOTHING
        """, (
            item['sku'],
            item['price'],
            datetime.now().date()
        ))

        self.connection.commit()

        return item

Querying Price Changes

-- Products with price changes in last 7 days
SELECT 
    p.sku,
    p.name,
    p.price as current_price,
    ph.price as previous_price,
    p.price - ph.price as price_change,
    ((p.price - ph.price) / ph.price * 100) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
  AND p.price != ph.price
ORDER BY percent_change DESC;

-- Price trend for a specific product
SELECT 
    scraped_date,
    price
FROM price_history
WHERE sku = 'ABC123'
ORDER BY scraped_date DESC
LIMIT 30;

-- Products with biggest price drops this month
SELECT 
    sku,
    MIN(price) as lowest_price,
    MAX(price) as highest_price,
    MAX(price) - MIN(price) as price_drop
FROM price_history
WHERE scraped_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY sku
HAVING MAX(price) - MIN(price) > 0
ORDER BY price_drop DESC
LIMIT 20;

This is impossible with CSV files. With PostgreSQL, it's one query.


Useful Queries for Scraped Data

Here are queries you'll use constantly.

Finding Duplicates

-- Rows scraped multiple times on the same day
-- (relevant if your table doesn't enforce a UNIQUE sku like the schema above does)
SELECT sku, scraped_date, COUNT(*) as count
FROM products
GROUP BY sku, scraped_date
HAVING COUNT(*) > 1;

Latest Products

-- Most recently scraped products
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;

Missing Data

-- Products with missing prices
SELECT sku, name, scraped_date
FROM products
WHERE price IS NULL;

Category Statistics

-- Average price by category
SELECT 
    category,
    COUNT(*) as product_count,
    ROUND(AVG(price), 2) as avg_price,
    ROUND(MIN(price), 2) as min_price,
    ROUND(MAX(price), 2) as max_price
FROM products
WHERE price IS NOT NULL
GROUP BY category
ORDER BY avg_price DESC;

Date Range Queries

-- Products scraped in last 7 days
SELECT * FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '7 days';

-- Products scraped in March 2024
SELECT * FROM products
WHERE scraped_date >= '2024-03-01'
  AND scraped_date < '2024-04-01';

Competitor Comparison

-- Compare prices across brands for same category
SELECT 
    brand,
    COUNT(*) as products,
    ROUND(AVG(price), 2) as avg_price
FROM products
WHERE category = 'Laptops'
GROUP BY brand
ORDER BY avg_price ASC;

Exporting Data from PostgreSQL

Sometimes you still need CSVs for sharing or analysis.

Export to CSV

# Using psql
psql -h localhost -U scraper -d scraped_data -c "COPY (SELECT * FROM products WHERE scraped_date = CURRENT_DATE) TO STDOUT WITH CSV HEADER" > products_today.csv

Export with Python

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

# Query and export
query = "SELECT * FROM products WHERE scraped_date = CURRENT_DATE"
df = pd.read_sql(query, engine)
df.to_csv('products_today.csv', index=False)

print(f'Exported {len(df)} products')

Export Price History Report

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

# Get price changes
query = """
    SELECT 
        p.sku,
        p.name,
        p.category,
        p.price as current_price,
        ph.price as previous_price,
        p.price - ph.price as price_change
    FROM products p
    JOIN price_history ph ON p.sku = ph.sku
    WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
      AND p.price != ph.price
"""

df = pd.read_sql(query, engine)
df.to_excel('price_changes_report.xlsx', index=False)

print(f'Price changes report saved: {len(df)} products')

Backup and Maintenance

Databases need care.

Automated Backups

#!/bin/bash
# backup.sh

# Create backup directory
mkdir -p ~/postgres_backups

# Backup database
docker exec postgres_scraping pg_dump -U scraper scraped_data > ~/postgres_backups/backup_$(date +%Y%m%d).sql

# Keep only last 30 days of backups
find ~/postgres_backups -name "backup_*.sql" -mtime +30 -delete

echo "Backup complete: backup_$(date +%Y%m%d).sql"

# Make executable
chmod +x backup.sh

# Add to crontab (run daily at 2 AM)
crontab -e
# Add line:
0 2 * * * /path/to/backup.sh

Restoring from Backup

# Stop Scrapy spiders first!

# Restore database
docker exec -i postgres_scraping psql -U scraper scraped_data < ~/postgres_backups/backup_20240315.sql

Cleaning Old Data

-- Delete products older than 90 days
DELETE FROM products
WHERE scraped_date < CURRENT_DATE - INTERVAL '90 days';

-- Delete old price history (keep 1 year)
DELETE FROM price_history
WHERE scraped_date < CURRENT_DATE - INTERVAL '1 year';

-- Vacuum to reclaim space
VACUUM ANALYZE;

Monitoring Database Size

-- Database size
SELECT pg_size_pretty(pg_database_size('scraped_data'));

-- Table sizes
SELECT 
    schemaname,
    tablename,
    pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

Real Example: E-commerce Price Monitoring

Complete working example from scraping to querying.

1. Database Schema

-- Products table (current state)
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(100) UNIQUE NOT NULL,
    name TEXT NOT NULL,
    brand VARCHAR(200),
    category VARCHAR(200),
    price DECIMAL(10, 2),
    rating DECIMAL(3, 2),
    review_count INTEGER,
    url TEXT,
    scraped_date DATE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Price history table (historical tracking)
CREATE TABLE price_history (
    id SERIAL PRIMARY KEY,
    sku VARCHAR(100) NOT NULL,
    price DECIMAL(10, 2) NOT NULL,
    scraped_date DATE NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(sku, scraped_date)
);

-- Indexes
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);

2. Scrapy Spider

# spiders/products.py

import re

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example-shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'sku': product.css('.sku::text').get(),
                'name': product.css('.name::text').get(),
                'brand': product.css('.brand::text').get(),
                'category': product.css('.category::text').get(),
                'price': self.clean_price(product.css('.price::text').get()),
                'rating': self.clean_rating(product.css('.rating::text').get()),
                'review_count': self.clean_reviews(product.css('.reviews::text').get()),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }

        # Pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def clean_price(self, price_str):
        if not price_str:
            return None
        price = re.sub(r'[^\d.]', '', price_str)
        try:
            return float(price)
        except ValueError:
            return None

    def clean_rating(self, rating_str):
        if not rating_str:
            return None
        match = re.search(r'(\d+\.?\d*)', rating_str)
        return float(match.group(1)) if match else None

    def clean_reviews(self, reviews_str):
        if not reviews_str:
            return 0
        match = re.search(r'(\d+)', reviews_str)
        return int(match.group(1)) if match else 0

3. Pipeline

# pipelines.py

import psycopg2
from datetime import datetime

class PriceMonitoringPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        today = datetime.now().date()

        # Update products table (current state)
        self.cur.execute("""
            INSERT INTO products (
                sku, name, brand, category, price, rating, 
                review_count, url, scraped_date
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'], item['name'], item['brand'], item['category'],
            item['price'], item['rating'], item['review_count'],
            item['url'], today
        ))

        # Insert into price history (skip items with no price,
        # since price_history.price is NOT NULL)
        if item['price'] is not None:
            self.cur.execute("""
                INSERT INTO price_history (sku, price, scraped_date)
                VALUES (%s, %s, %s)
                ON CONFLICT (sku, scraped_date) DO NOTHING
            """, (item['sku'], item['price'], today))

        self.conn.commit()
        return item

4. Useful Queries

-- Daily price monitoring: Products with price changes today
SELECT 
    p.sku,
    p.name,
    p.category,
    p.price as current_price,
    ph.price as yesterday_price,
    p.price - ph.price as change,
    ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
  AND ph.scraped_date = CURRENT_DATE - 1
  AND p.price != ph.price
ORDER BY percent_change;

-- Weekly summary: Biggest price drops
SELECT 
    ph.sku,
    p.name,
    p.category,
    MIN(ph.price) as lowest_price,
    MAX(ph.price) as highest_price,
    MAX(ph.price) - MIN(ph.price) as price_drop,
    ROUND(((MAX(ph.price) - MIN(ph.price)) / MAX(ph.price) * 100), 2) as drop_percent
FROM price_history ph
JOIN products p ON p.sku = ph.sku
WHERE ph.scraped_date >= CURRENT_DATE - 7
GROUP BY ph.sku, p.name, p.category
HAVING MAX(ph.price) - MIN(ph.price) > 0
ORDER BY price_drop DESC
LIMIT 20;

-- Category trends: Average price movement by category
SELECT 
    category,
    COUNT(DISTINCT sku) as products,
    ROUND(AVG(price_change), 2) as avg_change,
    COUNT(CASE WHEN price_change < 0 THEN 1 END) as prices_down,
    COUNT(CASE WHEN price_change > 0 THEN 1 END) as prices_up
FROM (
    SELECT 
        p.category,
        p.sku,
        p.price - ph.price as price_change
    FROM products p
    JOIN price_history ph ON p.sku = ph.sku
    WHERE p.scraped_date = CURRENT_DATE
      AND ph.scraped_date = CURRENT_DATE - 7
      AND p.price != ph.price
) changes
GROUP BY category
ORDER BY avg_change;

5. Automated Report

# generate_report.py

import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')

# Get price changes
query = """
    SELECT 
        p.sku,
        p.name,
        p.category,
        p.brand,
        p.price as current_price,
        ph.price as previous_price,
        p.price - ph.price as price_change,
        ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
    FROM products p
    JOIN price_history ph ON p.sku = ph.sku
    WHERE p.scraped_date = CURRENT_DATE
      AND ph.scraped_date = CURRENT_DATE - 1
      AND p.price != ph.price
    ORDER BY percent_change
"""

df = pd.read_sql(query, engine)

# Save report (to_excel requires openpyxl: pip install openpyxl)
filename = f'price_changes_{datetime.now().strftime("%Y%m%d")}.xlsx'
df.to_excel(filename, index=False)

print(f'Report generated: {filename}')
print(f'Products with price changes: {len(df)}')
if not df.empty:
    print(f'Biggest drop: {df.iloc[0]["name"]} ({df.iloc[0]["percent_change"]}%)')
    print(f'Biggest increase: {df.iloc[-1]["name"]} ({df.iloc[-1]["percent_change"]}%)')

# Schedule daily with cron
0 9 * * * cd /path/to/project && python generate_report.py

PostgreSQL vs CSV: The Real Comparison

Let's be honest about the differences.

CSV Advantages

  • Simple (just text files)
  • Easy to inspect (open in Excel)
  • No installation required
  • Easy to share (email, cloud storage)
  • Works with any tool

CSV Disadvantages

  • No querying (need to load into memory)
  • No relationships (can't join data)
  • No indexing (slow searches)
  • Duplicates everywhere
  • File management nightmare
  • Memory limits (can't load huge files)
  • No concurrent access (one person at a time)
  • No data validation (garbage in, garbage out)

PostgreSQL Advantages

  • Instant queries (no loading needed)
  • Handle billions of rows
  • Complex filtering and aggregation
  • Relationships between tables
  • Automatic duplicate handling
  • Multiple users simultaneously
  • Data validation (constraints)
  • Transactions (all or nothing)
  • Backups and recovery

PostgreSQL Disadvantages

  • Requires installation (though Docker helps)
  • Learning curve (SQL syntax)
  • More setup than df.to_csv()

Bottom line: Use CSV for one-time scrapes or sharing with non-technical people. Use PostgreSQL for everything else.


Summary

PostgreSQL transforms how you work with scraped data.

What you learned:

  • Install PostgreSQL with Docker (5 minutes)
  • Create tables for scraped data
  • Load existing CSVs into PostgreSQL
  • Build Scrapy pipelines that write directly to database
  • Track price changes over time
  • Query data instantly without loading files
  • Export reports when needed
  • Backup and maintain your database

The workflow:

Before: Scrape → CSV files → Manual Python scripts → Hours of work

After:  Scrape → PostgreSQL → SQL queries → Seconds

When to migrate:

  • You have more than 10 CSV files of the same data
  • You need to query or filter data frequently
  • You're tracking changes over time
  • You're running out of memory with Pandas
  • You need multiple people to access data
  • You're building reports or dashboards

Getting started:

  1. Install Docker and PostgreSQL (one command)
  2. Create your first table (copy the SQL from this guide)
  3. Load one CSV file to test
  4. Write a simple query
  5. Never go back to CSV hell

PostgreSQL isn't complicated. It's actually simpler than managing 100 CSV files. You write a query. You get results. That's it.

Your scraped data deserves better than CSV files. Give it a real database.

