You've been scraping for 3 months. You have 127 CSV files in a folder named "scraped_data".
products_march.csv, products_march_v2.csv, products_march_final.csv, products_march_final_REAL.csv, products_april_01.csv, products_april_02.csv...
You need to find when a specific product's price changed. You open 50 files manually. You grep through them. You write Python scripts to loop through CSVs. It takes 2 hours to answer one question.
Then your boss asks: "Which products had the biggest price drops this month?" You spend an entire afternoon writing more scripts. Pandas DataFrames everywhere. Memory errors on large files. Duplicate products across files. Inconsistent column names.
You realize something. CSV files are not a database. They're just text files with commas. The moment you need to query data, compare records, or track changes over time, CSV breaks down completely.
Time for a real database.
PostgreSQL solves this. One database. All your scraped data in tables. Query anything in seconds. Track price changes automatically. No more file sprawl. No more custom scripts for every question.
Let me show you how.
When CSV Files Stop Working
Let's be honest about when you need a database.
Scenario 1: Finding Historical Data
CSV approach:
# You have 90 CSV files, one per day
# Question: What was the price of product X on March 15?
import pandas as pd
import glob
files = glob.glob('products_*.csv')
for file in files:
    if 'march_15' in file:
        df = pd.read_csv(file)
        product = df[df['sku'] == 'ABC123']
        if not product.empty:
            print(product['price'].values[0])
        break
Takes 5 minutes to write. Fails if filename doesn't match pattern.
PostgreSQL approach:
SELECT price
FROM products
WHERE sku = 'ABC123'
AND scraped_date = '2024-03-15';
Takes 5 seconds to write. Always works.
Scenario 2: Tracking Price Changes
CSV approach:
# Compare today's prices with last week
import pandas as pd
today = pd.read_csv('products_2024_03_22.csv')
last_week = pd.read_csv('products_2024_03_15.csv')
# Merge on SKU
merged = today.merge(
    last_week,
    on='sku',
    suffixes=('_today', '_last_week')
)
# Find changes
changes = merged[merged['price_today'] != merged['price_last_week']]
Works for two files. Breaks when you need to check 90 days of data.
PostgreSQL approach:
SELECT
sku,
price,
LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as previous_price,
price - LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as price_change
FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '90 days';
Handles any date range. One query.
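If window functions are new to you, here is what that LAG does, mirrored in plain Python on a few made-up rows: for each product, pair each price with the previous price for the same SKU.

```python
# What LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) computes,
# sketched in plain Python on hypothetical rows.
rows = [
    ("ABC123", "2024-03-14", 1399.99),
    ("ABC123", "2024-03-15", 1299.99),
    ("XYZ789", "2024-03-14", 49.99),
    ("XYZ789", "2024-03-15", 44.99),
]

def with_previous_price(rows):
    # Sort by (sku, date) so each product's rows are adjacent and ordered,
    # like PARTITION BY sku ORDER BY scraped_date.
    result = []
    prev = {}  # last seen price per sku
    for sku, date, price in sorted(rows, key=lambda r: (r[0], r[1])):
        previous = prev.get(sku)  # LAG: None on the first row of a partition
        change = price - previous if previous is not None else None
        result.append((sku, date, price, previous, change))
        prev[sku] = price
    return result

for row in with_previous_price(rows):
    print(row)
```

PostgreSQL does exactly this, but over millions of rows without loading them into memory.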
Scenario 3: Deduplication
CSV approach:
# Combine 30 CSV files and remove duplicates
import pandas as pd
import glob
dfs = []
for file in glob.glob('products_*.csv'):
    df = pd.read_csv(file)
    dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
unique = combined.drop_duplicates(subset=['sku'], keep='last')
Uses gigabytes of memory. Slow. Need to rerun every time.
PostgreSQL approach:
-- Duplicates automatically handled by primary key
-- Or query unique products anytime:
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;
Instant. No memory issues.
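What DISTINCT ON keeps is easy to mirror in plain Python on made-up rows: one row per SKU, the one with the newest scraped_date.

```python
# What SELECT DISTINCT ON (sku) ... ORDER BY sku, scraped_date DESC keeps:
# exactly one row per SKU, the most recently scraped one.
rows = [
    ("ABC123", "2024-03-14", 1399.99),
    ("ABC123", "2024-03-15", 1299.99),
    ("XYZ789", "2024-03-15", 44.99),
]

def latest_per_sku(rows):
    latest = {}
    # Iterate in date order, so later rows overwrite earlier ones per SKU
    for row in sorted(rows, key=lambda r: r[1]):
        latest[row[0]] = row
    return sorted(latest.values())

print(latest_per_sku(rows))
```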
Signs You Need a Database
You know you need PostgreSQL when:
- You have more than 10 CSV files of the same data
- You need to query data by date, product, or any field
- You're tracking changes over time
- You're comparing data across multiple scrapes
- You're running out of memory loading CSVs
- You spend more time managing files than analyzing data
- You need multiple people to access the data
- You're building reports or dashboards
If you checked 3 or more, stop reading and install PostgreSQL now.
Installing PostgreSQL (The Easy Way)
Skip the complicated installation guides. Use Docker.
Why Docker?
- Works on Windows, Mac, Linux identically
- No system-wide installation
- Easy to delete and start fresh
- Isolated from your main system
- Includes all tools you need
Install Docker
Mac:
# Download Docker Desktop from docker.com
# Or use homebrew:
brew install --cask docker
Windows:
# Download Docker Desktop from docker.com
# Make sure WSL2 is enabled
Linux:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose
# Start Docker
sudo systemctl start docker
sudo systemctl enable docker
Start PostgreSQL Container
# Create a directory for data persistence
mkdir -p ~/postgres_data
# Run PostgreSQL container
docker run -d \
  --name postgres_scraping \
  -e POSTGRES_USER=scraper \
  -e POSTGRES_PASSWORD=scraping123 \
  -e POSTGRES_DB=scraped_data \
  -p 5432:5432 \
  -v ~/postgres_data:/var/lib/postgresql/data \
  postgres:15
# Check it's running
docker ps
PostgreSQL is now running on localhost:5432.
Connection details:
- Host: localhost
- Port: 5432
- Database: scraped_data
- User: scraper
- Password: scraping123
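These details combine into a single connection URL, the form SQLAlchemy's create_engine() takes later in this guide. A small helper makes the shape obvious (note: passwords containing special characters would need URL-escaping, e.g. with urllib.parse.quote_plus):

```python
# Build the PostgreSQL connection URL from its parts:
# postgresql://user:password@host:port/database
def pg_url(user, password, host, port, database):
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

url = pg_url("scraper", "scraping123", "localhost", 5432, "scraped_data")
print(url)
```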
Install psql (PostgreSQL Client)
# Mac
brew install postgresql
# Ubuntu/Debian
sudo apt-get install postgresql-client
# Windows
# Download from postgresql.org or use Docker
Connect to PostgreSQL
# Using psql
psql -h localhost -U scraper -d scraped_data
# You'll see:
# Password for user scraper: (enter: scraping123)
# scraped_data=#
You're in! PostgreSQL is ready.
Your First Table: Products
Let's create a table for scraped products.
Understanding Table Structure
A table has:
- Columns: Fields (name, price, rating)
- Rows: Individual records (one product)
- Data types: What kind of data (text, numbers, dates)
- Constraints: Rules (not null, unique, primary key)
Creating the Products Table
-- Connect to database
psql -h localhost -U scraper -d scraped_data
-- Create products table
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) UNIQUE NOT NULL,
name TEXT NOT NULL,
brand VARCHAR(200),
category VARCHAR(200),
price DECIMAL(10, 2),
original_price DECIMAL(10, 2),
rating DECIMAL(3, 2),
review_count INTEGER,
in_stock BOOLEAN,
url TEXT,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Add indexes for faster queries
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_scraped_date ON products(scraped_date);
CREATE INDEX idx_products_category ON products(category);
What each column means:
- id: Auto-incrementing unique identifier
- sku: Product SKU (stock keeping unit), unique
- name: Product name (can be long, use TEXT)
- brand: Brand name (limited length, VARCHAR)
- price: Current price (DECIMAL for exact money)
- scraped_date: When you scraped this (for tracking changes)
- created_at: When the row was added to the database
Column types:
- SERIAL: Auto-incrementing integer (1, 2, 3...)
- VARCHAR(n): Text with a maximum length
- TEXT: Text with no length limit
- DECIMAL(10,2): Exact decimal numbers (10 total digits, 2 after the decimal point)
- INTEGER: Whole numbers
- BOOLEAN: True/false
- DATE: Date without time (2024-03-15)
- TIMESTAMP: Date with time (2024-03-15 14:30:00)
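Why DECIMAL for prices rather than a float type? Binary floats cannot represent most decimal fractions exactly, which matters for money. A quick check in Python makes the point:

```python
from decimal import Decimal

# Binary floats accumulate representation error on decimal fractions;
# exact decimals (like PostgreSQL's DECIMAL column type) do not.
print(0.10 + 0.20)                        # 0.30000000000000004
print(0.10 + 0.20 == 0.30)                # False
print(Decimal("0.10") + Decimal("0.20"))  # 0.30
```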
Inserting Your First Product
INSERT INTO products (
sku, name, brand, category, price, original_price,
rating, review_count, in_stock, url, scraped_date
) VALUES (
'ABC123',
'Laptop Pro 15',
'TechBrand',
'Electronics',
1299.99,
1499.99,
4.5,
235,
true,
'https://example.com/products/laptop-pro-15',
'2024-03-15'
);
Querying Your Data
-- Get all products
SELECT * FROM products;
-- Get specific columns
SELECT name, price, rating FROM products;
-- Filter by category
SELECT * FROM products WHERE category = 'Electronics';
-- Find expensive products
SELECT * FROM products WHERE price > 1000 ORDER BY price DESC;
-- Count products by category
SELECT category, COUNT(*) as product_count
FROM products
GROUP BY category;
You just stored and queried data. No CSV files. No Pandas scripts.
Loading CSV Data Into PostgreSQL
You have existing CSV files. Let's migrate them.
Method 1: Using COPY Command (Fastest)
-- Assuming CSV has matching columns
COPY products (sku, name, brand, category, price, rating, scraped_date)
FROM '/path/to/products.csv'
WITH (FORMAT csv, HEADER true);
Advantages:
- Extremely fast (millions of rows per second)
- Built-in to PostgreSQL
Disadvantages:
- File must be on the PostgreSQL server
- Column names must match exactly
- Less flexible error handling
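If the CSV only exists on your machine, psycopg2 can stream it to the server with COPY ... FROM STDIN instead of a server-side file path. A sketch (the column list is illustrative; adjust it to match your file):

```python
# Client-side COPY: stream a local CSV to the server over the connection,
# so the file does not need to live on the PostgreSQL host.
COPY_SQL = """
    COPY products (sku, name, brand, category, price, rating, scraped_date)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

def load_csv_client_side(conn, csv_path):
    """Stream a local CSV into products over an open psycopg2 connection."""
    with open(csv_path, newline="") as f, conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, f)
    conn.commit()

# Usage (with an open psycopg2 connection):
# load_csv_client_side(conn, 'products.csv')
```

You keep most of COPY's speed while dropping the file-on-server requirement.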
Method 2: Using Python (Most Flexible)
import pandas as pd
from sqlalchemy import create_engine
# Read CSV
df = pd.read_csv('products.csv')
# Clean data (using our Pandas cleaning from previous blog!)
df['price'] = df['price'].apply(clean_price)
df['scraped_date'] = pd.to_datetime(df['scraped_date'])
# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Load data
df.to_sql(
'products',
engine,
if_exists='append', # 'replace' to overwrite, 'append' to add
index=False
)
print(f'Loaded {len(df)} products into PostgreSQL')
This approach:
- Lets you clean data first
- Handles type conversions
- Better error messages
- Can transform data before loading
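The clean_price helper above comes from the earlier cleaning post. If you don't have it handy, a minimal stand-in might look like this (a sketch; your cleaning rules may differ for other currencies or locales):

```python
import re

def clean_price(value):
    """Turn '$1,299.99' into 1299.99; return None for unparseable values."""
    if value is None or isinstance(value, float):
        return value  # already missing or already numeric
    digits = re.sub(r"[^\d.]", "", str(value))  # strip symbols and commas
    try:
        return float(digits)
    except ValueError:
        return None

print(clean_price("$1,299.99"))  # 1299.99
print(clean_price("N/A"))        # None
```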
Method 3: Batch Loading Multiple CSVs
import pandas as pd
from sqlalchemy import create_engine
import glob
# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Find all CSV files
csv_files = glob.glob('scraped_data/*.csv')
total_loaded = 0
for csv_file in csv_files:
print(f'Loading {csv_file}...')
# Read and clean
df = pd.read_csv(csv_file)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['scraped_date'] = pd.to_datetime(df['scraped_date'])
# Remove nulls in critical fields
df = df.dropna(subset=['sku', 'name', 'price'])
# Load to database
df.to_sql('products', engine, if_exists='append', index=False)
total_loaded += len(df)
print(f'Loaded {len(df)} products')
print(f'\nTotal: {total_loaded} products loaded into PostgreSQL')
All your CSVs are now in one database. Query anything instantly.
Scrapy Pipeline to PostgreSQL
Skip the CSV step entirely. Save directly from Scrapy to PostgreSQL.
Installing Required Libraries
pip install psycopg2-binary sqlalchemy
Creating the Pipeline
# pipelines.py
import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime
class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None
        self.items_buffer = []
        self.buffer_size = 100  # Insert every 100 items

    def open_spider(self, spider):
        """Connect to PostgreSQL when spider starts"""
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123',
            port=5432
        )
        self.cursor = self.connection.cursor()
        spider.logger.info('Connected to PostgreSQL')

    def close_spider(self, spider):
        """Insert remaining items and close connection"""
        if self.items_buffer:
            self._insert_items()
        self.cursor.close()
        self.connection.close()
        spider.logger.info('Disconnected from PostgreSQL')

    def process_item(self, item, spider):
        """Buffer items and insert in batches"""
        self.items_buffer.append(item)
        if len(self.items_buffer) >= self.buffer_size:
            self._insert_items()
        return item

    def _insert_items(self):
        """Insert buffered items into database"""
        if not self.items_buffer:
            return
        # Prepare data for insertion
        values = [
            (
                item.get('sku'),
                item.get('name'),
                item.get('brand'),
                item.get('category'),
                item.get('price'),
                item.get('original_price'),
                item.get('rating'),
                item.get('review_count'),
                item.get('in_stock', False),
                item.get('url'),
                datetime.now().date()
            )
            for item in self.items_buffer
        ]
        # Insert with ON CONFLICT (upsert)
        query = """
            INSERT INTO products (
                sku, name, brand, category, price, original_price,
                rating, review_count, in_stock, url, scraped_date
            ) VALUES %s
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                original_price = EXCLUDED.original_price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                in_stock = EXCLUDED.in_stock,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """
        execute_values(self.cursor, query, values)
        self.connection.commit()
        print(f'Inserted {len(self.items_buffer)} products')
        # Clear buffer
        self.items_buffer = []
Enable in settings.py
# settings.py
ITEM_PIPELINES = {
'myproject.pipelines.PostgresPipeline': 300,
}
How It Works
- Buffer items: Collect 100 items before inserting (faster than one-by-one)
- Batch insert: Insert all 100 at once
- Upsert logic: If the SKU exists, update it; if not, insert a new row
- Automatic dates: scraped_date is set to today automatically
Benefits:
- No CSV files created
- Data available immediately in database
- Handles duplicates automatically
- Much faster than individual inserts
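The buffering pattern is easier to see isolated from the class: collect items until the buffer is full, flush, repeat, and flush whatever is left at the end (which is what process_item and close_spider do between them).

```python
# The pipeline's buffering logic, reduced to a standalone generator.
def batched(items, size):
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= size:  # buffer full: flush a batch
            yield buffer
            buffer = []
    if buffer:  # leftover items, like close_spider's final flush
        yield buffer

batches = list(batched(range(250), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```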
Tracking Price Changes Over Time
The real power of databases: historical tracking.
Creating a Price History Table
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(sku, scraped_date) -- One price per product per day
);
CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);
Modified Pipeline for Price Tracking
# pipelines.py
import psycopg2
from datetime import datetime

class PriceTrackingPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert into products table (current state)
        self.cursor.execute("""
            INSERT INTO products (sku, name, price, scraped_date)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                price = EXCLUDED.price,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'],
            item['name'],
            item['price'],
            datetime.now().date()
        ))
        # Insert into price_history table (historical tracking)
        self.cursor.execute("""
            INSERT INTO price_history (sku, price, scraped_date)
            VALUES (%s, %s, %s)
            ON CONFLICT (sku, scraped_date) DO NOTHING
        """, (
            item['sku'],
            item['price'],
            datetime.now().date()
        ))
        self.connection.commit()
        return item
Querying Price Changes
-- Products with price changes in last 7 days
SELECT
p.sku,
p.name,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change,
((p.price - ph.price) / ph.price * 100) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
AND p.price != ph.price
ORDER BY percent_change DESC;
-- Price trend for a specific product
SELECT
scraped_date,
price
FROM price_history
WHERE sku = 'ABC123'
ORDER BY scraped_date DESC
LIMIT 30;
-- Products with biggest price drops this month
SELECT
sku,
MIN(price) as lowest_price,
MAX(price) as highest_price,
MAX(price) - MIN(price) as price_drop
FROM price_history
WHERE scraped_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY sku
HAVING MAX(price) - MIN(price) > 0
ORDER BY price_drop DESC
LIMIT 20;
This is impossible with CSV files. With PostgreSQL, it's one query.
Useful Queries for Scraped Data
Here are queries you'll use constantly.
Finding Duplicates
-- Products scraped multiple times on same day
SELECT sku, scraped_date, COUNT(*) as count
FROM products
GROUP BY sku, scraped_date
HAVING COUNT(*) > 1;
Latest Products
-- Most recently scraped products
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;
Missing Data
-- Products with missing prices
SELECT sku, name, scraped_date
FROM products
WHERE price IS NULL;
Category Statistics
-- Average price by category
SELECT
category,
COUNT(*) as product_count,
ROUND(AVG(price), 2) as avg_price,
ROUND(MIN(price), 2) as min_price,
ROUND(MAX(price), 2) as max_price
FROM products
WHERE price IS NOT NULL
GROUP BY category
ORDER BY avg_price DESC;
Date Range Queries
-- Products scraped in last 7 days
SELECT * FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '7 days';
-- Products scraped in March 2024
SELECT * FROM products
WHERE scraped_date >= '2024-03-01'
AND scraped_date < '2024-04-01';
Competitor Comparison
-- Compare prices across brands for same category
SELECT
brand,
COUNT(*) as products,
ROUND(AVG(price), 2) as avg_price
FROM products
WHERE category = 'Laptops'
GROUP BY brand
ORDER BY avg_price ASC;
Exporting Data from PostgreSQL
Sometimes you still need CSVs for sharing or analysis.
Export to CSV
# Using psql
psql -h localhost -U scraper -d scraped_data -c "COPY (SELECT * FROM products WHERE scraped_date = CURRENT_DATE) TO STDOUT WITH CSV HEADER" > products_today.csv
Export with Python
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Query and export
query = "SELECT * FROM products WHERE scraped_date = CURRENT_DATE"
df = pd.read_sql(query, engine)
df.to_csv('products_today.csv', index=False)
print(f'Exported {len(df)} products')
Export Price History Report
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Get price changes
query = """
SELECT
p.sku,
p.name,
p.category,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
AND p.price != ph.price
"""
df = pd.read_sql(query, engine)
df.to_excel('price_changes_report.xlsx', index=False)
print(f'Price changes report saved: {len(df)} products')
Backup and Maintenance
Databases need care.
Automated Backups
#!/bin/bash
# backup.sh
# Create backup directory
mkdir -p ~/postgres_backups
# Backup database
docker exec postgres_scraping pg_dump -U scraper scraped_data > ~/postgres_backups/backup_$(date +%Y%m%d).sql
# Keep only last 30 days of backups
find ~/postgres_backups -name "backup_*.sql" -mtime +30 -delete
echo "Backup complete: backup_$(date +%Y%m%d).sql"
# Make executable
chmod +x backup.sh
# Add to crontab (run daily at 2 AM)
crontab -e
# Add line:
0 2 * * * /path/to/backup.sh
Restoring from Backup
# Stop Scrapy spiders first!
# Restore database
docker exec -i postgres_scraping psql -U scraper scraped_data < ~/postgres_backups/backup_20240315.sql
Cleaning Old Data
-- Delete products older than 90 days
DELETE FROM products
WHERE scraped_date < CURRENT_DATE - INTERVAL '90 days';
-- Delete old price history (keep 1 year)
DELETE FROM price_history
WHERE scraped_date < CURRENT_DATE - INTERVAL '1 year';
-- Vacuum to reclaim space
VACUUM ANALYZE;
Monitoring Database Size
-- Database size
SELECT pg_size_pretty(pg_database_size('scraped_data'));
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Real Example: E-commerce Price Monitoring
Complete working example from scraping to querying.
1. Database Schema
-- Products table (current state)
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) UNIQUE NOT NULL,
name TEXT NOT NULL,
brand VARCHAR(200),
category VARCHAR(200),
price DECIMAL(10, 2),
rating DECIMAL(3, 2),
review_count INTEGER,
url TEXT,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Price history table (historical tracking)
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(sku, scraped_date)
);
-- Indexes
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);
2. Scrapy Spider
# spiders/products.py
import re

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example-shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'sku': product.css('.sku::text').get(),
                'name': product.css('.name::text').get(),
                'brand': product.css('.brand::text').get(),
                'category': product.css('.category::text').get(),
                'price': self.clean_price(product.css('.price::text').get()),
                'rating': self.clean_rating(product.css('.rating::text').get()),
                'review_count': self.clean_reviews(product.css('.reviews::text').get()),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }
        # Pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def clean_price(self, price_str):
        if not price_str:
            return None
        price = re.sub(r'[^\d.]', '', price_str)
        try:
            return float(price)
        except ValueError:
            return None

    def clean_rating(self, rating_str):
        if not rating_str:
            return None
        match = re.search(r'(\d+\.?\d*)', rating_str)
        return float(match.group(1)) if match else None

    def clean_reviews(self, reviews_str):
        if not reviews_str:
            return 0
        match = re.search(r'(\d+)', reviews_str)
        return int(match.group(1)) if match else 0
3. Pipeline
# pipelines.py
import psycopg2
from datetime import datetime
class PriceMonitoringPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        today = datetime.now().date()
        # Update products table (current state)
        self.cur.execute("""
            INSERT INTO products (
                sku, name, brand, category, price, rating,
                review_count, url, scraped_date
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'], item['name'], item['brand'], item['category'],
            item['price'], item['rating'], item['review_count'],
            item['url'], today
        ))
        # Insert into price history
        self.cur.execute("""
            INSERT INTO price_history (sku, price, scraped_date)
            VALUES (%s, %s, %s)
            ON CONFLICT (sku, scraped_date) DO NOTHING
        """, (item['sku'], item['price'], today))
        self.conn.commit()
        return item
4. Useful Queries
-- Daily price monitoring: Products with price changes today
SELECT
p.sku,
p.name,
p.category,
p.price as current_price,
ph.price as yesterday_price,
p.price - ph.price as change,
ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 1
AND p.price != ph.price
ORDER BY percent_change;
-- Weekly summary: Biggest price drops
-- (both tables have price and scraped_date columns, so qualify them)
SELECT
ph.sku,
p.name,
p.category,
MIN(ph.price) as lowest_price,
MAX(ph.price) as highest_price,
MAX(ph.price) - MIN(ph.price) as price_drop,
ROUND(((MAX(ph.price) - MIN(ph.price)) / MAX(ph.price) * 100), 2) as drop_percent
FROM price_history ph
JOIN products p ON p.sku = ph.sku
WHERE ph.scraped_date >= CURRENT_DATE - 7
GROUP BY ph.sku, p.name, p.category
HAVING MAX(ph.price) - MIN(ph.price) > 0
ORDER BY price_drop DESC
LIMIT 20;
-- Category trends: Average price movement by category
SELECT
category,
COUNT(DISTINCT sku) as products,
ROUND(AVG(price_change), 2) as avg_change,
COUNT(CASE WHEN price_change < 0 THEN 1 END) as prices_down,
COUNT(CASE WHEN price_change > 0 THEN 1 END) as prices_up
FROM (
SELECT
p.category,
p.sku,
p.price - ph.price as price_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 7
AND p.price != ph.price
) changes
GROUP BY category
ORDER BY avg_change;
5. Automated Report
# generate_report.py
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Get price changes
query = """
SELECT
p.sku,
p.name,
p.category,
p.brand,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change,
ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 1
AND p.price != ph.price
ORDER BY percent_change
"""
df = pd.read_sql(query, engine)
# Save report
filename = f'price_changes_{datetime.now().strftime("%Y%m%d")}.xlsx'
df.to_excel(filename, index=False)
print(f'Report generated: {filename}')
print(f'Products with price changes: {len(df)}')
print(f'Biggest drop: {df.iloc[0]["name"]} ({df.iloc[0]["percent_change"]}%)')
print(f'Biggest increase: {df.iloc[-1]["name"]} ({df.iloc[-1]["percent_change"]}%)')
# Schedule daily with cron:
# 0 9 * * * cd /path/to/project && python generate_report.py
PostgreSQL vs CSV: The Real Comparison
Let's be honest about the differences.
CSV Advantages
- Simple (just text files)
- Easy to inspect (open in Excel)
- No installation required
- Easy to share (email, cloud storage)
- Works with any tool
CSV Disadvantages
- No querying (need to load into memory)
- No relationships (can't join data)
- No indexing (slow searches)
- Duplicates everywhere
- File management nightmare
- Memory limits (can't load huge files)
- No concurrent access (one person at a time)
- No data validation (garbage in, garbage out)
PostgreSQL Advantages
- Instant queries (no loading needed)
- Handle billions of rows
- Complex filtering and aggregation
- Relationships between tables
- Automatic duplicate handling
- Multiple users simultaneously
- Data validation (constraints)
- Transactions (all or nothing)
- Backups and recovery
PostgreSQL Disadvantages
- Requires installation (though Docker helps)
- Learning curve (SQL syntax)
- More setup than df.to_csv()
Bottom line: Use CSV for one-time scrapes or sharing with non-technical people. Use PostgreSQL for everything else.
Summary
PostgreSQL transforms how you work with scraped data.
What you learned:
- Install PostgreSQL with Docker (5 minutes)
- Create tables for scraped data
- Load existing CSVs into PostgreSQL
- Build Scrapy pipelines that write directly to database
- Track price changes over time
- Query data instantly without loading files
- Export reports when needed
- Backup and maintain your database
The workflow:
Before: Scrape → CSV files → Manual Python scripts → Hours of work
After: Scrape → PostgreSQL → SQL queries → Seconds
When to migrate:
- You have more than 10 CSV files of the same data
- You need to query or filter data frequently
- You're tracking changes over time
- You're running out of memory with Pandas
- You need multiple people to access data
- You're building reports or dashboards
Getting started:
- Install Docker and PostgreSQL (one command)
- Create your first table (copy the SQL from this guide)
- Load one CSV file to test
- Write a simple query
- Never go back to CSV hell
PostgreSQL isn't complicated. It's actually simpler than managing 100 CSV files. You write a query. You get results. That's it.
Your scraped data deserves better than CSV files. Give it a real database.
Resources:
- PostgreSQL documentation: https://www.postgresql.org/docs/
- Docker PostgreSQL image: https://hub.docker.com/_/postgres
- psycopg2 documentation: https://www.psycopg.org/docs/
- SQLAlchemy documentation: https://docs.sqlalchemy.org/