You've been scraping for 3 months. You have 127 CSV files in a folder named "scraped_data".
products_march.csv, products_march_v2.csv, products_march_final.csv, products_march_final_REAL.csv, products_april_01.csv, products_april_02.csv...
You need to find when a specific product's price changed. You open 50 files manually. You grep through them. You write Python scripts to loop through CSVs. It takes 2 hours to answer one question.
Then your boss asks: "Which products had the biggest price drops this month?" You spend an entire afternoon writing more scripts. Pandas DataFrames everywhere. Memory errors on large files. Duplicate products across files. Inconsistent column names.
You realize something. CSV files are not a database. They're just text files with commas. The moment you need to query data, compare records, or track changes over time, CSV breaks down completely.
Time for a real database.
PostgreSQL solves this. One database. All your scraped data in tables. Query anything in seconds. Track price changes automatically. No more file sprawl. No more custom scripts for every question.
Let me show you how.
When CSV Files Stop Working
Let's be honest about when you need a database.
Scenario 1: Finding Historical Data
CSV approach:
# You have 90 CSV files, one per day
# Question: What was the price of product X on March 15?
import pandas as pd
import glob
files = glob.glob('products_*.csv')
for file in files:
    if 'march_15' in file:
        df = pd.read_csv(file)
        product = df[df['sku'] == 'ABC123']
        if not product.empty:
            print(product['price'].values[0])
        break
Takes 5 minutes to write. Fails if filename doesn't match pattern.
PostgreSQL approach:
SELECT price
FROM products
WHERE sku = 'ABC123'
AND scraped_date = '2024-03-15';
Takes 5 seconds to write. Always works.
Scenario 2: Tracking Price Changes
CSV approach:
# Compare today's prices with last week
import pandas as pd
today = pd.read_csv('products_2024_03_22.csv')
last_week = pd.read_csv('products_2024_03_15.csv')
# Merge on SKU
merged = today.merge(
    last_week,
    on='sku',
    suffixes=('_today', '_last_week')
)
# Find changes
changes = merged[merged['price_today'] != merged['price_last_week']]
Works for two files. Breaks when you need to check 90 days of data.
PostgreSQL approach:
SELECT
sku,
price,
LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as previous_price,
price - LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) as price_change
FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '90 days';
Handles any date range. One query.
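If window functions are new to you, here is what that LAG does, mirrored in plain Python on a few made-up rows: for each product, pair each price with the previous price for the same SKU.

```python
# What LAG(price) OVER (PARTITION BY sku ORDER BY scraped_date) computes,
# sketched in plain Python on hypothetical rows.
rows = [
    ("ABC123", "2024-03-14", 1399.99),
    ("ABC123", "2024-03-15", 1299.99),
    ("XYZ789", "2024-03-14", 49.99),
    ("XYZ789", "2024-03-15", 44.99),
]

def with_previous_price(rows):
    # Sort by (sku, date) so each product's rows are adjacent and ordered,
    # like PARTITION BY sku ORDER BY scraped_date.
    result = []
    prev = {}  # last seen price per sku
    for sku, date, price in sorted(rows, key=lambda r: (r[0], r[1])):
        previous = prev.get(sku)  # LAG: None on the first row of a partition
        change = price - previous if previous is not None else None
        result.append((sku, date, price, previous, change))
        prev[sku] = price
    return result

for row in with_previous_price(rows):
    print(row)
```

PostgreSQL does exactly this, but over millions of rows without loading them into memory.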
Scenario 3: Deduplication
CSV approach:
# Combine 30 CSV files and remove duplicates
import pandas as pd
import glob
dfs = []
for file in glob.glob('products_*.csv'):
    df = pd.read_csv(file)
    dfs.append(df)
combined = pd.concat(dfs, ignore_index=True)
unique = combined.drop_duplicates(subset=['sku'], keep='last')
Uses gigabytes of memory. Slow. Need to rerun every time.
PostgreSQL approach:
-- Duplicates automatically handled by primary key
-- Or query unique products anytime:
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;
Instant. No memory issues.
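What DISTINCT ON keeps is easy to mirror in plain Python on made-up rows: one row per SKU, the one with the newest scraped_date.

```python
# What SELECT DISTINCT ON (sku) ... ORDER BY sku, scraped_date DESC keeps:
# exactly one row per SKU, the most recently scraped one.
rows = [
    ("ABC123", "2024-03-14", 1399.99),
    ("ABC123", "2024-03-15", 1299.99),
    ("XYZ789", "2024-03-15", 44.99),
]

def latest_per_sku(rows):
    latest = {}
    # Iterate in date order, so later rows overwrite earlier ones per SKU
    for row in sorted(rows, key=lambda r: r[1]):
        latest[row[0]] = row
    return sorted(latest.values())

print(latest_per_sku(rows))
```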
Signs You Need a Database
You know you need PostgreSQL when:
- You have more than 10 CSV files of the same data
- You need to query data by date, product, or any field
- You're tracking changes over time
- You're comparing data across multiple scrapes
- You're running out of memory loading CSVs
- You spend more time managing files than analyzing data
- You need multiple people to access the data
- You're building reports or dashboards
If you checked 3 or more, stop reading and install PostgreSQL now.
Installing PostgreSQL (The Easy Way)
Skip the complicated installation guides. Use Docker.
Why Docker?
- Works on Windows, Mac, Linux identically
- No system-wide installation
- Easy to delete and start fresh
- Isolated from your main system
- Includes all tools you need
Install Docker
Mac:
# Download Docker Desktop from docker.com
# Or use homebrew:
brew install --cask docker
Windows:
# Download Docker Desktop from docker.com
# Make sure WSL2 is enabled
Linux:
# Ubuntu/Debian
sudo apt-get update
sudo apt-get install docker.io docker-compose
# Start Docker
sudo systemctl start docker
sudo systemctl enable docker
Start PostgreSQL Container
# Create a directory for data persistence
mkdir -p ~/postgres_data
# Run PostgreSQL container
docker run -d \
  --name postgres_scraping \
  -e POSTGRES_USER=scraper \
  -e POSTGRES_PASSWORD=scraping123 \
  -e POSTGRES_DB=scraped_data \
  -p 5432:5432 \
  -v ~/postgres_data:/var/lib/postgresql/data \
  postgres:15
# Check it's running
docker ps
PostgreSQL is now running on localhost:5432.
Connection details:
- Host: localhost
- Port: 5432
- Database: scraped_data
- User: scraper
- Password: scraping123
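These details combine into a single connection URL, the form SQLAlchemy's create_engine() takes later in this guide. A small helper makes the shape obvious (note: passwords containing special characters would need URL-escaping, e.g. with urllib.parse.quote_plus):

```python
# Build the PostgreSQL connection URL from its parts:
# postgresql://user:password@host:port/database
def pg_url(user, password, host, port, database):
    return f"postgresql://{user}:{password}@{host}:{port}/{database}"

url = pg_url("scraper", "scraping123", "localhost", 5432, "scraped_data")
print(url)
```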
Install psql (PostgreSQL Client)
# Mac
brew install postgresql
# Ubuntu/Debian
sudo apt-get install postgresql-client
# Windows
# Download from postgresql.org or use Docker
Connect to PostgreSQL
# Using psql
psql -h localhost -U scraper -d scraped_data
# You'll see:
# Password for user scraper: (enter: scraping123)
# scraped_data=#
You're in! PostgreSQL is ready.
Your First Table: Products
Let's create a table for scraped products.
Understanding Table Structure
A table has:
- Columns: Fields (name, price, rating)
- Rows: Individual records (one product)
- Data types: What kind of data (text, numbers, dates)
- Constraints: Rules (not null, unique, primary key)
Creating the Products Table
-- Connect to database
psql -h localhost -U scraper -d scraped_data
-- Create products table
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) UNIQUE NOT NULL,
name TEXT NOT NULL,
brand VARCHAR(200),
category VARCHAR(200),
price DECIMAL(10, 2),
original_price DECIMAL(10, 2),
rating DECIMAL(3, 2),
review_count INTEGER,
in_stock BOOLEAN,
url TEXT,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Add indexes for faster queries
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_scraped_date ON products(scraped_date);
CREATE INDEX idx_products_category ON products(category);
What each column means:
- id: Auto-incrementing unique identifier
- sku: Product SKU (stock keeping unit), unique
- name: Product name (can be long, use TEXT)
- brand: Brand name (limited length, VARCHAR)
- price: Current price (DECIMAL for exact money)
- scraped_date: When you scraped this (for tracking changes)
- created_at: When the row was added to the database
Column types:
- SERIAL: Auto-incrementing integer (1, 2, 3...)
- VARCHAR(n): Text with a maximum length
- TEXT: Text with no length limit
- DECIMAL(10,2): Exact decimal numbers (10 total digits, 2 after the decimal point)
- INTEGER: Whole numbers
- BOOLEAN: True/false
- DATE: Date without time (2024-03-15)
- TIMESTAMP: Date with time (2024-03-15 14:30:00)
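Why DECIMAL for prices rather than a float type? Binary floats cannot represent most decimal fractions exactly, which matters for money. A quick check in Python makes the point:

```python
from decimal import Decimal

# Binary floats accumulate representation error on decimal fractions;
# exact decimals (like PostgreSQL's DECIMAL column type) do not.
print(0.10 + 0.20)                        # 0.30000000000000004
print(0.10 + 0.20 == 0.30)                # False
print(Decimal("0.10") + Decimal("0.20"))  # 0.30
```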
Inserting Your First Product
INSERT INTO products (
sku, name, brand, category, price, original_price,
rating, review_count, in_stock, url, scraped_date
) VALUES (
'ABC123',
'Laptop Pro 15',
'TechBrand',
'Electronics',
1299.99,
1499.99,
4.5,
235,
true,
'https://example.com/products/laptop-pro-15',
'2024-03-15'
);
Querying Your Data
-- Get all products
SELECT * FROM products;
-- Get specific columns
SELECT name, price, rating FROM products;
-- Filter by category
SELECT * FROM products WHERE category = 'Electronics';
-- Find expensive products
SELECT * FROM products WHERE price > 1000 ORDER BY price DESC;
-- Count products by category
SELECT category, COUNT(*) as product_count
FROM products
GROUP BY category;
You just stored and queried data. No CSV files. No Pandas scripts.
Loading CSV Data Into PostgreSQL
You have existing CSV files. Let's migrate them.
Method 1: Using COPY Command (Fastest)
-- Assuming CSV has matching columns
COPY products (sku, name, brand, category, price, rating, scraped_date)
FROM '/path/to/products.csv'
WITH (FORMAT csv, HEADER true);
Advantages:
- Extremely fast (millions of rows per second)
- Built-in to PostgreSQL
Disadvantages:
- File must be on the PostgreSQL server
- Column names must match exactly
- Less flexible error handling
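If the CSV only exists on your machine, psycopg2 can stream it to the server with COPY ... FROM STDIN instead of a server-side file path. A sketch (the column list is illustrative; adjust it to match your file):

```python
# Client-side COPY: stream a local CSV to the server over the connection,
# so the file does not need to live on the PostgreSQL host.
COPY_SQL = """
    COPY products (sku, name, brand, category, price, rating, scraped_date)
    FROM STDIN WITH (FORMAT csv, HEADER true)
"""

def load_csv_client_side(conn, csv_path):
    """Stream a local CSV into products over an open psycopg2 connection."""
    with open(csv_path, newline="") as f, conn.cursor() as cur:
        cur.copy_expert(COPY_SQL, f)
    conn.commit()

# Usage (with an open psycopg2 connection):
# load_csv_client_side(conn, 'products.csv')
```

You keep most of COPY's speed while dropping the file-on-server requirement.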
Method 2: Using Python (Most Flexible)
import pandas as pd
from sqlalchemy import create_engine
# Read CSV
df = pd.read_csv('products.csv')
# Clean data (using our Pandas cleaning from previous blog!)
df['price'] = df['price'].apply(clean_price)
df['scraped_date'] = pd.to_datetime(df['scraped_date'])
# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Load data
df.to_sql(
'products',
engine,
if_exists='append', # 'replace' to overwrite, 'append' to add
index=False
)
print(f'Loaded {len(df)} products into PostgreSQL')
This approach:
- Lets you clean data first
- Handles type conversions
- Better error messages
- Can transform data before loading
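The clean_price helper above comes from the earlier cleaning post. If you don't have it handy, a minimal stand-in might look like this (a sketch; your cleaning rules may differ for other currencies or locales):

```python
import re

def clean_price(value):
    """Turn '$1,299.99' into 1299.99; return None for unparseable values."""
    if value is None or isinstance(value, float):
        return value  # already missing or already numeric
    digits = re.sub(r"[^\d.]", "", str(value))  # strip symbols and commas
    try:
        return float(digits)
    except ValueError:
        return None

print(clean_price("$1,299.99"))  # 1299.99
print(clean_price("N/A"))        # None
```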
Method 3: Batch Loading Multiple CSVs
import pandas as pd
from sqlalchemy import create_engine
import glob
# Connect to PostgreSQL
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Find all CSV files
csv_files = glob.glob('scraped_data/*.csv')
total_loaded = 0
for csv_file in csv_files:
print(f'Loading {csv_file}...')
# Read and clean
df = pd.read_csv(csv_file)
df['price'] = pd.to_numeric(df['price'], errors='coerce')
df['scraped_date'] = pd.to_datetime(df['scraped_date'])
# Remove nulls in critical fields
df = df.dropna(subset=['sku', 'name', 'price'])
# Load to database
df.to_sql('products', engine, if_exists='append', index=False)
total_loaded += len(df)
print(f'Loaded {len(df)} products')
print(f'\nTotal: {total_loaded} products loaded into PostgreSQL')
All your CSVs are now in one database. Query anything instantly.
Scrapy Pipeline to PostgreSQL
Skip the CSV step entirely. Save directly from Scrapy to PostgreSQL.
Installing Required Libraries
pip install psycopg2-binary sqlalchemy
Creating the Pipeline
# pipelines.py
import psycopg2
from psycopg2.extras import execute_values
from datetime import datetime
class PostgresPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None
        self.items_buffer = []
        self.buffer_size = 100  # Insert every 100 items

    def open_spider(self, spider):
        """Connect to PostgreSQL when spider starts"""
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123',
            port=5432
        )
        self.cursor = self.connection.cursor()
        spider.logger.info('Connected to PostgreSQL')

    def close_spider(self, spider):
        """Insert remaining items and close connection"""
        if self.items_buffer:
            self._insert_items()
        self.cursor.close()
        self.connection.close()
        spider.logger.info('Disconnected from PostgreSQL')

    def process_item(self, item, spider):
        """Buffer items and insert in batches"""
        self.items_buffer.append(item)
        if len(self.items_buffer) >= self.buffer_size:
            self._insert_items()
        return item

    def _insert_items(self):
        """Insert buffered items into database"""
        if not self.items_buffer:
            return
        # Prepare data for insertion
        values = [
            (
                item.get('sku'),
                item.get('name'),
                item.get('brand'),
                item.get('category'),
                item.get('price'),
                item.get('original_price'),
                item.get('rating'),
                item.get('review_count'),
                item.get('in_stock', False),
                item.get('url'),
                datetime.now().date()
            )
            for item in self.items_buffer
        ]
        # Insert with ON CONFLICT (upsert)
        query = """
            INSERT INTO products (
                sku, name, brand, category, price, original_price,
                rating, review_count, in_stock, url, scraped_date
            ) VALUES %s
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                original_price = EXCLUDED.original_price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                in_stock = EXCLUDED.in_stock,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """
        execute_values(self.cursor, query, values)
        self.connection.commit()
        print(f'Inserted {len(self.items_buffer)} products')
        # Clear buffer
        self.items_buffer = []
Enable in settings.py
# settings.py
ITEM_PIPELINES = {
'myproject.pipelines.PostgresPipeline': 300,
}
How It Works
- Buffer items: Collect 100 items before inserting (faster than one-by-one)
- Batch insert: Insert all 100 at once
- Upsert logic: If the SKU exists, update it; if not, insert a new row
- Automatic dates: scraped_date is set to today automatically
Benefits:
- No CSV files created
- Data available immediately in database
- Handles duplicates automatically
- Much faster than individual inserts
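The buffering pattern is easier to see isolated from the class: collect items until the buffer is full, flush, repeat, and flush whatever is left at the end (which is what process_item and close_spider do between them).

```python
# The pipeline's buffering logic, reduced to a standalone generator.
def batched(items, size):
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) >= size:  # buffer full: flush a batch
            yield buffer
            buffer = []
    if buffer:  # leftover items, like close_spider's final flush
        yield buffer

batches = list(batched(range(250), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```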
Tracking Price Changes Over Time
The real power of databases: historical tracking.
Creating a Price History Table
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(sku, scraped_date) -- One price per product per day
);
CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);
Modified Pipeline for Price Tracking
# pipelines.py
import psycopg2
from datetime import datetime

class PriceTrackingPipeline:
    def __init__(self):
        self.connection = None
        self.cursor = None

    def open_spider(self, spider):
        self.connection = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cursor = self.connection.cursor()

    def close_spider(self, spider):
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        # Insert into products table (current state)
        self.cursor.execute("""
            INSERT INTO products (sku, name, price, scraped_date)
            VALUES (%s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                price = EXCLUDED.price,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'],
            item['name'],
            item['price'],
            datetime.now().date()
        ))
        # Insert into price_history table (historical tracking)
        self.cursor.execute("""
            INSERT INTO price_history (sku, price, scraped_date)
            VALUES (%s, %s, %s)
            ON CONFLICT (sku, scraped_date) DO NOTHING
        """, (
            item['sku'],
            item['price'],
            datetime.now().date()
        ))
        self.connection.commit()
        return item
Querying Price Changes
-- Products with price changes in last 7 days
SELECT
p.sku,
p.name,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change,
((p.price - ph.price) / ph.price * 100) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
AND p.price != ph.price
ORDER BY percent_change DESC;
-- Price trend for a specific product
SELECT
scraped_date,
price
FROM price_history
WHERE sku = 'ABC123'
ORDER BY scraped_date DESC
LIMIT 30;
-- Products with biggest price drops this month
SELECT
sku,
MIN(price) as lowest_price,
MAX(price) as highest_price,
MAX(price) - MIN(price) as price_drop
FROM price_history
WHERE scraped_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY sku
HAVING MAX(price) - MIN(price) > 0
ORDER BY price_drop DESC
LIMIT 20;
This is impossible with CSV files. With PostgreSQL, it's one query.
Useful Queries for Scraped Data
Here are queries you'll use constantly.
Finding Duplicates
-- Products scraped multiple times on same day
SELECT sku, scraped_date, COUNT(*) as count
FROM products
GROUP BY sku, scraped_date
HAVING COUNT(*) > 1;
Latest Products
-- Most recently scraped products
SELECT DISTINCT ON (sku) *
FROM products
ORDER BY sku, scraped_date DESC;
Missing Data
-- Products with missing prices
SELECT sku, name, scraped_date
FROM products
WHERE price IS NULL;
Category Statistics
-- Average price by category
SELECT
category,
COUNT(*) as product_count,
ROUND(AVG(price), 2) as avg_price,
ROUND(MIN(price), 2) as min_price,
ROUND(MAX(price), 2) as max_price
FROM products
WHERE price IS NOT NULL
GROUP BY category
ORDER BY avg_price DESC;
Date Range Queries
-- Products scraped in last 7 days
SELECT * FROM products
WHERE scraped_date >= CURRENT_DATE - INTERVAL '7 days';
-- Products scraped in March 2024
SELECT * FROM products
WHERE scraped_date >= '2024-03-01'
AND scraped_date < '2024-04-01';
Competitor Comparison
-- Compare prices across brands for same category
SELECT
brand,
COUNT(*) as products,
ROUND(AVG(price), 2) as avg_price
FROM products
WHERE category = 'Laptops'
GROUP BY brand
ORDER BY avg_price ASC;
Exporting Data from PostgreSQL
Sometimes you still need CSVs for sharing or analysis.
Export to CSV
# Using psql
psql -h localhost -U scraper -d scraped_data -c "COPY (SELECT * FROM products WHERE scraped_date = CURRENT_DATE) TO STDOUT WITH CSV HEADER" > products_today.csv
Export with Python
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Query and export
query = "SELECT * FROM products WHERE scraped_date = CURRENT_DATE"
df = pd.read_sql(query, engine)
df.to_csv('products_today.csv', index=False)
print(f'Exported {len(df)} products')
Export Price History Report
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Get price changes
query = """
SELECT
p.sku,
p.name,
p.category,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE ph.scraped_date = CURRENT_DATE - INTERVAL '7 days'
AND p.price != ph.price
"""
df = pd.read_sql(query, engine)
df.to_excel('price_changes_report.xlsx', index=False)
print(f'Price changes report saved: {len(df)} products')
Backup and Maintenance
Databases need care.
Automated Backups
#!/bin/bash
# backup.sh
# Create backup directory
mkdir -p ~/postgres_backups
# Backup database
docker exec postgres_scraping pg_dump -U scraper scraped_data > ~/postgres_backups/backup_$(date +%Y%m%d).sql
# Keep only last 30 days of backups
find ~/postgres_backups -name "backup_*.sql" -mtime +30 -delete
echo "Backup complete: backup_$(date +%Y%m%d).sql"
# Make executable
chmod +x backup.sh
# Add to crontab (run daily at 2 AM)
crontab -e
# Add line:
0 2 * * * /path/to/backup.sh
Restoring from Backup
# Stop Scrapy spiders first!
# Restore database
docker exec -i postgres_scraping psql -U scraper scraped_data < ~/postgres_backups/backup_20240315.sql
Cleaning Old Data
-- Delete products older than 90 days
DELETE FROM products
WHERE scraped_date < CURRENT_DATE - INTERVAL '90 days';
-- Delete old price history (keep 1 year)
DELETE FROM price_history
WHERE scraped_date < CURRENT_DATE - INTERVAL '1 year';
-- Vacuum to reclaim space
VACUUM ANALYZE;
Monitoring Database Size
-- Database size
SELECT pg_size_pretty(pg_database_size('scraped_data'));
-- Table sizes
SELECT
schemaname,
tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
Real Example: E-commerce Price Monitoring
Complete working example from scraping to querying.
1. Database Schema
-- Products table (current state)
CREATE TABLE products (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) UNIQUE NOT NULL,
name TEXT NOT NULL,
brand VARCHAR(200),
category VARCHAR(200),
price DECIMAL(10, 2),
rating DECIMAL(3, 2),
review_count INTEGER,
url TEXT,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Price history table (historical tracking)
CREATE TABLE price_history (
id SERIAL PRIMARY KEY,
sku VARCHAR(100) NOT NULL,
price DECIMAL(10, 2) NOT NULL,
scraped_date DATE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
UNIQUE(sku, scraped_date)
);
-- Indexes
CREATE INDEX idx_products_sku ON products(sku);
CREATE INDEX idx_products_category ON products(category);
CREATE INDEX idx_price_history_sku ON price_history(sku);
CREATE INDEX idx_price_history_date ON price_history(scraped_date);
2. Scrapy Spider
# spiders/products.py
import re

import scrapy

class ProductsSpider(scrapy.Spider):
    name = 'products'
    start_urls = ['https://example-shop.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'sku': product.css('.sku::text').get(),
                'name': product.css('.name::text').get(),
                'brand': product.css('.brand::text').get(),
                'category': product.css('.category::text').get(),
                'price': self.clean_price(product.css('.price::text').get()),
                'rating': self.clean_rating(product.css('.rating::text').get()),
                'review_count': self.clean_reviews(product.css('.reviews::text').get()),
                'url': response.urljoin(product.css('a::attr(href)').get()),
            }
        # Pagination
        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)

    def clean_price(self, price_str):
        if not price_str:
            return None
        price = re.sub(r'[^\d.]', '', price_str)
        try:
            return float(price)
        except ValueError:
            return None

    def clean_rating(self, rating_str):
        if not rating_str:
            return None
        match = re.search(r'(\d+\.?\d*)', rating_str)
        return float(match.group(1)) if match else None

    def clean_reviews(self, reviews_str):
        if not reviews_str:
            return 0
        match = re.search(r'(\d+)', reviews_str)
        return int(match.group(1)) if match else 0
3. Pipeline
# pipelines.py
import psycopg2
from datetime import datetime
class PriceMonitoringPipeline:
    def open_spider(self, spider):
        self.conn = psycopg2.connect(
            host='localhost',
            database='scraped_data',
            user='scraper',
            password='scraping123'
        )
        self.cur = self.conn.cursor()

    def close_spider(self, spider):
        self.cur.close()
        self.conn.close()

    def process_item(self, item, spider):
        today = datetime.now().date()
        # Update products table (current state)
        self.cur.execute("""
            INSERT INTO products (
                sku, name, brand, category, price, rating,
                review_count, url, scraped_date
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (sku) DO UPDATE SET
                name = EXCLUDED.name,
                brand = EXCLUDED.brand,
                category = EXCLUDED.category,
                price = EXCLUDED.price,
                rating = EXCLUDED.rating,
                review_count = EXCLUDED.review_count,
                url = EXCLUDED.url,
                scraped_date = EXCLUDED.scraped_date
        """, (
            item['sku'], item['name'], item['brand'], item['category'],
            item['price'], item['rating'], item['review_count'],
            item['url'], today
        ))
        # Insert into price history
        self.cur.execute("""
            INSERT INTO price_history (sku, price, scraped_date)
            VALUES (%s, %s, %s)
            ON CONFLICT (sku, scraped_date) DO NOTHING
        """, (item['sku'], item['price'], today))
        self.conn.commit()
        return item
4. Useful Queries
-- Daily price monitoring: Products with price changes today
SELECT
p.sku,
p.name,
p.category,
p.price as current_price,
ph.price as yesterday_price,
p.price - ph.price as change,
ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 1
AND p.price != ph.price
ORDER BY percent_change;
-- Weekly summary: Biggest price drops
-- (both tables have price and scraped_date columns, so qualify them)
SELECT
ph.sku,
p.name,
p.category,
MIN(ph.price) as lowest_price,
MAX(ph.price) as highest_price,
MAX(ph.price) - MIN(ph.price) as price_drop,
ROUND(((MAX(ph.price) - MIN(ph.price)) / MAX(ph.price) * 100), 2) as drop_percent
FROM price_history ph
JOIN products p ON p.sku = ph.sku
WHERE ph.scraped_date >= CURRENT_DATE - 7
GROUP BY ph.sku, p.name, p.category
HAVING MAX(ph.price) - MIN(ph.price) > 0
ORDER BY price_drop DESC
LIMIT 20;
-- Category trends: Average price movement by category
SELECT
category,
COUNT(DISTINCT sku) as products,
ROUND(AVG(price_change), 2) as avg_change,
COUNT(CASE WHEN price_change < 0 THEN 1 END) as prices_down,
COUNT(CASE WHEN price_change > 0 THEN 1 END) as prices_up
FROM (
SELECT
p.category,
p.sku,
p.price - ph.price as price_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 7
AND p.price != ph.price
) changes
GROUP BY category
ORDER BY avg_change;
5. Automated Report
# generate_report.py
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime
engine = create_engine('postgresql://scraper:scraping123@localhost:5432/scraped_data')
# Get price changes
query = """
SELECT
p.sku,
p.name,
p.category,
p.brand,
p.price as current_price,
ph.price as previous_price,
p.price - ph.price as price_change,
ROUND(((p.price - ph.price) / ph.price * 100), 2) as percent_change
FROM products p
JOIN price_history ph ON p.sku = ph.sku
WHERE p.scraped_date = CURRENT_DATE
AND ph.scraped_date = CURRENT_DATE - 1
AND p.price != ph.price
ORDER BY percent_change
"""
df = pd.read_sql(query, engine)
# Save report
filename = f'price_changes_{datetime.now().strftime("%Y%m%d")}.xlsx'
df.to_excel(filename, index=False)
print(f'Report generated: {filename}')
print(f'Products with price changes: {len(df)}')
print(f'Biggest drop: {df.iloc[0]["name"]} ({df.iloc[0]["percent_change"]}%)')
print(f'Biggest increase: {df.iloc[-1]["name"]} ({df.iloc[-1]["percent_change"]}%)')
# Schedule daily with cron:
# 0 9 * * * cd /path/to/project && python generate_report.py
PostgreSQL vs CSV: The Real Comparison
Let's be honest about the differences.
CSV Advantages
- Simple (just text files)
- Easy to inspect (open in Excel)
- No installation required
- Easy to share (email, cloud storage)
- Works with any tool
CSV Disadvantages
- No querying (need to load into memory)
- No relationships (can't join data)
- No indexing (slow searches)
- Duplicates everywhere
- File management nightmare
- Memory limits (can't load huge files)
- No concurrent access (one person at a time)
- No data validation (garbage in, garbage out)
PostgreSQL Advantages
- Instant queries (no loading needed)
- Handle billions of rows
- Complex filtering and aggregation
- Relationships between tables
- Automatic duplicate handling
- Multiple users simultaneously
- Data validation (constraints)
- Transactions (all or nothing)
- Backups and recovery
PostgreSQL Disadvantages
- Requires installation (though Docker helps)
- Learning curve (SQL syntax)
- More setup than df.to_csv()
Bottom line: Use CSV for one-time scrapes or sharing with non-technical people. Use PostgreSQL for everything else.
Summary
PostgreSQL transforms how you work with scraped data.
What you learned:
- Install PostgreSQL with Docker (5 minutes)
- Create tables for scraped data
- Load existing CSVs into PostgreSQL
- Build Scrapy pipelines that write directly to database
- Track price changes over time
- Query data instantly without loading files
- Export reports when needed
- Backup and maintain your database
The workflow:
Before: Scrape → CSV files → Manual Python scripts → Hours of work
After: Scrape → PostgreSQL → SQL queries → Seconds
When to migrate:
- You have more than 10 CSV files of the same data
- You need to query or filter data frequently
- You're tracking changes over time
- You're running out of memory with Pandas
- You need multiple people to access data
- You're building reports or dashboards
Getting started:
- Install Docker and PostgreSQL (one command)
- Create your first table (copy the SQL from this guide)
- Load one CSV file to test
- Write a simple query
- Never go back to CSV hell
PostgreSQL isn't complicated. It's actually simpler than managing 100 CSV files. You write a query. You get results. That's it.
Your scraped data deserves better than CSV files. Give it a real database.
Resources:
- PostgreSQL documentation: https://www.postgresql.org/docs/
- Docker PostgreSQL image: https://hub.docker.com/_/postgres
- psycopg2 documentation: https://www.psycopg.org/docs/
- SQLAlchemy documentation: https://docs.sqlalchemy.org/