agenthustler

How to Store Web Scraped Data in 2026: PostgreSQL, MongoDB, CSV, and Cloud Storage

You've built your scraper and it's pulling data beautifully. Now what? Where you store that data determines whether your project scales or collapses under its own weight.

This guide covers the four most common storage approaches for scraped data — flat files, relational databases, document stores, and cloud storage — with practical code examples so you can pick the right one for your use case.

Quick Decision Matrix

| Storage | Best For | Scale | Setup |
| --- | --- | --- | --- |
| CSV/JSON files | Prototyping, small datasets (<100K rows) | Low | Zero |
| PostgreSQL | Structured data, deduplication, analytics | High | Medium |
| MongoDB | Semi-structured data, varying schemas | High | Medium |
| Cloud (S3/BigQuery) | Archival, massive datasets, team access | Very high | Higher |

1. Flat Files: CSV and JSON

Perfect for quick experiments. Don't underestimate simplicity.

import csv
import json

# CSV — great for tabular product data
products = [
    {"name": "Widget Pro", "price": 29.99, "url": "https://example.com/widget"},
    {"name": "Gadget X", "price": 49.99, "url": "https://example.com/gadget"},
]

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(products)

# JSON — better for nested/variable structures
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

When to move on from flat files:

  • You need deduplication (same product scraped twice)
  • Dataset exceeds ~100K rows
  • Multiple scrapers write concurrently
  • You need to query/filter data efficiently
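If you're near those limits but not past them, JSON Lines (one JSON object per line) buys some headroom before a real database: appends don't rewrite the whole file, and a crashed run corrupts at most the last line. A minimal sketch:

```python
import json

def append_jsonl(path: str, records: list[dict]) -> None:
    """Append one JSON object per line -- no full-file rewrite on each run."""
    with open(path, "a") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path: str) -> list[dict]:
    """Load everything back; each line is an independent JSON document."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

This also plays nicely with the cloud tools later in this post: BigQuery ingests newline-delimited JSON directly.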

2. PostgreSQL: The Workhorse

PostgreSQL handles 90% of scraping storage needs. It gives you deduplication via ON CONFLICT, indexing for fast lookups, and ACID transactions for concurrent scrapers.

Schema Design for Scraped Data

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    external_id VARCHAR(255) UNIQUE,  -- dedupe key
    name VARCHAR(500) NOT NULL,
    price DECIMAL(10, 2),
    currency VARCHAR(3) DEFAULT 'USD',
    source_url TEXT,
    raw_html TEXT,                     -- optional: store raw for re-parsing
    scraped_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_products_external_id ON products(external_id);
CREATE INDEX idx_products_scraped_at ON products(scraped_at);

Inserting with Deduplication (Upsert)

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/scraping_db")
cur = conn.cursor()

def save_product(product: dict):
    cur.execute("""
        INSERT INTO products (external_id, name, price, source_url)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (external_id) DO UPDATE SET
            name = EXCLUDED.name,
            price = EXCLUDED.price,
            updated_at = NOW()
        RETURNING id
    """, (
        product["external_id"],
        product["name"],
        product["price"],
        product["url"]
    ))
    conn.commit()
    return cur.fetchone()[0]

# Usage
save_product({
    "external_id": "AMZN-B09V3KXJPB",
    "name": "Wireless Mouse",
    "price": 24.99,
    "url": "https://example.com/mouse"
})

The ON CONFLICT ... DO UPDATE pattern is essential for scrapers. When you re-scrape the same product, it updates the existing row instead of creating duplicates.

Batch Inserts for Performance

When inserting thousands of rows, batch them:

from psycopg2.extras import execute_values

def save_products_batch(products: list[dict]):
    values = [
        (p["external_id"], p["name"], p["price"], p["url"])
        for p in products
    ]
    execute_values(cur, """
        INSERT INTO products (external_id, name, price, source_url)
        VALUES %s
        ON CONFLICT (external_id) DO UPDATE SET
            name = EXCLUDED.name,
            price = EXCLUDED.price,
            updated_at = NOW()
    """, values)
    conn.commit()

In practice this is often 10-50x faster than issuing individual INSERTs, since it cuts round trips to the database.
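For very large runs you may also want to chunk at the application level and commit per chunk, so one bad row doesn't roll back the entire scrape. A sketch of a generic chunking helper (`save_products_batch` is the function above; the error handling shown is just one option):

```python
from typing import Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage sketch: commit each chunk separately so a failure
# loses at most one chunk, not the whole run.
# for batch in chunked(all_products, 1000):
#     try:
#         save_products_batch(batch)
#     except Exception:
#         ...  # log and retry the failed batch
```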

3. MongoDB: Flexible Schemas

When your scraped data has varying structures — some products have reviews, some have specifications, some have neither — MongoDB shines.

from pymongo import MongoClient
from datetime import datetime, timezone

client = MongoClient("mongodb://localhost:27017")
db = client["scraping_db"]
products = db["products"]

# Create unique index for deduplication
products.create_index("external_id", unique=True)

def save_product(product: dict):
    # datetime.utcnow() is deprecated since Python 3.12 — use an aware UTC timestamp
    product["scraped_at"] = datetime.now(timezone.utc)
    products.update_one(
        {"external_id": product["external_id"]},
        {"$set": product, "$setOnInsert": {"first_seen": datetime.now(timezone.utc)}},
        upsert=True
    )

# Each document can have different fields
save_product({
    "external_id": "PROD-001",
    "name": "Laptop Stand",
    "price": 45.00,
    "specs": {"material": "aluminum", "weight": "2.1kg"},
    "reviews_count": 342
})

save_product({
    "external_id": "PROD-002",
    "name": "USB Hub",
    "price": 19.99,
    "variants": ["4-port", "7-port"],
    # no specs or reviews — and that's fine
})

MongoDB vs PostgreSQL for scraping:

  • MongoDB: faster to start, flexible schema, good for exploration
  • PostgreSQL: better for analytics, joins, data integrity, long-term storage

4. Cloud Storage: S3 and BigQuery

For large-scale scraping operations or when multiple team members need access.

AWS S3 for Raw Data Archival

import boto3
import json
from datetime import datetime, timezone

s3 = boto3.client("s3")

def archive_scrape_results(data: list[dict], source: str):
    # Date-partitioned keys make old scrapes easy to list and expire
    key = f"scrapes/{source}/{datetime.now(timezone.utc).strftime('%Y/%m/%d/%H%M%S')}.json"
    s3.put_object(
        Bucket="my-scraping-data",
        Key=key,
        Body=json.dumps(data),
        ContentType="application/json"
    )
    return key

Google BigQuery for Analytics

from google.cloud import bigquery

client = bigquery.Client()

# Load scraped data directly
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition="WRITE_APPEND"
)

with open("products.jsonl", "rb") as f:
    job = client.load_table_from_file(f, "project.dataset.products", job_config=job_config)
    job.result()  # wait for completion

Handling Common Scraping Storage Challenges

Price History Tracking

-- PostgreSQL: separate table for price history
CREATE TABLE price_history (
    id SERIAL PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),
    price DECIMAL(10, 2),
    recorded_at TIMESTAMP DEFAULT NOW()
);

-- Only insert when the price actually changes
-- (IS DISTINCT FROM also catches a previously NULL price, which != would miss)
INSERT INTO price_history (product_id, price)
SELECT p.id, 24.99
FROM products p
WHERE p.external_id = 'AMZN-B09V3KXJPB'
  AND p.price IS DISTINCT FROM 24.99;

Data Validation Before Storage

def validate_product(product: dict) -> bool:
    required = ["external_id", "name", "price"]
    if not all(k in product for k in required):
        return False
    if not isinstance(product["price"], (int, float)) or product["price"] < 0:
        return False
    if len(product["name"]) > 500:
        return False
    return True

Scaling Tips

  1. Use connection pooling — tools like PgBouncer prevent your scrapers from overwhelming the database.
  2. Partition large tables — split by date (e.g., products_2026_03) for faster queries and easier archival.
  3. Compress old data — move data older than 90 days to S3/GCS as compressed Parquet files.
  4. Monitor disk usage — scrapers generate data fast. Set up alerts before you run out of space.
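On tip 3: Parquet via pyarrow is the usual choice for cold storage, but if you want a stdlib-only starting point, gzipped JSON Lines gets most of the size win for typical scraped data. A sketch:

```python
import gzip
import json

def compress_records(records: list[dict]) -> bytes:
    """Serialize records as JSON Lines and gzip them for cold storage.

    Stdlib-only stand-in for Parquet: Parquet compresses better and is
    directly queryable by BigQuery/Athena, but needs pyarrow installed.
    """
    payload = "\n".join(json.dumps(r) for r in records).encode()
    return gzip.compress(payload)

def decompress_records(blob: bytes) -> list[dict]:
    """Inverse of compress_records."""
    lines = gzip.decompress(blob).decode().splitlines()
    return [json.loads(line) for line in lines]
```

The resulting bytes can be uploaded with the same `put_object` call shown in the S3 section, with a `.jsonl.gz` key suffix.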

Tools That Handle Storage For You

If you want to skip the infrastructure setup, scraping platforms like ScrapeOps and ThorData provide built-in data pipelines that handle storage, deduplication, and export automatically.

Conclusion

Start simple. CSV files for prototyping, PostgreSQL when you need reliability, MongoDB when schemas vary, cloud storage when you need scale or team access. The ON CONFLICT upsert pattern in PostgreSQL solves 80% of scraping storage headaches.

The best storage solution is the one you actually set up and use — don't over-engineer it on day one.
