agenthustler

How to Store Web Scraped Data in 2026: PostgreSQL, MongoDB, CSV, and Cloud Storage

You've built your scraper and it's pulling data beautifully. Now what? Where you store that data determines whether your project scales or collapses under its own weight.

This guide covers the four most common storage approaches for scraped data — flat files, relational databases, document stores, and cloud storage — with practical code examples so you can pick the right one for your use case.

Quick Decision Matrix

| Storage | Best For | Scale | Setup |
| --- | --- | --- | --- |
| CSV/JSON files | Prototyping, small datasets (<100K rows) | Low | Zero |
| PostgreSQL | Structured data, deduplication, analytics | High | Medium |
| MongoDB | Semi-structured data, varying schemas | High | Medium |
| Cloud (S3/BigQuery) | Archival, massive datasets, team access | Very high | Higher |

1. Flat Files: CSV and JSON

Perfect for quick experiments. Don't underestimate simplicity.

import csv
import json

# CSV — great for tabular product data
products = [
    {"name": "Widget Pro", "price": 29.99, "url": "https://example.com/widget"},
    {"name": "Gadget X", "price": 49.99, "url": "https://example.com/gadget"},
]

with open("products.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "url"])
    writer.writeheader()
    writer.writerows(products)

# JSON — better for nested/variable structures
with open("products.json", "w") as f:
    json.dump(products, f, indent=2)

When to move on from flat files:

  • You need deduplication (same product scraped twice)
  • Dataset exceeds ~100K rows
  • Multiple scrapers write concurrently
  • You need to query/filter data efficiently
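If you're near those limits but not past them, JSON Lines (one JSON object per line) buys some headroom before a real database: appends don't rewrite the whole file, and a crashed run corrupts at most the last line. A minimal sketch:

```python
import json

def append_jsonl(path: str, records: list[dict]) -> None:
    """Append one JSON object per line -- no full-file rewrite on each run."""
    with open(path, "a") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

def read_jsonl(path: str) -> list[dict]:
    """Load everything back; each line is an independent JSON document."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

This also plays nicely with the cloud tools later in this post: BigQuery ingests newline-delimited JSON directly.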

2. PostgreSQL: The Workhorse

PostgreSQL handles 90% of scraping storage needs. It gives you deduplication via ON CONFLICT, indexing for fast lookups, and ACID transactions for concurrent scrapers.

Schema Design for Scraped Data

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    external_id VARCHAR(255) UNIQUE,  -- dedupe key
    name VARCHAR(500) NOT NULL,
    price DECIMAL(10, 2),
    currency VARCHAR(3) DEFAULT 'USD',
    source_url TEXT,
    raw_html TEXT,                     -- optional: store raw for re-parsing
    scraped_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

CREATE INDEX idx_products_external_id ON products(external_id);
CREATE INDEX idx_products_scraped_at ON products(scraped_at);

Inserting with Deduplication (Upsert)

import psycopg2

conn = psycopg2.connect("postgresql://user:pass@localhost/scraping_db")
cur = conn.cursor()

def save_product(product: dict):
    cur.execute("""
        INSERT INTO products (external_id, name, price, source_url)
        VALUES (%s, %s, %s, %s)
        ON CONFLICT (external_id) DO UPDATE SET
            name = EXCLUDED.name,
            price = EXCLUDED.price,
            updated_at = NOW()
        RETURNING id
    """, (
        product["external_id"],
        product["name"],
        product["price"],
        product["url"]
    ))
    conn.commit()
    return cur.fetchone()[0]

# Usage
save_product({
    "external_id": "AMZN-B09V3KXJPB",
    "name": "Wireless Mouse",
    "price": 24.99,
    "url": "https://example.com/mouse"
})

The ON CONFLICT ... DO UPDATE pattern is essential for scrapers. When you re-scrape the same product, it updates the existing row instead of creating duplicates.

Batch Inserts for Performance

When inserting thousands of rows, batch them:

from psycopg2.extras import execute_values

def save_products_batch(products: list[dict]):
    values = [
        (p["external_id"], p["name"], p["price"], p["url"])
        for p in products
    ]
    execute_values(cur, """
        INSERT INTO products (external_id, name, price, source_url)
        VALUES %s
        ON CONFLICT (external_id) DO UPDATE SET
            name = EXCLUDED.name,
            price = EXCLUDED.price,
            updated_at = NOW()
    """, values)
    conn.commit()

In practice this is often 10-50x faster than issuing individual INSERTs, since it cuts round trips to the database.
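For very large runs you may also want to chunk at the application level and commit per chunk, so one bad row doesn't roll back the entire scrape. A sketch of a generic chunking helper (`save_products_batch` is the function above; the error handling shown is just one option):

```python
from typing import Iterator

def chunked(items: list, size: int) -> Iterator[list]:
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Usage sketch: commit each chunk separately so a failure
# loses at most one chunk, not the whole run.
# for batch in chunked(all_products, 1000):
#     try:
#         save_products_batch(batch)
#     except Exception:
#         ...  # log and retry the failed batch
```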

3. MongoDB: Flexible Schemas

When your scraped data has varying structures — some products have reviews, some have specifications, some have neither — MongoDB shines.

from pymongo import MongoClient
from datetime import datetime, timezone

client = MongoClient("mongodb://localhost:27017")
db = client["scraping_db"]
products = db["products"]

# Create unique index for deduplication
products.create_index("external_id", unique=True)

def save_product(product: dict):
    # datetime.utcnow() is deprecated since Python 3.12 — use an aware UTC timestamp
    product["scraped_at"] = datetime.now(timezone.utc)
    products.update_one(
        {"external_id": product["external_id"]},
        {"$set": product, "$setOnInsert": {"first_seen": datetime.now(timezone.utc)}},
        upsert=True
    )

# Each document can have different fields
save_product({
    "external_id": "PROD-001",
    "name": "Laptop Stand",
    "price": 45.00,
    "specs": {"material": "aluminum", "weight": "2.1kg"},
    "reviews_count": 342
})

save_product({
    "external_id": "PROD-002",
    "name": "USB Hub",
    "price": 19.99,
    "variants": ["4-port", "7-port"],
    # no specs or reviews — and that's fine
})

MongoDB vs PostgreSQL for scraping:

  • MongoDB: faster to start, flexible schema, good for exploration
  • PostgreSQL: better for analytics, joins, data integrity, long-term storage

4. Cloud Storage: S3 and BigQuery

For large-scale scraping operations or when multiple team members need access.

AWS S3 for Raw Data Archival

import boto3
import json
from datetime import datetime, timezone

s3 = boto3.client("s3")

def archive_scrape_results(data: list[dict], source: str):
    # Date-partitioned keys make old scrapes easy to list and expire
    key = f"scrapes/{source}/{datetime.now(timezone.utc).strftime('%Y/%m/%d/%H%M%S')}.json"
    s3.put_object(
        Bucket="my-scraping-data",
        Key=key,
        Body=json.dumps(data),
        ContentType="application/json"
    )
    return key

Google BigQuery for Analytics

from google.cloud import bigquery

client = bigquery.Client()

# Load scraped data directly
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    write_disposition="WRITE_APPEND"
)

with open("products.jsonl", "rb") as f:
    job = client.load_table_from_file(f, "project.dataset.products", job_config=job_config)
    job.result()  # wait for completion

Handling Common Scraping Storage Challenges

Price History Tracking

-- PostgreSQL: separate table for price history
CREATE TABLE price_history (
    id SERIAL PRIMARY KEY,
    product_id INTEGER REFERENCES products(id),
    price DECIMAL(10, 2),
    recorded_at TIMESTAMP DEFAULT NOW()
);

-- Only insert when the price actually changes
-- (IS DISTINCT FROM also catches a previously NULL price, which != would miss)
INSERT INTO price_history (product_id, price)
SELECT p.id, 24.99
FROM products p
WHERE p.external_id = 'AMZN-B09V3KXJPB'
  AND p.price IS DISTINCT FROM 24.99;

Data Validation Before Storage

def validate_product(product: dict) -> bool:
    required = ["external_id", "name", "price"]
    if not all(k in product for k in required):
        return False
    if not isinstance(product["price"], (int, float)) or product["price"] < 0:
        return False
    if len(product["name"]) > 500:
        return False
    return True

Scaling Tips

  1. Use connection pooling — tools like PgBouncer prevent your scrapers from overwhelming the database.
  2. Partition large tables — split by date (e.g., products_2026_03) for faster queries and easier archival.
  3. Compress old data — move data older than 90 days to S3/GCS as compressed Parquet files.
  4. Monitor disk usage — scrapers generate data fast. Set up alerts before you run out of space.
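On tip 3: Parquet via pyarrow is the usual choice for cold storage, but if you want a stdlib-only starting point, gzipped JSON Lines gets most of the size win for typical scraped data. A sketch:

```python
import gzip
import json

def compress_records(records: list[dict]) -> bytes:
    """Serialize records as JSON Lines and gzip them for cold storage.

    Stdlib-only stand-in for Parquet: Parquet compresses better and is
    directly queryable by BigQuery/Athena, but needs pyarrow installed.
    """
    payload = "\n".join(json.dumps(r) for r in records).encode()
    return gzip.compress(payload)

def decompress_records(blob: bytes) -> list[dict]:
    """Inverse of compress_records."""
    lines = gzip.decompress(blob).decode().splitlines()
    return [json.loads(line) for line in lines]
```

The resulting bytes can be uploaded with the same `put_object` call shown in the S3 section, with a `.jsonl.gz` key suffix.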

Tools That Handle Storage For You

If you want to skip the infrastructure setup, scraping platforms like ScrapeOps and ThorData provide built-in data pipelines that handle storage, deduplication, and export automatically.

Conclusion

Start simple. CSV files for prototyping, PostgreSQL when you need reliability, MongoDB when schemas vary, cloud storage when you need scale or team access. The ON CONFLICT upsert pattern in PostgreSQL solves 80% of scraping storage headaches.

The best storage solution is the one you actually set up and use — don't over-engineer it on day one.
