DEV Community

Alex Spinov

Why I Store All My Scraped Data in SQLite (Not JSON, Not CSV)

For 2 years I saved scraped data as JSON files. One file per run. Sometimes CSV.

Then my projects grew, and JSON became a nightmare:

  • 500 JSON files in a directory
  • No way to query across runs
  • Duplicate detection? Manual diffing
  • Data grew to 2GB+ and grep took minutes

I switched everything to SQLite. Here's why — and the exact pattern I use.


Why SQLite?

  1. It's a single file. Your entire database is data.db. Copy it, back it up, email it.
  2. It's built into Python. import sqlite3 — no install, no server, no Docker.
  3. SQL queries. Need prices from last week? WHERE scraped_at > '2026-03-19'. Try that with 500 JSON files.
  4. It handles millions of rows. SQLite comfortably handles 10M+ rows on a laptop.
  5. It's fast. Batched inserts run at roughly 100K rows/second; typical queries return in milliseconds.
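A quick taste of points 2 and 3, using nothing but the standard library (the table and column names here are made up for the demo, not the schema used later):

```python
import sqlite3

# Throwaway in-memory database for illustration; a real scraper passes a file path.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE items (
    url TEXT PRIMARY KEY,
    price REAL,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)''')
conn.execute("INSERT INTO items (url, price) VALUES ('https://example.com/a', 9.99)")

# "Prices from last week" is one WHERE clause, not a scan over 500 files:
recent = conn.execute(
    "SELECT url, price FROM items WHERE scraped_at > datetime('now', '-7 days')"
).fetchall()
print(len(recent))
```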

The Pattern I Use Everywhere

import sqlite3
import json
from datetime import datetime

class ScrapingDB:
    def __init__(self, db_path='data.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row
        self._create_tables()

    def _create_tables(self):
        self.conn.executescript('''
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                url TEXT NOT NULL,
                data JSON NOT NULL,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(source, url)
            );

            CREATE TABLE IF NOT EXISTS runs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                items_count INTEGER DEFAULT 0,
                started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                finished_at TIMESTAMP
            );

            CREATE INDEX IF NOT EXISTS idx_items_source 
                ON items(source);
            CREATE INDEX IF NOT EXISTS idx_items_scraped 
                ON items(scraped_at);
        ''')

    _UPSERT_SQL = '''
        INSERT INTO items (source, url, data)
        VALUES (?, ?, ?)
        ON CONFLICT(source, url)
        DO UPDATE SET data = excluded.data,
                      scraped_at = CURRENT_TIMESTAMP'''

    def upsert(self, source: str, url: str, data: dict):
        """Insert or update one item. Deduplication built-in via UNIQUE(source, url)."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(self._UPSERT_SQL, (source, url, json.dumps(data)))

    def save_batch(self, source: str, items: list[dict]):
        """Save multiple items in one transaction (one commit, not one per item)."""
        with self.conn:
            self.conn.executemany(
                self._UPSERT_SQL,
                [(source, item.get('url', ''), json.dumps(item)) for item in items],
            )
        return len(items)

    def query(self, sql: str, params=()) -> list[dict]:
        """Run any SQL query."""
        rows = self.conn.execute(sql, params).fetchall()
        return [dict(row) for row in rows]

    def export_json(self, source: str, output='export.json'):
        """Export source data as JSON."""
        items = self.query(
            'SELECT data FROM items WHERE source = ? ORDER BY scraped_at DESC',
            (source,)
        )
        data = [json.loads(item['data']) for item in items]
        with open(output, 'w') as f:
            json.dump(data, f, indent=2)
        return len(data)

Usage Examples

Save scraped data

db = ScrapingDB()

# Single item
db.upsert('hackernews', 'https://news.ycombinator.com/item?id=123', {
    'title': 'Show HN: My Project',
    'score': 142,
    'comments': 37,
})

# Batch (one transaction; much faster than per-item commits)
db.save_batch('reddit', [
    {'url': 'https://reddit.com/r/python/123', 'title': 'TIL about SQLite', 'score': 847},
    {'url': 'https://reddit.com/r/python/456', 'title': 'New Python release', 'score': 1203},
])
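Why the batch path is so much faster: the cost is the commit, not the insert, so one transaction per batch amortizes it. A rough self-contained sketch (URLs and row count are arbitrary; against an on-disk file the gap is far larger than in memory):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(':memory:')  # use a file path to see the real fsync cost
conn.execute('CREATE TABLE items (url TEXT PRIMARY KEY, data TEXT)')
rows = [(f'https://example.com/{i}', json.dumps({'n': i})) for i in range(10_000)]

# One transaction for the whole batch: one commit instead of 10,000.
start = time.perf_counter()
with conn:
    conn.executemany('INSERT OR REPLACE INTO items VALUES (?, ?)', rows)
elapsed = time.perf_counter() - start

count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(f'{count} rows in {elapsed:.3f}s')
```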

Query across runs

# All HN posts with 100+ points from last week
hot_posts = db.query('''
    SELECT json_extract(data, '$.title') as title,
           json_extract(data, '$.score') as score
    FROM items 
    WHERE source = 'hackernews' 
      AND json_extract(data, '$.score') > 100
      AND scraped_at > datetime('now', '-7 days')
    ORDER BY score DESC
    LIMIT 20
''')

for post in hot_posts:
    print(f"{post['score']}pts: {post['title']}")
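One caveat: json_extract in a WHERE clause re-parses the JSON for every row it checks. SQLite can index expressions, so the extracted field can be indexed directly. A sketch (assumes the JSON1 functions and expression indexes, both present in any modern Python build):

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (source TEXT, url TEXT, data TEXT)')
conn.execute(
    "INSERT INTO items VALUES ('hackernews', 'https://example.com/1', ?)",
    (json.dumps({'score': 142}),),
)

# Index the extracted field; the query must use the exact same expression.
conn.execute("CREATE INDEX idx_items_score ON items(json_extract(data, '$.score'))")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT url FROM items WHERE json_extract(data, '$.score') > 100"
).fetchall()
# Expect a SEARCH using idx_items_score instead of a full table SCAN.
```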

Detect new items

# Find items that appeared in the last 24 hours
new_items = db.query('''
    SELECT url, json_extract(data, '$.title') as title
    FROM items
    WHERE source = 'reddit'
      AND scraped_at > datetime('now', '-1 day')
    ORDER BY scraped_at DESC
''')
print(f"{len(new_items)} new items since yesterday")
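The runs table from the schema fits in here: open a row when a scrape starts, close it with the item count when it finishes. A minimal sketch with placeholder values:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    items_count INTEGER DEFAULT 0,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    finished_at TIMESTAMP
)''')

# Open a run, scrape, then close it with the item count (42 is a placeholder).
run_id = conn.execute("INSERT INTO runs (source) VALUES ('reddit')").lastrowid
# ... scraping happens here ...
conn.execute(
    'UPDATE runs SET items_count = ?, finished_at = CURRENT_TIMESTAMP WHERE id = ?',
    (42, run_id),
)
row = conn.execute('SELECT items_count FROM runs WHERE id = ?', (run_id,)).fetchone()
print(row[0])
```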

SQLite vs JSON vs CSV

| Feature | SQLite | JSON files | CSV |
| --- | --- | --- | --- |
| Query speed (1M rows) | Milliseconds | Minutes | Minutes |
| Deduplication | Built-in (UNIQUE) | Manual | Manual |
| Schema enforcement | Optional | None | Fragile |
| Multi-run tracking | Easy (timestamps) | Separate files | Separate files |
| Export to JSON/CSV | 1 query | Already JSON | Already CSV |
| File management | 1 file | 100s of files | 100s of files |
| Dependencies | None (built-in) | None | None |
| Concurrent writes | Serialized (one writer at a time) | Race conditions | Race conditions |
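That "1 query" export row is not an exaggeration. export_json is in the class above, and a CSV twin is about as short. A sketch, where export_csv and the fields tuple are my own names, not part of the class:

```python
import csv
import json

def export_csv(conn, source, output='export.csv', fields=('title', 'score')):
    """Flatten chosen JSON fields for one source into a CSV file."""
    rows = conn.execute(
        'SELECT url, data FROM items WHERE source = ? ORDER BY scraped_at DESC',
        (source,),
    ).fetchall()
    with open(output, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('url', *fields))
        for url, data in rows:
            item = json.loads(data)  # missing fields export as empty cells
            writer.writerow((url, *(item.get(k, '') for k in fields)))
    return len(rows)
```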

When NOT to Use SQLite

  • Multiple servers writing simultaneously — use PostgreSQL
  • You need a REST API — use Supabase or PocketBase
  • Data exceeds 10GB — consider DuckDB for analytics
  • Team collaboration — use a proper database server

For single-machine scraping? SQLite is perfect.
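One setting worth knowing for the single-machine case: WAL mode lets a reader (a dashboard, a notebook) query the file while the scraper is writing. It persists in the database file, so setting it once is enough. A sketch, assuming a file-backed database:

```python
import sqlite3

conn = sqlite3.connect('data.db')
# WAL journal mode: readers no longer block on the writer, and vice versa.
mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
print(mode)  # 'wal' for a file-backed database
```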


How do you store your scraped data? JSON files? PostgreSQL? Something else? I'm curious what works for different scales 👇
