DEV Community

Alex Spinov

Why I Store All My Scraped Data in SQLite (Not JSON, Not CSV)

For 2 years I saved scraped data as JSON files. One file per run. Sometimes CSV.

Then my projects grew, and JSON became a nightmare:

  • 500 JSON files in a directory
  • No way to query across runs
  • Duplicate detection? Manual diffing
  • Data grew to 2GB+ and grep took minutes

I switched everything to SQLite. Here's why — and the exact pattern I use.


Why SQLite?

  1. It's a single file. Your entire database is data.db. Copy it, back it up, email it.
  2. It's built into Python. import sqlite3 — no install, no server, no Docker.
  3. SQL queries. Need prices from last week? WHERE scraped_at > '2026-03-19'. Try that with 500 JSON files.
  4. It handles millions of rows. SQLite comfortably handles 10M+ rows on a laptop.
  5. It's fast. Batched inserts run at roughly 100K rows/second; typical queries return in milliseconds.
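A quick taste of points 2 and 3, using nothing but the standard library (the table and column names here are made up for the demo, not the schema used later):

```python
import sqlite3

# Throwaway in-memory database for illustration; a real scraper passes a file path.
conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE items (
    url TEXT PRIMARY KEY,
    price REAL,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
)''')
conn.execute("INSERT INTO items (url, price) VALUES ('https://example.com/a', 9.99)")

# "Prices from last week" is one WHERE clause, not a scan over 500 files:
recent = conn.execute(
    "SELECT url, price FROM items WHERE scraped_at > datetime('now', '-7 days')"
).fetchall()
print(len(recent))
```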

The Pattern I Use Everywhere

import sqlite3
import json
from datetime import datetime

class ScrapingDB:
    def __init__(self, db_path='data.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row
        self._create_tables()

    def _create_tables(self):
        self.conn.executescript('''
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                url TEXT NOT NULL,
                data JSON NOT NULL,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(source, url)
            );

            CREATE TABLE IF NOT EXISTS runs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                items_count INTEGER DEFAULT 0,
                started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                finished_at TIMESTAMP
            );

            CREATE INDEX IF NOT EXISTS idx_items_source 
                ON items(source);
            CREATE INDEX IF NOT EXISTS idx_items_scraped 
                ON items(scraped_at);
        ''')

    _UPSERT_SQL = '''
        INSERT INTO items (source, url, data)
        VALUES (?, ?, ?)
        ON CONFLICT(source, url)
        DO UPDATE SET data = excluded.data,
                      scraped_at = CURRENT_TIMESTAMP'''

    def upsert(self, source: str, url: str, data: dict):
        """Insert or update one item. Deduplication built-in via UNIQUE(source, url)."""
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(self._UPSERT_SQL, (source, url, json.dumps(data)))

    def save_batch(self, source: str, items: list[dict]):
        """Save multiple items in one transaction (one commit, not one per item)."""
        with self.conn:
            self.conn.executemany(
                self._UPSERT_SQL,
                [(source, item.get('url', ''), json.dumps(item)) for item in items],
            )
        return len(items)

    def query(self, sql: str, params=()) -> list[dict]:
        """Run any SQL query."""
        rows = self.conn.execute(sql, params).fetchall()
        return [dict(row) for row in rows]

    def export_json(self, source: str, output='export.json'):
        """Export source data as JSON."""
        items = self.query(
            'SELECT data FROM items WHERE source = ? ORDER BY scraped_at DESC',
            (source,)
        )
        data = [json.loads(item['data']) for item in items]
        with open(output, 'w') as f:
            json.dump(data, f, indent=2)
        return len(data)

Usage Examples

Save scraped data

db = ScrapingDB()

# Single item
db.upsert('hackernews', 'https://news.ycombinator.com/item?id=123', {
    'title': 'Show HN: My Project',
    'score': 142,
    'comments': 37,
})

# Batch (one transaction; much faster than per-item commits)
db.save_batch('reddit', [
    {'url': 'https://reddit.com/r/python/123', 'title': 'TIL about SQLite', 'score': 847},
    {'url': 'https://reddit.com/r/python/456', 'title': 'New Python release', 'score': 1203},
])
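Why the batch path is so much faster: the cost is the commit, not the insert, so one transaction per batch amortizes it. A rough self-contained sketch (URLs and row count are arbitrary; against an on-disk file the gap is far larger than in memory):

```python
import json
import sqlite3
import time

conn = sqlite3.connect(':memory:')  # use a file path to see the real fsync cost
conn.execute('CREATE TABLE items (url TEXT PRIMARY KEY, data TEXT)')
rows = [(f'https://example.com/{i}', json.dumps({'n': i})) for i in range(10_000)]

# One transaction for the whole batch: one commit instead of 10,000.
start = time.perf_counter()
with conn:
    conn.executemany('INSERT OR REPLACE INTO items VALUES (?, ?)', rows)
elapsed = time.perf_counter() - start

count = conn.execute('SELECT COUNT(*) FROM items').fetchone()[0]
print(f'{count} rows in {elapsed:.3f}s')
```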

Query across runs

# All HN posts with 100+ points from last week
hot_posts = db.query('''
    SELECT json_extract(data, '$.title') as title,
           json_extract(data, '$.score') as score
    FROM items 
    WHERE source = 'hackernews' 
      AND json_extract(data, '$.score') > 100
      AND scraped_at > datetime('now', '-7 days')
    ORDER BY score DESC
    LIMIT 20
''')

for post in hot_posts:
    print(f"{post['score']}pts: {post['title']}")
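One caveat: json_extract in a WHERE clause re-parses the JSON for every row it checks. SQLite can index expressions, so the extracted field can be indexed directly. A sketch (assumes the JSON1 functions and expression indexes, both present in any modern Python build):

```python
import json
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE items (source TEXT, url TEXT, data TEXT)')
conn.execute(
    "INSERT INTO items VALUES ('hackernews', 'https://example.com/1', ?)",
    (json.dumps({'score': 142}),),
)

# Index the extracted field; the query must use the exact same expression.
conn.execute("CREATE INDEX idx_items_score ON items(json_extract(data, '$.score'))")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT url FROM items WHERE json_extract(data, '$.score') > 100"
).fetchall()
# Expect a SEARCH using idx_items_score instead of a full table SCAN.
```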

Detect new items

# Find items that appeared in the last 24 hours
new_items = db.query('''
    SELECT url, json_extract(data, '$.title') as title
    FROM items
    WHERE source = 'reddit'
      AND scraped_at > datetime('now', '-1 day')
    ORDER BY scraped_at DESC
''')
print(f"{len(new_items)} new items since yesterday")
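The runs table from the schema fits in here: open a row when a scrape starts, close it with the item count when it finishes. A minimal sketch with placeholder values:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE runs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source TEXT NOT NULL,
    items_count INTEGER DEFAULT 0,
    started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    finished_at TIMESTAMP
)''')

# Open a run, scrape, then close it with the item count (42 is a placeholder).
run_id = conn.execute("INSERT INTO runs (source) VALUES ('reddit')").lastrowid
# ... scraping happens here ...
conn.execute(
    'UPDATE runs SET items_count = ?, finished_at = CURRENT_TIMESTAMP WHERE id = ?',
    (42, run_id),
)
row = conn.execute('SELECT items_count FROM runs WHERE id = ?', (run_id,)).fetchone()
print(row[0])
```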

SQLite vs JSON vs CSV

| Feature | SQLite | JSON files | CSV |
| --- | --- | --- | --- |
| Query speed (1M rows) | Milliseconds | Minutes | Minutes |
| Deduplication | Built-in (UNIQUE) | Manual | Manual |
| Schema enforcement | Optional | None | Fragile |
| Multi-run tracking | Easy (timestamps) | Separate files | Separate files |
| Export to JSON/CSV | 1 query | Already JSON | Already CSV |
| File management | 1 file | 100s of files | 100s of files |
| Dependencies | None (built-in) | None | None |
| Concurrent writes | Serialized (one writer at a time) | Race conditions | Race conditions |
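That "1 query" export row is not an exaggeration. export_json is in the class above, and a CSV twin is about as short. A sketch, where export_csv and the fields tuple are my own names, not part of the class:

```python
import csv
import json

def export_csv(conn, source, output='export.csv', fields=('title', 'score')):
    """Flatten chosen JSON fields for one source into a CSV file."""
    rows = conn.execute(
        'SELECT url, data FROM items WHERE source = ? ORDER BY scraped_at DESC',
        (source,),
    ).fetchall()
    with open(output, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(('url', *fields))
        for url, data in rows:
            item = json.loads(data)  # missing fields export as empty cells
            writer.writerow((url, *(item.get(k, '') for k in fields)))
    return len(rows)
```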

When NOT to Use SQLite

  • Multiple servers writing simultaneously — use PostgreSQL
  • You need a REST API — use Supabase or PocketBase
  • Data exceeds 10GB — consider DuckDB for analytics
  • Team collaboration — use a proper database server

For single-machine scraping? SQLite is perfect.
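One setting worth knowing for the single-machine case: WAL mode lets a reader (a dashboard, a notebook) query the file while the scraper is writing. It persists in the database file, so setting it once is enough. A sketch, assuming a file-backed database:

```python
import sqlite3

conn = sqlite3.connect('data.db')
# WAL journal mode: readers no longer block on the writer, and vice versa.
mode = conn.execute('PRAGMA journal_mode=WAL').fetchone()[0]
print(mode)  # 'wal' for a file-backed database
```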


How do you store your scraped data? JSON files? PostgreSQL? Something else? I'm curious what works for different scales 👇
