For 2 years I saved scraped data as JSON files. One file per run. Sometimes CSV.
Then my projects grew, and JSON became a nightmare:
- 500 JSON files in a directory
- No way to query across runs
- Duplicate detection? Manual diffing
- Data grew to 2GB+ and grep took minutes
I switched everything to SQLite. Here's why — and the exact pattern I use.
## Why SQLite?

- It's a single file. Your entire database is `data.db`. Copy it, back it up, email it.
- It's built into Python. `import sqlite3`: no install, no server, no Docker.
- SQL queries. Need prices from last week? `WHERE scraped_at > '2026-03-19'`. Try that with 500 JSON files.
- It handles millions of rows. SQLite comfortably handles 10M+ rows on a laptop.
- It's fast. Batched inserts can exceed 100K rows/second; most queries return in milliseconds.
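To see how little setup "built into Python" means, here's a minimal sketch of the last-week query from the list above (table name, columns, and dates are invented for the demo):

```python
import sqlite3

# In-memory database just for the demo; pass a file path in practice.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (item TEXT, price REAL, scraped_at TEXT)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [("widget", 9.99, "2026-03-20"), ("widget", 8.99, "2026-03-10")],
)

# ISO-8601 date strings sort lexically, so plain comparison works.
recent = conn.execute(
    "SELECT item, price FROM prices WHERE scraped_at > '2026-03-19'"
).fetchall()
print(recent)  # [('widget', 9.99)]
```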
## The Pattern I Use Everywhere
```python
import json
import sqlite3


class ScrapingDB:
    def __init__(self, db_path='data.db'):
        self.conn = sqlite3.connect(db_path)
        self.conn.row_factory = sqlite3.Row
        self._create_tables()

    def _create_tables(self):
        self.conn.executescript('''
            CREATE TABLE IF NOT EXISTS items (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                url TEXT NOT NULL,
                data JSON NOT NULL,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                UNIQUE(source, url)
            );

            CREATE TABLE IF NOT EXISTS runs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                source TEXT NOT NULL,
                items_count INTEGER DEFAULT 0,
                started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
                finished_at TIMESTAMP
            );

            CREATE INDEX IF NOT EXISTS idx_items_source
                ON items(source);
            CREATE INDEX IF NOT EXISTS idx_items_scraped
                ON items(scraped_at);
        ''')

    def _upsert(self, source: str, url: str, data: dict):
        # excluded.* refers to the row that hit the UNIQUE constraint,
        # so the JSON payload only needs to be bound once.
        self.conn.execute(
            '''INSERT INTO items (source, url, data)
               VALUES (?, ?, ?)
               ON CONFLICT(source, url)
               DO UPDATE SET data = excluded.data,
                             scraped_at = CURRENT_TIMESTAMP''',
            (source, url, json.dumps(data))
        )

    def upsert(self, source: str, url: str, data: dict):
        """Insert or update a single item. Deduplication built-in."""
        self._upsert(source, url, data)
        self.conn.commit()

    def save_batch(self, source: str, items: list[dict]):
        """Save multiple items in one transaction."""
        # sqlite3 opens a transaction implicitly on the first INSERT;
        # committing once at the end is what makes batches fast.
        for item in items:
            self._upsert(source, item.get('url', ''), item)
        self.conn.commit()
        return len(items)

    def query(self, sql: str, params=()) -> list[dict]:
        """Run any SQL query and return rows as dicts."""
        rows = self.conn.execute(sql, params).fetchall()
        return [dict(row) for row in rows]

    def export_json(self, source: str, output='export.json'):
        """Export a source's data as a JSON file."""
        items = self.query(
            'SELECT data FROM items WHERE source = ? ORDER BY scraped_at DESC',
            (source,)
        )
        data = [json.loads(item['data']) for item in items]
        with open(output, 'w') as f:
            json.dump(data, f, indent=2)
        return len(data)
```
## Usage Examples

### Save scraped data

```python
db = ScrapingDB()

# Single item
db.upsert('hackernews', 'https://news.ycombinator.com/item?id=123', {
    'title': 'Show HN: My Project',
    'score': 142,
    'comments': 37,
})

# Batch: one transaction, dramatically faster than per-item commits
db.save_batch('reddit', [
    {'url': 'https://reddit.com/r/python/123', 'title': 'TIL about SQLite', 'score': 847},
    {'url': 'https://reddit.com/r/python/456', 'title': 'New Python release', 'score': 1203},
])
```
### Query across runs

```python
# All HN posts with 100+ points from the last week
hot_posts = db.query('''
    SELECT json_extract(data, '$.title') AS title,
           json_extract(data, '$.score') AS score
    FROM items
    WHERE source = 'hackernews'
      AND json_extract(data, '$.score') > 100
      AND scraped_at > datetime('now', '-7 days')
    ORDER BY score DESC
    LIMIT 20
''')

for post in hot_posts:
    print(f"{post['score']} pts: {post['title']}")
```
### Detect recent items

```python
# Items scraped in the last 24 hours. Note: upsert refreshes scraped_at,
# so this captures recently re-scraped items too, not only first sightings.
new_items = db.query('''
    SELECT url, json_extract(data, '$.title') AS title
    FROM items
    WHERE source = 'reddit'
      AND scraped_at > datetime('now', '-1 day')
    ORDER BY scraped_at DESC
''')

print(f"{len(new_items)} items scraped since yesterday")
```
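The schema above defines a `runs` table, but the class never writes to it. A minimal sketch of how run tracking could work; `start_run` and `finish_run` are my own helper names, not part of the original pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript('''
    CREATE TABLE IF NOT EXISTS runs (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        source TEXT NOT NULL,
        items_count INTEGER DEFAULT 0,
        started_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        finished_at TIMESTAMP
    );
''')

def start_run(conn, source):
    """Record the start of a scrape run; returns the run id."""
    with conn:  # the context manager commits on success
        cur = conn.execute("INSERT INTO runs (source) VALUES (?)", (source,))
    return cur.lastrowid

def finish_run(conn, run_id, items_count):
    """Stamp the finish time and item count on a run."""
    with conn:
        conn.execute(
            "UPDATE runs SET items_count = ?, finished_at = CURRENT_TIMESTAMP "
            "WHERE id = ?",
            (items_count, run_id),
        )

run_id = start_run(conn, "hackernews")
# ... scrape, then db.save_batch(...) here ...
finish_run(conn, run_id, items_count=42)
```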
## SQLite vs JSON vs CSV
| Feature | SQLite | JSON files | CSV |
|---|---|---|---|
| Query speed (1M rows) | Milliseconds | Minutes | Minutes |
| Deduplication | Built-in (UNIQUE) | Manual | Manual |
| Schema enforcement | Optional | None | Fragile |
| Multi-run tracking | Easy (timestamps) | Separate files | Separate files |
| Export to JSON/CSV | 1 query | Already JSON | Already CSV |
| File management | 1 file | 100s of files | 100s of files |
| Dependencies | None (built-in) | None | None |
| Concurrent writes | Supported (one writer at a time) | Race conditions | Race conditions |
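The "1 query" export row deserves a concrete sketch. CSV export works the same way as `export_json`: one query that flattens the JSON column with `json_extract`, then the stdlib `csv` module. The table and column names here mirror the `items` schema above; the sample row is invented:

```python
import csv
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (source TEXT, url TEXT, data JSON)")
conn.execute(
    "INSERT INTO items VALUES (?, ?, ?)",
    ("hackernews", "https://example.com/1",
     json.dumps({"title": "Hi", "score": 5})),
)

# One query: pull the JSON fields out as plain columns.
rows = conn.execute(
    "SELECT url, json_extract(data, '$.title') AS title, "
    "json_extract(data, '$.score') AS score FROM items"
).fetchall()

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["url", "title", "score"])
    writer.writerows(rows)
```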
## When NOT to Use SQLite
- Multiple servers writing simultaneously — use PostgreSQL
- You need a REST API — use Supabase or PocketBase
- Data exceeds 10GB — consider DuckDB for analytics
- Team collaboration — use a proper database server
For single-machine scraping? SQLite is perfect.
How do you store your scraped data? JSON files? PostgreSQL? Something else? I'm curious what works for different scales 👇