DEV Community: Tanay Kedia

Your ORM is the bottleneck

Tanay Kedia — Sun, 24 May 2026 08:14:07 +0000

SQLite handles 88,000 writes per second on commodity hardware. My ORM caps at 3,800. PRAGMA tuning does not move that number.

I learned this the hard way, over 9.4 hours of benchmarking. 11 SQLite configurations, 230 million rows, 37 measured datapoints. Across every config I threw at it, ORM throughput landed between 3,045 and 3,821 rows per second. A 26% spread on the thing every guide on the internet is busy tuning. Same workload, rewritten as raw executemany, did 87,893.

The database was fine. It was always fine.

The headline numbers

Path	10M rows	50M rows	p99 latency
SQLAlchemy ORM	3,696 r/s (45 min)	3,682 r/s (3.8 hrs)	1,492 ms
Raw `executemany`	87,893 r/s (1.9 min)	65,742 r/s (12.7 min)	478 ms
Speedup	23.8×	17.9×	3.1×

Zero errors across all runs. Every config. Every scale.

What I actually ran:

9.4 hours of continuous benchmarking
230 million rows written across all runs
11 configurations × 3 checkpoints + 2 paths × 2 scales = 37 datapoints
Single host, NVMe storage, fresh database file per run

Why I started running this

SQLite is having a moment. Rails 8 ships with it as a first-class production database. Kent C. Dodds moved off a Postgres cluster to run it. Expensify is doing 4 million queries per second on a single SQLite file. Cloudflare D1 runs it at the edge with 8 ms P99 reads.

So I sat down to tune mine. WAL, mmap, cache size, sync mode, the usual PRAGMA folklore. I had a question I wanted answered, and it was not the one most blogs answer.

Most blogs tune SQLite. I wanted to know whether any of that tuning matters when you write through an ORM.

It does not. Not even a little.

What I tested it on

Component	Spec
CPU	8-core / 16-thread
RAM	23.2 GB DDR
Storage	Samsung 990 EVO Plus 1TB NVMe
OS	Linux 6.8.0
Python	3.11
SQLAlchemy	2.0 (sync)
SQLite	WAL mode, QueuePool
Schema	Single table, 11 columns, UUID primary key

Everything ran on the same machine, same NVMe, same kernel. No Docker. No network. Local SQLite, period.

The two benchmarks

Two harnesses, two questions.

Config sweep. 11 SQLite configurations, 10M rows each, all through SQLAlchemy. Does PRAGMA tuning matter?

Before / after. Same workload, ORM path vs raw executemany, at 10M and 50M. How much throughput does the ORM cost?

The ORM path

# Insert phase: bulk_save_objects
session.bulk_save_objects([BenchRow(**row) for row in chunk])
session.commit()

# Upsert phase: insert().on_conflict_do_update()
stmt = insert(BenchRow).values(chunk)
stmt = stmt.on_conflict_do_update(
    index_elements=["tenant_id", "entity_id", "sub_entity_id", "bucket_index"],
    set_={col: stmt.excluded[col] for col in update_cols}
)
session.execute(stmt)
session.commit()

The raw path

# sqlite3.executemany via the dbapi connection. Skips SQLAlchemy entirely.
raw_conn = session.connection().connection.dbapi_connection
cursor = raw_conn.cursor()
cursor.executemany(
    "INSERT INTO bench_rows (row_id, tenant_id, entity_id, ...) VALUES (?, ?, ?, ...)",
    rows_as_tuples
)
raw_conn.commit()

Both paths use the same generator, same schema, same target row count. The only difference is whether SQLAlchemy is in the picture.

Schema

A single 11-column table, designed to look like a real production table rather than a synthetic benchmark.

Column	Type	Role
`row_id`	String(36)	UUID primary key
`tenant_id`	String(36), indexed	Logical sharding key
`entity_id`	Integer	Natural-key component
`sub_entity_id`	Integer	Natural-key component
`bucket_index`	Integer	Natural-key component
`measurement_x / y / z`	Float	Numeric payload
`category_id`	Integer	Categorical field
`weight`	Float	Probabilistic payload
`status`	Integer	State machine field
`created_at / updated_at`	DateTime	Tracking timestamps

Unique constraint on (tenant_id, entity_id, sub_entity_id, bucket_index), which is the upsert conflict target. Row size lands around 500 bytes including the UUID.

Data generation

A streaming generator that yields chunks of unique rows with deterministic natural keys. Constant memory regardless of total scale. It never materialises the full row set.

def streaming_chunks(total_rows, chunk_size):
    counter = 0
    while counter < total_rows:
        chunk = []
        for _ in range(chunk_size):
            chunk.append({
                "row_id": uuid4().hex,
                "tenant_id": tenant_id,
                "entity_id": counter % 1000,
                "sub_entity_id": (counter // 1000) % 1000,
                "bucket_index": counter // 1_000_000,
                # ... numeric and categorical fields
            })
            counter += 1
        yield chunk

The determinism matters. Re-running the same generator produces identical natural keys, which forces the upsert path to actually exercise ON CONFLICT instead of trivially inserting fresh rows. Without this, you are benchmarking inserts and calling it an upsert benchmark.

Checkpoints and recovery

Results saved at 3M, 5M, and 10M for the config sweep. 10M and 50M for the before / after. Intermediate JSON written after each segment, with --resume to pick up where we left off. The 50M ORM run took 3.8 hours. The laptop suspended once. The resume capability paid for itself.

How much compute went into this

Phase	Duration	Rows	Datapoints
Config sweep (11 configs × 10M)	5.6 hours	110M	33
Before / after (10M + 50M, both modes)	3.8 hours	120M	4
Total	9.4 hours	230M	37

Single host, single process at a time. Background psutil sampler at 1Hz throughout. Each config got a freshly created database file via tempfile.TemporaryDirectory(), so no run was polluted by any prior run.

Benchmark 1: the config sweep

11 SQLite configurations, 10 million rows each, all written through SQLAlchemy.

Config	Chunk size	Sync	Cache	Pool	mmap
baseline	5,000	NORMAL	-4096	5	0
chunk_1000	1,000	NORMAL	-64000	5	256MB
chunk_5000	5,000	NORMAL	-64000	5	256MB
chunk_10000	10,000	NORMAL	-64000	5	256MB
chunk_25000	25,000	NORMAL	-64000	5	256MB
chunk_50000	50,000	NORMAL	-64000	5	256MB
optimized	10,000	NORMAL	-64000	5	256MB
aggressive	25,000	OFF	-64000	5	256MB
pool_3	10,000	NORMAL	-64000	3	256MB
pool_5	10,000	NORMAL	-64000	5	256MB
pool_8	10,000	NORMAL	-64000	8	256MB

Results at 10M rows

Config	Throughput	p99 latency	Peak RSS
chunk_1000	3,821 r/s	313 ms	790 MB
baseline	3,802 r/s	1,418 ms	195 MB
chunk_5000	3,673 r/s	1,500 ms	511 MB
chunk_10000	3,621 r/s	2,887 ms	610 MB
optimized	3,576 r/s	2,982 ms	611 MB
pool_5	3,574 r/s	2,960 ms	609 MB
pool_8	3,573 r/s	2,957 ms	606 MB
pool_3	3,445 r/s	3,200 ms	610 MB
chunk_25000	3,262 r/s	8,490 ms	629 MB
aggressive	3,243 r/s	8,536 ms	595 MB
chunk_50000	3,045 r/s	20,508 ms	633 MB

Three things to notice.

The first is the spread. Best to worst, 3,821 vs 3,045 rows per second. That is 26% across configs that range from sync=OFF to default PRAGMAs, from 1K to 50K chunk sizes, from 3 to 8 pool connections. Whatever PRAGMA folklore you have read, on this workload, it is doing roughly nothing. The ORM is consuming the CPU budget before I/O becomes a factor.

The second is aggressive, which I configured with sync=OFF and 25K chunks expecting it to win. It came second-from-last at 3,243 r/s. I sacrificed durability for nothing. Forward Email found the same thing: sync=OFF is sometimes slower than NORMAL because the WAL machinery is different, not just absent. Shivek Khurana calls the difference "not very significant". It is not.

The third is pool size. pool_3, pool_5, pool_8: 3,445 / 3,574 / 3,573. Under 4% spread. SQLite is single-writer no matter how many connections you open. SQLAlchemy's QueuePool just serialises writes at the application layer, which is functionally identical to the single-writer architecture every production SQLite deployment converges on.

Baseline wins on the throughput-per-resource ratio. 5K chunks, default PRAGMAs, pool of 5. 3,802 r/s using only 195 MB of RSS. chunk_1000 ekes out 0.5% more throughput but uses 4× the memory. Not a trade I am taking.

Benchmark 2: ORM vs raw SQL

Method	Scale	Throughput	Duration	p99	Peak RSS
ORM	10M	3,696 r/s	45.1 min	1,492 ms	177 MB
ORM	50M	3,682 r/s	226.3 min	1,563 ms	177 MB
Raw `executemany`	10M	87,893 r/s	1.9 min	478 ms	155 MB
Raw `executemany`	50M	65,742 r/s	12.7 min	853 ms	188 MB

23.8× at 10M. 17.9× at 50M. Three and a half hours of wall time turn into thirteen minutes.

Where the ORM time goes

Every row through SQLAlchemy goes through this:

Raw executemany does this:

Two layers, seven gone. None of the seven are free. Each one is a Python attribute access, a dict lookup, a function call, something the JIT cannot inline because of how SQLAlchemy is structured. On a one-shot insert it is invisible. On 10 million rows it costs 43 minutes.

Scaling behaviour, both paths

ORM throughput is flat. 3,696 r/s at 10M. 3,682 r/s at 50M. I/O doesn't get a vote because the ORM is busy.

Raw throughput degrades at scale, which is interesting. 87,893 r/s at 10M, 65,742 r/s at 50M. That is a 25% drop. The database file grew from 3.8 GB to 19.5 GB, the B-tree got deeper, WAL checkpoints got more expensive. This is SQLite's actual I/O scaling curve. The ORM never gets close enough to it to see it.

Chunk size controls latency, not throughput

Of all the things I tuned, chunk size was the only one that mattered, and not for the reason I expected.

Chunk size	Throughput	p99	Ratio
1,000	3,821 r/s	313 ms	1.0×
5,000	3,673 r/s	1,500 ms	4.8×
10,000	3,621 r/s	2,887 ms	9.2×
25,000	3,262 r/s	8,490 ms	27.1×
50,000	3,045 r/s	20,508 ms	65.5×

p99 scales linearly with chunk size. 50× larger chunks, 66× worse p99. Throughput moves 20%.

Each chunk is one transaction. Larger transactions hold the write lock longer, block WAL checkpoints longer, allocate more memory. The SQLite write lock is exclusive. A 20-second transaction means every other writer waits 20 seconds.

So: use the smallest chunks your throughput requirement allows. 1K to 5K is the sweet spot. You get ~3,800 r/s either way, and your p99 drops from 20 seconds to 300 milliseconds.

The full data nobody usually shows

The 10M table is the endpoint. The story is the flatness, which you can only see if you look at how throughput evolved during the run.

Cumulative throughput at each checkpoint

Config	@3M	@5M	@10M	Δ (3M→10M)
aggressive	3,232	3,219	3,243	+0.3%
baseline	3,760	3,761	3,802	+1.1%
chunk_1000	3,871	3,854	3,821	-1.3%
chunk_5000	3,734	3,715	3,673	-1.6%
chunk_10000	3,640	3,625	3,621	-0.5%
chunk_25000	3,384	3,349	3,262	-3.6%
chunk_50000	2,966	3,009	3,045	+2.7%
optimized	3,640	3,605	3,576	-1.8%
pool_3	3,472	3,463	3,445	-0.8%
pool_5	3,570	3,577	3,574	+0.1%
pool_8	3,598	3,588	3,573	-0.7%

No config moved more than ±3.6% from 3M to 10M. Every single one is flat. Whatever overhead the ORM imposes at row one, it imposes at row ten million.

This is not how databases normally behave. Most B-tree workloads degrade at scale because the tree gets deeper, fewer pages fit in cache, fsync takes longer. None of that mattered here. The ORM was dominating so completely that SQLite's actual scaling curve never showed up.

Per-segment throughput

Config	0→3M	3M→5M	5M→10M
baseline	3,760	3,762	3,844
chunk_1000	3,871	3,828	3,789
chunk_50000	2,966	3,076	3,081

The 5M→10M segment is sometimes faster than 0→3M. Filesystem cache is warm by then, no cold-start cost. The ORM is so far from being I/O-bound that even SQLite's I/O improvements show up positively.

p99 at each checkpoint

Config	@3M	@5M	@10M
chunk_1000	312 ms	312 ms	313 ms
baseline	1,403 ms	1,410 ms	1,418 ms
chunk_5000	1,463 ms	1,536 ms	1,500 ms
pool_5	3,005 ms	2,847 ms	2,960 ms
chunk_10000	2,887 ms	2,887 ms	2,887 ms
optimized	3,039 ms	2,903 ms	2,982 ms
chunk_25000	7,529 ms	7,710 ms	8,490 ms
aggressive	8,764 ms	8,536 ms	8,536 ms
chunk_50000	20,647 ms	16,492 ms	20,508 ms

p99 is flat too. Whatever your p99 looks like at 3M, that is what it looks like at 10M. Your latency budget is set by your chunk size, full stop.

Before / after with every field

Field	ORM @ 10M	ORM @ 50M	Raw @ 10M	Raw @ 50M
Throughput	3,696 r/s	3,682 r/s	87,893 r/s	65,742 r/s
Duration	2,705 s	13,581 s	114 s	761 s
Batch count	2,000	8,000	200	800
Batch p50	1,320 ms	1,324 ms	424 ms	652 ms
Batch p95	1,424 ms	1,434 ms	470 ms	830 ms
Batch p99	1,492 ms	1,563 ms	478 ms	853 ms
Batch avg	1,324 ms	1,330 ms	424 ms	659 ms
Peak RSS	177 MB	177 MB	155 MB	188 MB
DB size	3.5 GB	17.8 GB	3.9 GB	19.5 GB
Segment throughput (10M→50M)	n/a	3,678 r/s	n/a	61,846 r/s

A few things jump out.

ORM memory is flat at 177 MB regardless of scale. We're streaming through chunks, no accumulation. The 5× DB size growth costs zero memory. The streaming approach is validated.

Raw memory grows modestly, 155 to 188 MB. Larger chunks (50K vs 5K) hold larger tuple lists briefly. Trivial.

The raw 10M→50M segment ran at 61,846 r/s versus 87,893 r/s for the first 10M. That 30% drop is SQLite's actual I/O cost showing through: B-tree depth, page splits, WAL checkpoint time. Completely invisible behind the ORM ceiling, which was at 3,700 r/s either way.

ORM p99 is 3.1× worse than raw at 10M and gets worse with scale (1,492 → 1,563 ms versus 478 → 853 ms). Even the ORM's own baseline cannot keep p99 stable.

DB sizes differ slightly between paths (3,518 MB ORM vs 3,851 MB raw at 10M). Both are correct. Page-fill heuristics and WAL checkpoint timing diverge slightly. Functionally identical data.

How this lines up with the rest of the internet

Raw SQLite write throughput

Source	Hardware	Throughput	Notes
Marending (2024)	M1 Mac	113,684 w/s	WAL + sync=NORMAL, mixed 80/20 hit 197,012 ops/s
Anders Murphy (2025)	M1 Pro 16GB	121,922 TPS	Dynamic batching, 1B rows
Anders Murphy (2025)	M1 Pro 16GB	44,096 TPS	No batching, 1B rows
Our data	Linux x86 NVMe	87,893 r/s	Python executemany @ 10M, UUID PK + 4-col upsert
Marending	Linux x86 Hetzner CPX31	80,145 w/s	WAL + sync=NORMAL
tenthousandmeters	Not specified	72,769 ops/s	1KB records, single-threaded
Our data	Linux x86 NVMe	65,742 r/s	Python executemany @ 50M
Evan Schwartz	Not specified	60,061 r/s	Single-writer connection
Marending	Linux ARM Hetzner CAX31	46,512 w/s	WAL + sync=NORMAL
Shivek Khurana	i9 MacBook 32GB	15,576 w/s	128 workers, WAL mode
Forward Email	Node.js v20	11,800 inserts/s	wal_autocheckpoint=1000
Forward Email	Node.js v20	10,548 inserts/s	Production baseline

My raw numbers sit in the middle. 65K to 88K rows per second on commodity NVMe through Python is what the ecosystem reports for similar hardware. Nothing exotic going on.

ORM overhead across stacks

Source	Stack	Raw	ORM	Overhead
Our data	Python/SQLAlchemy 2.0/SQLite @ 10M	87,893	3,696	23.8×
Our data	Python/SQLAlchemy 2.0/SQLite @ 50M	65,742	3,682	17.9×
SQLAlchemy bulk benchmarks	Python/SQLAlchemy/PG	Core insert	`session.add()` loop	40×
Same	Python/SQLAlchemy/PG	Core insert	`bulk_save_objects`	15×
Same	Python/SQLAlchemy/PG	PG `COPY`	`session.add()` loop	240×
Evan Schwartz	Rust/sqlx	60,061	2,586 (50-conn pool)	23×
remusao	Python/sqlite3	625K r/s	`execute` loop	1.7×

The 17–40× tax on bulk operations through an ORM is consistent across stacks. The 240× COPY gap shows what happens when you bypass both the ORM and the Python driver. Worth flagging.

Production deployments at scale

Company	Scale	Architecture	Notable choice
Expensify	4M QPS, 10B rows	Custom Bedrock layer, bare metal 192 cores	Modified SQLite (disabled POSIX locks)
37signals (ONCE)	1000s of installs	Per-customer SQLite, Rails 8	Solid adapters (cache, queue, cable)
Kent C. Dodds	6 regions	LiteFS on Fly.io	Postgres cluster → SQLite
Cloudflare D1	Edge, global	SQLite + Workers	8 ms P99 reads, 500–2K writes/s
Turso	Embedded replicas	SQLite + libSQL	624 µs reads, 40 µs connections
Litestream / Ben Johnson	Sub-ms queries	SQLite + WAL replication to S3	10–20 µs per query, 50–100× faster than intra-region Postgres
extensionpay.com	~120M req/month	$14 DigitalOcean droplet	3+ years on SQLite + Litestream to B2
Glench (HN)	Production SaaS	Single $14 droplet	Memory mapping was biggest perf gain
tazu (HN)	Mid-six-figure SaaS	95% reads / 5% writes	Separate reader pool + single writer
hruk (HN)	8-figure ARR	SQLite + Litestream on EC2	~250 µs insert latency on EBS

Some of these are extreme. Expensify is doing 4 million QPS on a custom SQLite fork running on a single $30K box. Most are not. Glench is doing 120 million requests a month on a $14 droplet. The interesting thing is the architectural convergence, which I'll come back to.

Sync mode, cross-referenced

Source	sync=OFF	sync=NORMAL	sync=FULL/EXTRA	Conclusion
Our data (aggressive vs baseline)	3,243 r/s	3,802 r/s	not measured	NORMAL wins via ORM
Forward Email (Node.js)	10,017	10,548	3,495 (EXTRA)	OFF slower than NORMAL; EXTRA 3× slower
tenthousandmeters	not measured	72,769	29,000 (FULL)	NORMAL 2.5× faster than FULL
Shivek Khurana	not measured	"not significant"	"not significant"	-6% to +10% variance

Consensus: NORMAL. OFF saves nothing meaningful. FULL and EXTRA can be 2 to 3× slower. The folklore that OFF is a free win for performance is just wrong.

Connection pool architecture

Source	Architecture	Throughput	Note
Our data (pool_3 / pool_5 / pool_8)	QueuePool 3–8	3,445 / 3,574 / 3,573 r/s	<4% spread
Evan Schwartz	50-conn pool	2,586 r/s	p99 = 182 seconds
Evan Schwartz	Single writer	60,061 r/s	23× faster, p99 = 82 ms
Stephen Margheim	Reader pool + IMMEDIATE writer	Zero errors up to 16 concurrent	Default DEFERRED fails at 4+
tenthousandmeters	App-level mutex	56K–66K r/s stable	Holds up to 256 threads

Everyone arrives at the same answer eventually: serialise writes at the application layer. Whether that's QueuePool, an app mutex, or a single-writer connection, the architecture is identical in spirit.

The production config everyone agrees on

If you cross-reference OneUptime, Forward Email, Shivek Khurana, Stephen Margheim, the SQLite docs, and my own data, the config converges:

PRAGMA journal_mode = WAL;          -- Concurrent reads + writes
PRAGMA synchronous = NORMAL;        -- Safe for WAL mode, 2-13% faster than FULL
PRAGMA busy_timeout = 5000;         -- 5s minimum; 30s for high-contention apps
PRAGMA cache_size = -64000;         -- 64MB; diminishing returns beyond this
PRAGMA mmap_size = 268435456;       -- 256MB; helps read-heavy workloads
PRAGMA temp_store = MEMORY;         -- Faster temp tables (watch RSS on big VACUUMs)
PRAGMA foreign_keys = ON;           -- Data integrity
PRAGMA auto_vacuum = INCREMENTAL;   -- Reclaim space without full rebuild
PRAGMA wal_autocheckpoint = 1000;   -- Default, ~4MB WAL before checkpoint

Connection architecture

Every production deployment I looked at converged on single-writer plus multi-reader. SQLAlchemy's QueuePool with pool_size=5 does the same thing in practice, which is why I saw zero errors across 110 million rows.

What does not move the needle when you're using an ORM

PRAGMA	Effect on ORM throughput	Why
synchronous = OFF vs NORMAL	< 5%	ORM overhead dominates I/O savings
cache_size 4MB vs 64MB	< 3%	B-tree lookups cheap vs Python object creation
mmap_size 0 vs 256MB	< 2%	Reads are fast, writes are ORM-bound
pool_size 3 vs 8	< 4%	Single-writer means pool size is irrelevant for writes

If you skipped to this table, here is the summary: nothing on the left changes the right.

When to throw the ORM out

The ORM is fine for most things. CRUD, reads, validation, relationship traversal, normal application work. The fast path is for the hot bulk routes. Use this decision tree:

Scenario	ORM	Raw SQL	What to do
CRUD, < 1K rows	Good enough	Premature	ORM
Bulk load 10K–1M rows	2–5 min per million	7 sec per million	Raw `executemany`
Bulk load 1M–100M rows	45 min per 10M	1.9 min per 10M	Raw `executemany`
Sustained > 1K r/s ingest	Ceiling at 3.8K	Headroom to 88K	Raw `executemany`
Upsert-heavy workloads	Works but slow	18× faster	Raw `executemany`

Hybrid repositories

The pattern I now use in production: ORM for normal CRUD, a raw fast path for bulk.

class BenchRepository:
    def create(self, data: dict) -> BenchRow:
        # Single-row CRUD. ORM is fine.
        obj = BenchRow(**data)
        self.db.add(obj)
        self.db.flush()
        return obj

    def bulk_insert_turbo(self, rows: Iterable[dict]) -> int:
        # Bulk path. 23× faster than the ORM version.
        raw_conn = self.db.connection().connection.dbapi_connection
        cursor = raw_conn.cursor()
        tuples = [self._dict_to_tuple(r) for r in rows]
        cursor.executemany(INSERT_SQL, tuples)
        raw_conn.commit()
        return len(tuples)

ORM for correctness. Raw SQL for throughput. Both in the same repository, both committed to the same session, both touching the same tables.

Six things I did not expect

`temp_store = MEMORY` can be slower than disk

Forward Email's measurements showed disk-based temp storage beating memory. Large VACUUM operations can chew through 10+ GB with memory temp storage and either OOM or thrash swap. The folklore that "memory is faster" is not always true. Measure.

SQLite's default busy handler uses exponential backoff, and it is bad

Stephen Margheim showed that uniform 1ms retry intervals beat SQLite's built-in exponential backoff. P99.99 drops from seconds to ~500 ms. The default backoff penalises long-waiting queries, which is the wrong end of the distribution to penalise.

WAL is slower than DELETE at low concurrency

Khurana's data shows WAL is 43% slower than DELETE at 1 worker. It only breaks even at 4 workers and starts winning at 8. If you have a single-threaded batch job, the rollback journal is faster. Counterintuitive, but the WAL machinery has overhead you only amortise across concurrent operations.

A 50-connection pool is 23× slower than a single writer

Evan Schwartz measured 2,586 vs 60,061 r/s. The 50-connection pool creates pseudo-serialisation with scheduling overhead. The single writer just goes. Same throughput pattern as ORM, different layer: software making sure writers don't collide, but doing it badly.

Node.js version matters more than SQLite config

Forward Email reported Node v24 reads were 57% slower than v20 on the same database with the same PRAGMAs. The application layer dominates so completely that a runtime upgrade can dwarf years of database tuning. I cannot help feeling vindicated by this.

Write degradation at 10M–100M rows is real

Multiple HN reports mention writes degrading once tables hit 8–9 figures. My raw numbers show it: 87,893 r/s at 10M, 65,742 r/s at 50M, a 25% drop driven by B-tree depth, page splits, and WAL checkpoint cost. Through the ORM, you never see it. The ORM is too slow to reach the curve.

Reproducing this

If you want to run the same benchmarks, the repo is open source.

git clone https://github.com/TanayK07/sqlite-orm-bench.git
cd sqlite-orm-bench
pip install -e .

# Smoke test, ~5 seconds
python -m sqlite_bench.before_after --scales 10K --mode both

# ORM vs raw at 10M, ~50 minutes
python -m sqlite_bench.before_after --scales 10M --mode both

# Full 11-config sweep at 10M, ~5.5 hours
python -m sqlite_bench.scale_benchmark --scales 3M,5M,10M --configs all

# Pick up after a crash
python -m sqlite_bench.scale_benchmark \
  --scales 3M,5M,10M --configs all \
  --resume results/scale/results.json

Requirements: Python 3.11+, SQLAlchemy 2.0+, NVMe storage, 8+ GB RAM, Linux.

The harness measures throughput cumulative and per-segment, p50/p95/p99 per batch, peak RSS, DB and WAL sizes, and error counts. Results go to JSON with --resume support, because long runs crash and the laptop will close its lid.

If you run it on different hardware, please open an issue. I'm collecting a cross-hardware table.

What I would do differently if I were starting today

Six conclusions, all from the data above.

The ORM is the bottleneck. Not SQLite. Across 11 configurations, my ORM throughput never broke 3,821 r/s. Raw executemany on the same hardware hit 87,893. SQLite can absorb 23× more writes than my ORM can produce. Hours spent tuning PRAGMAs against an ORM-bound workload are hours you do not get back.

PRAGMA tuning is irrelevant for ORM-bound workloads. sync=OFF, 64 MB cache, 256 MB mmap, 8-connection pool, none of it moves the needle more than 26%. The ORM's per-row work (object creation, attribute instrumentation, identity map, SQL compilation) consumes the CPU budget before SQLite gets a chance to be slow.

Chunk size is the only tunable that matters for latency. p99 scales 66× across chunk sizes (313 ms at 1K, 20,508 ms at 50K), and throughput stays within 20%. One knob. Turn it down. 1K to 5K chunks gives you predictable latency without giving up throughput.

The baseline config wins. sync=NORMAL, default cache, pool=5, chunk=5000. No exotic PRAGMAs needed. This matches the production consensus across OneUptime, Forward Email, Litestream, Rails 8, and SQLite's own documentation. It is also what I would have shipped if I had skipped the benchmarks entirely. Annoying, but useful to know.

Raw SQL degrades gracefully at scale. 87,893 r/s at 10M, 65,742 r/s at 50M, a 25% drop. This is SQLite's actual I/O scaling curve: B-tree depth, page splits, WAL checkpoints. Real, predictable, manageable. The ORM masks it entirely because the ORM never gets close to the I/O boundary.

QueuePool eliminates concurrency errors. Zero errors across 110 million rows, 11 configurations, 2 paths, 2 scales. QueuePool serialises writes at the application layer, matching the single-writer pattern every production deployment converges on. The fact that I never saw a database is locked is itself a result.

References

Production case studies

Expensify, Scaling SQLite to 4M QPS on a Single Server. 10B rows, 192 cores, custom Bedrock layer.
Ben Johnson / Litestream, All-In on Server-Side SQLite. 10–20 µs per query, 50–100× faster than intra-region Postgres.
DHH / 37signals, Rails 8: No PaaS Required. SQLite as Rails 8 default, per-customer DB pattern.
Kent C. Dodds, Why You Should Probably Be Using SQLite. Postgres cluster to distributed SQLite migration.
Cloudflare D1, SQLite at the Edge After 6 Months. P99 8 ms reads, 40–60% latency reduction.
OneUptime, SQLite Production Setup. Production configuration with monitoring thresholds.
Turso, Embedded Replicas. 624 µs read latency, 40 µs connection time.

Benchmarks and technical analysis

Shivek Khurana, SQLite in Production: A Real-World Benchmark. WAL vs DELETE, sync modes, concurrency scaling.
Anders Murphy, 100K TPS Over a Billion Rows. Dynamic batching, SQLite vs Postgres.
Marending, How Fast Is SQLite?. Cross-platform write throughput (46K–113K w/s).
tenthousandmeters.com, SQLite Concurrent Writes. Multi-threaded scaling, app-level mutex pattern.
Evan Schwartz, Your SQLite Connection Pool Might Be Ruining Your Write Performance. Single-writer 23× faster than 50-connection pool.
Forward Email, SQLite Performance Optimization PRAGMA Guide. sync=OFF no better than NORMAL; temp_store=MEMORY can be slower.
Stephen Margheim, SQLite on Rails: Optimal Performance. IMMEDIATE transactions, uniform busy handler retry.

SQLite documentation

SQLite Official, Appropriate Uses for SQLite.
SQLite Official, Write-Ahead Logging.
SQLite Official, Speed Comparison. Transaction batching is the 10–20× improvement.

ORM performance

SQLAlchemy, Performance FAQ.
SQLAlchemy Bulk Insert Benchmarks, Fast Bulk Insert. Core insert 40× faster than session.add().

Community reports

Hacker News, SQLite in Production Discussion. Multiple production deployments, 8-figure ARR on SQLite.

About this work

I started running these benchmarks while building a high-throughput spatial data ingestion pipeline on SQLite, on edge hardware, with structured measurements arriving by the million. The ORM ceiling was the forcing function. I needed to know whether I could fix it with PRAGMA tuning or whether I had to bypass SQLAlchemy entirely.

The answer turned out to be the latter, and the benchmark suite I built to get there became this repo.

Hardware context matters, of course. My NVMe results will differ from EBS, spinning disk, or eMMC. The relative findings (the ORM ratio, the config irrelevance, the chunk size vs latency story) should hold across storage tiers. The absolute numbers will not. If you run this on different hardware, open an issue, and I'll add your numbers to the comparison table.

What happens when you send an Ethernet frame? I wrote 289 lessons to find out.

Tanay Kedia — Sat, 23 May 2026 15:29:01 +0000

Every networking course I found had the same problem.

Either it's pure theory — OSI model diagrams, "TCP does a three-way handshake," done. Or it skips straight to socket() and connect() and treats everything underneath as a black box.

Nothing showed me what actually happens between the bits on a wire and curl https://example.com returning a response.

So I wrote 289 lessons that build the entire network stack from scratch in C.

What "from scratch" means

You don't call libraries. You construct the bytes yourself.

Here's how you build an Ethernet frame in the course:

int nfs_frame_build(const uint8_t *dst, const uint8_t *src,
                    uint16_t ethertype, const uint8_t *payload,
                    size_t payload_len, uint8_t *out, size_t out_sz) {
    size_t eff_payload = payload_len;
    if (eff_payload < NFS_ETH_MIN_DATA)
        eff_payload = NFS_ETH_MIN_DATA;

    size_t total = NFS_ETH_HLEN + eff_payload;
    memset(out, 0, total);

    /* Destination MAC */
    memcpy(out, dst, NFS_ETH_ALEN);
    /* Source MAC */
    memcpy(out + NFS_ETH_ALEN, src, NFS_ETH_ALEN);
    /* EtherType in network byte order */
    uint16_t et_net = htons(ethertype);
    memcpy(out + 12, &et_net, 2);

    if (payload && payload_len > 0)
        memcpy(out + NFS_ETH_HLEN, payload, payload_len);

    return (int)total;
}

6 bytes destination MAC. 6 bytes source MAC. 2 bytes EtherType. Payload. That's an Ethernet frame. You built it. You understand it.

Same approach for everything else — ARP, IP, ICMP, TCP, DNS, TLS.

The 15 phases

The course is organized bottom-up:

Phases 1-3: Bits → Frames → Packets
You start with encoding schemes and bit manipulation, build Ethernet frames with raw sockets, then implement IP headers with proper checksums, write an ICMP ping, and build a router.

Phase 4: TCP (40 lessons)
This is the big one. You implement the full TCP state machine — three-way handshake, sliding window, retransmission, SACK, congestion control (Reno, CUBIC, BBR), TIME-WAIT. 40 lessons because TCP is that deep.

Phases 5-7: Sockets, Protocols, TLS
You build an HTTP parser, a DNS resolver, implement the TLS 1.3 handshake including Diffie-Hellman key exchange, AES-GCM encryption, and X.509 certificate parsing.

Phase 8: The capstone
Everything comes together. You build a complete userspace TCP/IP stack in C that can make HTTPS requests. Your code, your stack, real traffic.

Phases 9-14: Go deeper
Linux kernel networking internals (sk_buff, NAPI, netfilter), eBPF/XDP programs, container networking, building a CNI plugin, service mesh. The stuff you debug at 3am in production but never learned properly.

Phase 15: DDS & Robotics
Bonus phase covering DDS middleware used in ROS 2 and real-time systems.

What a lesson looks like

Each lesson has:

A README explaining the concept
C (or Python) source code
A Makefile
Tests
Exercises

You clone the repo, cd into a lesson, run make, and start building. No IDE required, no special tools. Just gcc and a Linux machine.

Here's DNS name encoding from the DNS lesson — turning example.com into the wire format that RFC 1035 specifies:

int nfs_dns_name_encode(const char *name, uint8_t *out, size_t out_sz) {
    size_t pos = 0;
    size_t i = 0;
    size_t name_len = strlen(name);

    while (i < name_len) {
        const char *dot = strchr(name + i, '.');
        size_t label_len = dot ? (size_t)(dot - (name + i))
                               : name_len - i;

        if (label_len == 0) { i++; continue; }
        if (label_len > 63) return -1;

        out[pos++] = (uint8_t)label_len;
        memcpy(out + pos, name + i, label_len);
        pos += label_len;
        i += label_len + (dot ? 1 : 0);
    }
    out[pos++] = 0; /* root terminator */
    return (int)pos;
}

example.com becomes \x07example\x03com\x00. Length-prefixed labels, null-terminated. You read the RFC, you write the code, you understand the format.

Why build instead of read

I used to think I understood TCP. I'd read the textbook diagrams, I could explain the three-way handshake on a whiteboard.

Then I tried to implement a sliding window with SACK support and realized I understood almost nothing. What happens when an ACK arrives for a segment you already retransmitted? How do you handle window scaling across 32-bit sequence space? What does CUBIC actually do differently from Reno?

Building forces you to answer questions that reading lets you skip.

The repo

289 lessons. 15 phases. MIT licensed. No signup, no paywall.

GitHub: github.com/TanayK07/networking-from-scratch

Site: networkingfromscratch.vercel.app

If you've ever wanted to understand what actually happens between your application and the wire, this is the course I wish existed when I started.

Contributions welcome — especially reviews of the kernel internals and eBPF phases. Open an issue or PR on GitHub.

I Built a Pokédex for AI Coding Companions

Tanay Kedia — Sat, 04 Apr 2026 12:15:15 +0000

The Idea

Claude Code has a /buddy feature — it gives you a random AI companion with ASCII art, a name, a personality, and stats. It's cute. It sits in your config file. Nobody else ever sees it.

I thought: what if we made it competitive?

What I Built

Buddy Board — a competitive leaderboard and trading card system for Claude Code companions.

One command to join:

npx buddy-board

https://buddyboard.xyz/og-image.png

How It Works

Your buddy is deterministic. It's computed from a hash of your Claude Code account ID using a seeded Mulberry32 PRNG — the same algorithm Claude Code uses internally. Your species, rarity, stats, eyes, and hat are all derived from this hash.

That means your buddy is truly yours. Same account, same buddy, every time.

The Stats

Every buddy has 5 stats (0-100): Debugging, Patience, Chaos, Wisdom, Snark. These are summed for a total score that determines your leaderboard rank.

Rarity

┌───────────┬────────┬──────────────────────────────────────────────┐ │ Tier │ Chance │ Visual Treatment │ ├───────────┼────────┼──────────────────────────────────────────────┤ │ Common │ 60% │ Clean border │ ├───────────┼────────┼──────────────────────────────────────────────┤ │ Uncommon │ 25% │ Green border │ ├───────────┼────────┼──────────────────────────────────────────────┤ │ Rare │ 10% │ Blue glow │ ├───────────┼────────┼──────────────────────────────────────────────┤ │ Epic │ 4% │ Purple glow │ ├───────────┼────────┼──────────────────────────────────────────────┤ │ Legendary │ 1% │ Gold pulse + holographic shimmer + scanlines │ └───────────┴────────┴──────────────────────────────────────────────┘shell

And there's a 1% chance of being Shiny on top of rarity. A Shiny Legendary is a 0.01% pull.

The Species

18 species, all rendered as ASCII art: duck, goose, blob, cat, dragon, octopus, owl, penguin, turtle, snail, ghost, axolotl, capybara, cactus, robot, rabbit, mushroom, chonk.

Each species has swappable eyes (6 types) and hats (8 types), giving 1,728 possible visual combinations. The BuddyDex tracks which ones have been discovered globally.

Features

Global leaderboard — ranked by total stats, filterable by species, rarity, org
Trading cards — rarity-specific CSS treatments, embeddable in GitHub READMEs
BuddyDex — Pokédex-style gallery, undiscovered species show as silhouettes
Organizations — register your team, compete on combined power
Compare — head-to-head stat breakdowns between any two buddies
OG cards — auto-generated 1200x675 share images

The Holographic CSS

The part I'm most proud of is the legendary card treatment. It uses three layered CSS effects:

Scanlines — repeating-linear-gradient at 2px intervals, 3% opacity (::before)
Holographic shimmer — rainbow gradient sweep via background-position animation on ::after
Gold pulse — breathing box-shadow keyframes (3s ease-in-out infinite)

Simple CSS, big visual impact.

Try It

bash npx buddy-board

Then add your card to your GitHub README:

markdown [![buddy](https://buddyboard.xyz/card/yourname)](https://buddyboard.xyz/u/yourname)

GitHub: TanayK07/buddy-board
Live: buddyboard.xyz

I'd love to know what species you got. Mine is a Common duck. I'm still waiting for the reroll feature.