TIL - Picking a Database by Its Read/Write Pattern

#software #database

I've shipped systems on Postgres, BigTable, and a column store. If you asked me which one to reach for, I could answer from experience. But until I read DDIA Chapter 3, I couldn't have told you why they behave so differently, because I'd never actually thought about what the storage engine does with the bytes underneath.

That turned out to be the whole lesson. The internal data structure of a database isn't an implementation detail we can ignore. It's designed and tuned around a read/write access pattern. So once we understand the structures, "which database" stops being a vibe and starts being a consequence of how our workload reads and writes.

Two worlds: OLTP and OLAP

DDIA splits database workloads into two shapes. These aren't database types, they're descriptions of how a workload behaves:

OLTP (Online Transaction Processing): point reads/writes, low latency, high concurrency, few rows touched per query, needs fresh data. (The app serving user requests)
OLAP (Online Analytical Processing): large scans, few columns across many rows, throughput over latency, bulk/batch writes, tolerates staleness. (The analytics and reporting)

The thing that finally clicked for me: these labels are something we put on a workload after we've described it, not a bucket we pick first. And one system usually has both. The write path can be analytical-shaped while the read path is transactional-shaped.

Picking an engine is asking two questions about the workload

If the storage engine is tuned to the access pattern, then choosing one is really about describing how our workload reads and writes, then matching it. Two questions get us most of the way there.

One quick note before that. I'm only scratching the surface of each technology here. This is what clicked for me, not a deep dive into compaction internals or page layouts. If you want the full mechanics, the book itself (and plenty of good articles) go far deeper. Treat the names below as starting points to go read about, not as the last word.

Q1: write-heavy or read-heavy? (LSM vs B-tree)

Inside the OLTP world, there are two ways to organize data on disk, and the difference is entirely about how each one handles a write.

Log-structured (LSM-tree). New and updated data is appended. Writes land in an in-memory sorted structure (the memtable), which gets flushed to disk as an immutable, sorted file (an SSTable). Background compaction merges those files and throws away superseded values. Because every write is a sequential append, write throughput is high. The cost is read amplification, where a key might live in several SSTables, so reads check multiple files (Bloom filters exist to skip the ones that definitely don't have the key). Databases built this way: Cassandra, RocksDB, LevelDB, HBase, and BigTable.

Update-in-place (B-tree). Data lives in fixed-size pages, and an update rewrites the page where the key already is. Reads are predictable (a bounded number of page lookups), and this structure fits transactions cleanly. The cost is write amplification: writes are random I/O, and a single update often rewrites a whole page (plus the write-ahead log entry, and sometimes a page split). Databases built this way: Postgres, MySQL (InnoDB), and most traditional relational databases.

So the heuristic "LSM for write-heavy, B-tree for read-heavy or transactional" isn't a rule to memorize. It falls out of how each structure treats a write. That's the thesis in miniature: the engine is shaped by the access pattern. This is why Cassandra can absorb a firehose of writes while Postgres gives us clean transactions. They picked different sides of this tradeoff.

Q2: point access or analytical scan? (and why "row vs column" is the wrong axis)

This is the part that reframed how I think about it.

I used to assume the storage axis was "row-oriented vs column-oriented." After this chapter I think that's the wrong axis. The real distinction is whether our storage unit is explicitly keyed or positionally implied.

In a row store like Postgres or MySQL, a row is one keyed record with all its columns fused together. Update one column, and it rewrites the whole record.
In a wide-column store like BigTable, Cassandra, or HBase, the unit isn't row -> all columns. The on-disk key is the full cell coordinate (row key, column, timestamp) -> value. Each cell is independently keyed. A row's cells are sorted to sit next to each other (that's the only thing "row-oriented" really means here, just locality), but each cell is written and updated on its own.
In a column store like Parquet, ClickHouse, Vertica, or BigQuery, each column is a separate file of bare values, with no key stored next to each value. Row N is whatever sits at position N in every file. Position is the implicit key.

For example, let's say we have these rows:

row A: impression=11, click=12, cost=13
row B: impression=14, click=15, cost=16
row C: impression=17, click=18, cost=19

The storage visualization for each type is:

That single distinction explains the write behavior the orientation framing can't:

A row store updates one column by rewriting the whole row, since the columns are fused into a single keyed record. Cheap when you're touching one row at a time, but two writers updating different columns of the same row still contend on the same record.
A wide-column store can update one column of one row without touching the others, because the cell is its own keyed thing. Independent writers writing different columns to the same row never collide.
A column store can't cheaply insert one row, because there's no key to address it by. Position ties every value to its row, so inserting means realigning every file (and the compressed, sorted columns reject cheap in-place edits). That same positional layout is exactly what makes scans and compression spectacular.

Strength and weakness come from the same design decision. Keyed buys us independent writes and locality. Positional buys us compression and scan speed. Neither is "better." Each is tuned to a different access pattern.

The read side mirrors it:

A row store fetches one whole record in a single read, but scanning one column means dragging every full row off disk.
A wide-column store reads one row as a contiguous scan over its adjacent cells, but it's still reading row by row, not scanning one column across everything.
A column store reads one column across millions of rows by touching just that file, but reassembling one whole row means gathering a value from every file.

Point access wants keyed; analytical scan wants positional. Read and write pull the same direction, because they're the same access-pattern bet.

Where I'd been relying on this without naming it

A system I worked on stored predictions for an ads insights platform: three prediction types (impression uplift, click uplift, cost saving), produced by several suggestion engines, kept in BigTable. We did reason our way there from the read/write pattern. What this chapter gave me was sharper vocabulary for the thing we were already doing.

Run the workload through the two questions:

Writes were high-volume and batch: several engines aggregating offline, then loading results in. That's append-heavy ingest, the LSM side of Q1.
Reads were point and prefix lookups on a composite key like shop#group (read one group, or a range of groups under a shop). Not analytical scans across one column, so keyed, not positional, on Q2.

Append-heavy writes, keyed point and prefix reads. That shape fits a wide-column store like BigTable almost exactly, so that's where we landed. None of it is "row vs column" or "SQL vs NoSQL." It's just the workload described honestly, and the engine fell out of the description.

The fit wasn't perfect, either. Some read paths still needed online aggregation, which traded read latency for write-path simplicity, but that's a story for another post.

The part the access pattern doesn't decide

Here's what I only noticed reading this chapter. We assessed the read/write pattern, it pointed at BigTable, and we stopped there. I never actually asked the next question: could Postgres have done this too?

The honest answer is, probably yes.

Whole-row update contention from multiple writers? Postgres has a workaround. Split each writer's columns into separate tables keyed on (shop, group). No shared row, no contention.

Avoid joins on large tables? That join only exists because I split the tables. Keep one wide table and there's no join at all. And even the split join is cheap when it's indexed on a bounded set of rows.

The data too big for one machine? It wasn't, at the time. It fit comfortably on a single node.

So the thing that pointed me at BigTable, the access pattern, wasn't actually the thing that ruled Postgres out. Postgres could have served the same reads and writes. I just never ran the comparison, because I'd found something that fit and didn't look back.

That's the lesson I took from it. The access pattern narrows you to an engine class, and that narrowing is real and useful. But it doesn't rule out the alternatives within reach of that class. What actually separates BigTable from Postgres here is how each one grows past a single machine. Postgres is single-node by design: scaling out means sharding it yourself, picking a shard key, routing queries in the application, and rebalancing by hand as it grows. BigTable partitions automatically, splitting data into tablets by row-key range and spreading them across machines on its own. So once the data outgrows one node, Postgres turns into an operational project and BigTable just keeps going. That's the real dividing line, and it's the question I never got around to asking out loud.

Takeaway

So the honest shape of database selection, after this chapter:

The read/write access pattern selects the engine class: LSM vs B-tree, keyed vs positional. This part is real and reasoned, and it's what I actually did.
Scale and transactional needs finish the selection, and these are independent of the access pattern. Two stores can match the access pattern perfectly and differ entirely on whether they shard or do multi-row ACID. This is the step it's easy to skip once you've found something that fits.

The data structure follows the access pattern. The product follows the data structure plus scale.

This is the storage-engine layer of the decision. Replication, partitioning, and consistency (later chapters in the book) add their own factors, but those sit on top of this layer, they don't replace it.