Haji Rufai

Posted on Jun 7

I built a mini columnar storage engine in pure Python — and finally understood why Parquet is so fast

#dataengineering #python #database #performance

Last month a "fast" reporting query at work was taking 40 seconds. The table had 90 columns. The query touched 3 of them. We were reading the entire row, every byte, just to throw 97% of it away.

That's the moment columnar storage finally clicked for me — not as a buzzword, but as a thing I could feel in a slow dashboard. So over a weekend I built a tiny columnar store in pure Python to actually understand what Parquet and Arrow are doing under the hood. No pandas, no pyarrow. Just struct, bytes, and a notebook full of scribbles.

Here's what I learned, with the code that taught me.

Rows vs columns: same data, different bytes

Say you have this data:

rows = [
    {"user_id": 1, "country": "KE", "amount": 12.5},
    {"user_id": 2, "country": "KE", "amount": 8.0},
    {"user_id": 3, "country": "NG", "amount": 19.9},
]

A row store writes it like this on disk:

[1,"KE",12.5][2,"KE",8.0][3,"NG",19.9]

To sum amount, you have to walk past user_id and country on every single row. The disk reads them whether you want them or not.

A column store flips it:

user_id:  [1, 2, 3]
country:  ["KE", "KE", "NG"]
amount:   [12.5, 8.0, 19.9]

Now SELECT sum(amount) reads exactly one contiguous block and ignores the rest. That single rearrangement is 80% of the magic. The other 20% is what you can do because values of the same type now sit next to each other.

Building the writer

Let me build a file format that stores each column separately. First, the layout. A column file will be: a small header, then one encoded block per column, then a footer that says where each column lives so we can seek straight to it.

import struct, json

MAGIC = b"COLLITE1"

def write_table(path, columns: dict):
    # columns = {"user_id": [1,2,3], "amount": [12.5,...], ...}
    offsets = {}
    with open(path, "wb") as f:
        f.write(MAGIC)
        for name, values in columns.items():
            offsets[name] = {"start": f.tell()}
            block = encode_column(values)
            f.write(struct.pack("<I", len(block)))
            f.write(block)
            offsets[name]["length"] = len(block) + 4
        # footer: JSON map of column -> offset, then its length
        footer = json.dumps(offsets).encode()
        f.write(footer)
        f.write(struct.pack("<I", len(footer)))

The footer-at-the-end trick is exactly what Parquet does. You write data first (streaming, you don't need to know total size up front), then write metadata, then write the metadata length as the very last 4 bytes. A reader opens the file, jumps to the end, reads those 4 bytes, then reads the footer, and now knows where every column starts — without scanning the file.

Encoding: where columns earn their keep

Here's the part that made me appreciate the format. Because a column holds one type and often repeats values, you can compress it cheaply before any general-purpose compressor touches it. I implemented three encodings and let each column pick the smallest.

Dictionary encoding (for low-cardinality strings)

country has 3 rows but only 2 distinct values. Storing the strings repeatedly is wasteful. Instead, build a dictionary and store integer codes:

def encode_dict(values):
    seen = {}
    codes = []
    for v in values:
        if v not in seen:
            seen[v] = len(seen)
        codes.append(seen[v])
    dictionary = list(seen.keys())     # ["KE", "NG"]
    # codes -> [0, 0, 1]
    return dictionary, codes

Now "KE","KE","NG" becomes [0,0,1] plus a 2-entry dictionary. On a column with millions of rows and 50 distinct countries, you go from storing long strings to storing tiny ints. This is also why columnar engines can filter country = 'KE' so fast — they resolve 'KE' to code 0 once, then scan a tight array of integers.

Run-length encoding (for sorted or repetitive data)

If the data is sorted or clumpy, store runs instead of values:

def encode_rle(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs   # [["KE",2],["NG",1]]

A column that's KE for a million rows becomes a single ["KE", 1_000_000] pair. Time-partitioned data (everything from the same day) compresses absurdly well this way.

Delta encoding (for monotonic integers)

user_id climbs: 1001, 1002, 1003... Storing the gaps instead of the absolute values keeps the numbers tiny, which then bit-packs nicely:

def encode_delta(values):
    out, prev = [], 0
    for v in values:
        out.append(v - prev)
        prev = v
    return out   # 1001,1002,1003 -> 1001,1,1,1

Now the smart part — pick the winner per column:

def encode_column(values):
    candidates = []
    # try each strategy, measure serialized size, keep the smallest
    candidates.append(("raw", serialize_raw(values)))
    if all(isinstance(v, int) for v in values):
        candidates.append(("delta", serialize_delta(encode_delta(values))))
    if len(set(values)) < len(values) * 0.5:
        candidates.append(("dict", serialize_dict(*encode_dict(values))))
    candidates.append(("rle", serialize_rle(encode_rle(values))))

    tag, payload = min(candidates, key=lambda c: len(c[1]))
    return tag.encode().ljust(4, b" ") + payload

Each column independently chooses its best encoding. That's why a real Parquet file can have one column dictionary-encoded, the next RLE, the next plain — all in the same file. There is no single right answer; the data decides.

Reading: only pay for what you select

The payoff. To answer SELECT amount FROM t, the reader touches only the amount block:

def read_column(path, name):
    with open(path, "rb") as f:
        f.seek(-4, 2)                       # last 4 bytes
        flen = struct.unpack("<I", f.read(4))[0]
        f.seek(-(4 + flen), 2)
        offsets = json.loads(f.read(flen))  # column -> {start, length}

        info = offsets[name]
        f.seek(info["start"])
        size = struct.unpack("<I", f.read(4))[0]
        block = f.read(size)
        return decode_column(block)         # dispatch on the 4-byte tag

Two seeks, one read of one block. The other 89 columns never leave the disk. This is "column pruning," and it's the single biggest reason analytical queries on columnar files fly.

The other free win: predicate pushdown

Once data is encoded per column, you can store cheap per-block statistics — min and max — in the footer. Then a query like WHERE amount > 1000 can skip entire blocks without decoding them:

def block_can_match(stats, op, value):
    if op == ">":
        return stats["max"] > value   # skip block if max <= value
    if op == "<":
        return stats["min"] < value
    return True

If a block's max amount is 50 and you're asking for > 1000, you skip it entirely. On a year of partitioned data, a date filter ends up reading a handful of blocks instead of all of them. This is the same idea behind Parquet row-group statistics and the reason WHERE clauses on sorted columns feel almost free.

What I actually took away

A few things stuck with me after building this:

Columnar isn't one trick, it's three stacked wins. Layout (read fewer columns), encoding (same-type values compress hard), and statistics (skip blocks). Each multiplies the others.
The footer-at-the-end pattern is everywhere — Parquet, ORC, even some log formats. Write data streaming, write metadata last, write metadata length as the final bytes. Random access without scanning.
Encoding choice is per-column and data-dependent. Don't reach for a global "best" codec. Let each column measure and pick.
It explains the gotchas, too. Columnar is brutal for SELECT * row lookups and single-row updates — exactly why you don't run your transactional app on Parquet. Knowing why helps you pick the right store instead of cargo-culting.

I'm not suggesting anyone ship a homemade format — use Parquet, use Arrow, they're battle-tested and handle a hundred edge cases I skipped (nested types, nulls, encryption, compression codecs). But building the 200-line toy version turned columnar storage from a word I nodded along to into something I can reason about when a query is slow.

The 40-second report? Once I understood column pruning, the fix was obvious: stop SELECT *, store it columnar, partition by date. It now runs in under a second.

If you build data systems, I'd push you to build the toy version of whatever you depend on — a tiny query engine, a mini WAL, a baby columnar store. A weekend of "this is dumb and slow but it works" teaches more than a month of reading docs.

I write one of these build-from-scratch breakdowns regularly — caches, API gateways, secrets vaults, data-contract validators, and now this. Follow me here on dev.to if you want the next one, and tell me in the comments: what's the system you rely on every day but have never built yourself? That's probably your next weekend project.

DEV Community