DEV Community

Haji Rufai
Haji Rufai

Posted on

I Built a Columnar File Format in Pure Python — a tiny, readable Parquet

A while back I caught myself in an interview saying "yeah, I use Parquet a lot" — and then realized I couldn't actually explain why it's faster than a CSV beyond hand-waving about "columnar." That bugged me. So I did the thing that always fixes this for me: I rebuilt it from scratch.

The result is Columna, a columnar storage engine and file format written in pure Python. No pandas, no pyarrow, no numpy. Just struct and zlib from the standard library. It's about 3,000 lines, it has 81 tests, and on a 50,000-row dataset it ends up 91% smaller than the equivalent CSV while reading 98% fewer bytes to answer a filtered query.

Here's what I learned building it.

Why columnar actually wins

A CSV is row-major. The bytes for row 1 come first, then row 2, and so on. That layout is great for "give me this whole record," but it's terrible for analytics. If you want the average of one column, you still have to read every byte of every other column just to step over them. There's no way to skip.

Columnar storage flips this. All the values for amount are stored together, then all the values for region, and so on. Three things fall out of that one decision:

First, compression gets dramatically better, because values in a single column are similar to each other — same type, often similar magnitude, frequently repeated. A column of four country codes compresses far better than a row mixing an integer id, a float, and three strings.

Second, you can read just the columns a query needs. Want two columns out of seven? Read two column chunks, skip the rest. That's projection pushdown.

Third — and this is the one that surprised me how much it matters — if you store min/max stats per chunk, you can skip whole blocks of rows without decoding them. That's predicate pushdown, and it's where the real speedups live.

The file format

I went with a footer-last layout, same idea as Parquet. The writer streams row groups out as it goes and only writes the metadata at the very end:

MAGIC "CNA1"
Row Group 0
  Column Chunk "order_id"  -> pages
  Column Chunk "region"    -> pages
  ...
Row Group 1 ...
FOOTER (zlib-compressed JSON: schema, offsets, encodings, min/max)
FOOTER_LEN (uint32) + MAGIC
Enter fullscreen mode Exit fullscreen mode

Why footer-last? Because when you're writing, you don't know a column chunk's compressed size or its min/max until you've seen all the rows. Putting metadata at the front would mean buffering the entire file in memory or seeking backwards. Writing it at the end lets you stream. The reader just seeks to the last few bytes, grabs the footer length and magic, inflates the footer JSON, and now it has a map of the entire file: every column chunk's byte offset, which encoding it used, and its min/max. All of that before reading a single value. That map is the whole trick.

A row group is just a horizontal slice of rows — say 5,000 at a time. Each row group holds one chunk per column, and each chunk is split into self-describing pages with a tiny header (magic byte, encoding id, codec id, value count, null count) and a CRC32 checksum. The checksum caught two of my own bugs during development, so I'm keeping it.

Five encodings, and letting the data decide

This was my favorite part. There are five encodings:

  • PLAIN — varint integers, IEEE-754 floats, length-prefixed strings. The fallback.
  • DICTIONARY — for low-cardinality columns. Store the distinct values once, replace the column with small integer indices, then bit-pack those indices.
  • RLE — run-length encoding for long stretches of the same value.
  • DELTA — for monotonic data like ids and timestamps. Store the first value, then zig-zag varints of each gap. A sequence like 100000, 100001, 100003 becomes one full integer plus single-byte deltas.
  • BITPACK — subtract the column minimum and pack the leftovers into the fewest bits that fit. Quantities of 1 through 12 need 4 bits each, not a full byte.

Now, how do you pick the right one? Early on I wrote a clever heuristic — "if cardinality is low use dictionary, if it looks sorted use delta" — and it kept choosing badly on edge cases. So I threw it out and did the dumb thing that turned out to be the right thing: the writer actually encodes every column with every applicable encoding, measures the bytes, and keeps whichever is smallest. Then it runs an optional DEFLATE pass and keeps that only if it shrinks further.

It's more work at write time, but it can never pick something worse than PLAIN, and the encoding id lives in the page header so the reader stays completely generic. I'd rather burn write CPU than ship a format that sometimes makes files bigger.

def choose_encoding(values, dtype):
    best_enc, best_buf = Encoding.PLAIN, encode_with(Encoding.PLAIN, values, dtype)
    for enc in applicable(dtype):
        buf = encode_with(enc, values, dtype)
        if len(buf) < len(best_buf):
            best_enc, best_buf = enc, buf
    return best_enc, best_buf
Enter fullscreen mode Exit fullscreen mode

Nulls are handled with a definition bitmap — one bit per row marking which positions are present. Only non-null values go into the value stream. An all-null column costs about one bit per row and zero value bytes; a column with no nulls skips the bitmap entirely.

Predicate pushdown without lying about it

Here's the part I actually cared about getting right. When you run scan(where=col("amount") > 4500), the engine walks the row groups and, for each one, checks the footer stats. If a group's max amount is 4500 or less, it cannot possibly contain a matching row, so the engine skips it — no decode, no read.

def can_skip(self, stats):
    mn, mx = stats[self.column]["min"], stats[self.column]["max"]
    if mn is None or mx is None:
        return False              # unreliable stats -> never skip
    if self.op == ">":  return mx <= self.value
    if self.op == "<":  return mn >= self.value
    if self.op == "==": return self.value < mn or self.value > mx
Enter fullscreen mode Exit fullscreen mode

The if mn is None line matters more than it looks. The cardinal sin of a pushdown engine is dropping a row that should have matched. If stats are missing, or polluted by a NaN, the only safe move is to refuse to skip and fall back to a full scan of that group. Slower, but correct. Correctness beats speed every single time here, and I wrote a test that hammers this: it generates random data, runs random predicates through the engine, and compares the result against a brute-force Python filter. If pushdown ever changed the result set, that test would scream. It doesn't.

Every scan returns a ScanResult that reports rows_scanned, row_groups_skipped, and bytes_read, so the speedup is something you can measure instead of something I claim.

The numbers

I built a benchmark script that pits Columna against plain CSV on the same 50,000-row orders dataset:

operation CSV Columna
File size 2,592 KB 225 KB (91% smaller)
Read 1 of 7 columns 2,592 KB 154 KB (94% less)
Filter amount > 4500 2,592 KB 47 KB (98% less)

For that last filter, Columna touched only 3 of 10 row groups — it skipped 7 entirely using nothing but the footer min/max stats. A CSV has no choice but to read the whole file, every time, for every query.

What I'd tell my past self

Two things stuck with me. One: the "try everything and measure" approach beat my clever heuristic every time, and it was less code. Two: a file format is mostly an exercise in being honest about offsets and lengths. Once the footer correctly describes where every byte lives, projection and predicate pushdown are almost free — they're just decisions about which byte ranges to not read.

It's a teaching-grade engine, not a production store — single file, single thread, no nested types, no schema evolution. But every byte on disk is explained by readable Python, and I can finally answer that interview question properly.

If you want to poke at it, the live inspector renders a real .cna file's row groups and encodings in the browser: https://hajirufai.github.io/columna/report.html — and the source is at https://github.com/hajirufai/columna.

Top comments (0)