Aleksandr Yershov

Posted on Jun 3

qdf: a Go serializer that decodes less, packs harder, and lets you query the bytes

#go #opensource #serialization #performance

TL;DR for the impatient. qdf is a schemaless Go serializer (struct tags, no .proto). On real batches it's up to 68% smaller than protobuf, decodes 4–9× faster than encoding/json, ships hand-written AVX2/NEON bit-packing at ~50 GB/s, and does one thing no other mainstream Go serializer does: it can run SELECT … WHERE … over a []byte and decode only the columns and rows you asked for. Pure Go, zero dependencies. github.com/alex60217101990/qdf

This is the engineering deep-dive, not the marketing page. We're going to look at actual hexdumps, the codec picker's never-larger guarantee, the twin-bitmask three-valued predicate engine, and a profiler-driven argument about why your decode path is slow for a reason you probably haven't measured. If you write Go services that serialize the same five shapes forever — logs, events, metrics, RTB bids, OTLP spans — this is for you.

The problem nobody's format actually solves

Every mainstream serializer gives up at least one of three:

	schemaless	small wire	fast / cheap
`encoding/json`	✅	❌	❌ (allocates a mountain)
msgpack	✅	⚠️ (per-record)	⚠️
protobuf / flatbuffers	❌ (`.proto` + codegen)	✅	✅

JSON is universal and schemaless and burns CPU and GC like it's free. msgpack is smaller but you still decode the whole blob to read one field. protobuf and flatbuffers are fast and compact — right up until you're maintaining .proto files and a codegen step for what used to be a plain struct.

qdf is an attempt to refuse the tradeoff: self-describing wire (decode straight into a struct, no schema), protobuf-class sizes on batches, genuinely extreme decode speed, and a columnar mode you can query. Let's see how, byte by byte.

type Event struct {
    TS    int64  `qdf:"ts"`
    Level string `qdf:"level"`
    Code  int32  `qdf:"code"`
}

b, _ := qdf.Marshal(events, qdf.OptBalanced) // []Event -> []byte
var back []Event
_ = qdf.Unmarshal(b, &back)

Struct tags name fields, exactly like json:. No registry, no generated types to keep in sync. The decoder figures out mode, codecs and compression from the wire itself, so you never have to tell Unmarshal how the bytes were encoded.

1. The wire format in one look

A qdf buffer is a 5-byte header + a tagged body. That's the whole envelope.

51  44  46  01   XX      [ tagged body … ]
'Q' 'D' 'F' ver  flags   bytes 5 … N

The flags byte is a tiny bitmap telling the decoder which dialect the body speaks, so it can fast-path or reject before parsing a single value:

FlagDense (0x01) — body uses the Dense intern dialect (back-reference tags).
FlagQPack (0x02) — body may carry the QPack numeric/bool codec tags.
FlagRANS (0x04) — body is rANS-compressed; decompress first.
FlagColIndex (0x08) — a columnar payload carries a per-column length index (this is what makes selective decode an O(1) skip).
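As a minimal sketch of that pre-check (not qdf's own code: parseHeader and the lowercase flag names are illustrative, and only the byte values come from the layout above), a decoder can validate the envelope and read the dialect bits before touching the body:

```go
package main

import (
	"errors"
	"fmt"
)

// Flag bits as listed above. The Go names are illustrative and may not
// match qdf's exported identifiers.
const (
	flagDense    = 0x01
	flagQPack    = 0x02
	flagRANS     = 0x04
	flagColIndex = 0x08
)

// parseHeader checks the 5-byte envelope and returns the flags byte and body.
func parseHeader(buf []byte) (flags byte, body []byte, err error) {
	if len(buf) < 5 {
		return 0, nil, errors.New("short buffer")
	}
	if buf[0] != 'Q' || buf[1] != 'D' || buf[2] != 'F' {
		return 0, nil, errors.New("bad magic")
	}
	if buf[3] != 0x01 {
		return 0, nil, errors.New("unsupported version")
	}
	return buf[4], buf[5:], nil
}

func main() {
	// A made-up buffer: valid header with flags 0x03, then one body byte.
	buf := []byte{0x51, 0x44, 0x46, 0x01, 0x03, 0xa3}
	flags, body, err := parseHeader(buf)
	fmt.Println(flags&flagDense != 0, flags&flagColIndex != 0, len(body), err)
	// prints: true false 1 <nil>
}
```

Because every dialect decision hangs off one byte, a reader that doesn't support, say, rANS can reject the buffer after five bytes instead of failing halfway through the body.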

The base tag space is msgpack-shaped — fixint, fixstr, fixarr, typed scalars, str/bin/arr/map in 8/16/32 widths, negfixint. On top of that sit the Dense back-reference tags and the QPack codec tags. That base layer is why a Fast-mode qdf buffer is about as small as msgpack and just as quick; the extra tags are where qdf pulls ahead on batches.

An actual buffer, byte for byte

Encode one &Event{TS:7, Level:"ERR", Code:500} with OptSpeed → 29 bytes, every one accounted for:

51 44 46 01 00              QDF, ver 1, flags 0x00 (Fast)
d5 03                       map, 3 fields
82 74 73 07                 "ts"  -> fixint 7
85 6c 65 76 65 6c 83 45 52 52   "level" -> fixstr "ERR"
84 63 6f 64 65 c4 f4 01     "code" -> uint16 0x01F4 (500)

Two details that tell you how the encoder thinks:

It picked the narrowest tag that holds the value. 500 went out as a 2-byte uint16, not a 4-byte int32. The picker always reaches for the smallest tag, per value.
There's no schema anywhere. The keys ts/level/code are in the bytes. That's the cost of being schemaless on a single message — and exactly what Dense mode erases on a batch.

Flip to OptBalanced on a slice of these and the repeated keys (ts/level/code) and repeated values ("ERR") collapse to 1-byte back-references after first sight. Which brings us to the encoder.

2. Encode: it measures, then packs

qdf doesn't pick one scheme and pray. The encode pipeline:

value → typeDesc cache → columnar transpose → per-column codec picker
      → Dense intern → rANS (opt-in) → []byte

Reflection runs once per type, ever. The first call for a type builds a type descriptor — a flat array of encode/decode closures over unsafe field offsets — and caches it in a sync.Map. Every later call touches only those closures: no reflect.Value churn, no per-field type switch on the hot path.

The codec picker and the never-larger rule

For every numeric/bool slice the encoder runs a cheap bounded probe and emits the smallest of a family. The comparison includes the raw form, so if nothing wins it falls back — turning compression on can never inflate a slice. This "never-larger by construction" property is the whole reason you can flip OptBalanced on blindly.

codec	idea	wins on
FOR	store `value − min`, bit-pack to width of `max−min`	bounded ranges (HTTP codes 200–504 → 9 bits, not 32)
Delta+FOR	FOR over consecutive differences	monotonic-ish columns: timestamps, IDs, offsets
RLE	`(value, run-length)` pairs	long runs: status, enum, sparse flags
Dictionary	distinct table + bit-packed indices (`ceil(log2 d)` bits/row)	low cardinality, incl. string columns (level, region)
Patched FOR	FOR + an exception list for outliers	mostly-narrow columns with a few spikes
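To make the never-larger rule concrete, here is a rough sketch of a picker that sizes two of the codecs against the raw form and keeps the smallest. The cost model (a 9-byte FOR header, 10 bytes per RLE run) and all names are assumptions for illustration, not qdf's real probe or wire layout:

```go
package main

import (
	"fmt"
	"math/bits"
)

type candidate struct {
	name string
	size int // estimated bytes
}

// forBits returns the bits per value FOR needs: enough to hold max-min.
func forBits(vals []int64) int {
	lo, hi := vals[0], vals[0]
	for _, v := range vals[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	// Subtract as uint64 so the full int64 range cannot overflow.
	return bits.Len64(uint64(hi) - uint64(lo))
}

// countRuns counts maximal runs of equal neighbours.
func countRuns(vals []int64) int {
	runs := 0
	for i, v := range vals {
		if i == 0 || v != vals[i-1] {
			runs++
		}
	}
	return runs
}

// pick returns the smallest candidate. Raw is always in the running,
// which is what makes the result never larger than the uncompressed slice.
func pick(vals []int64) candidate {
	best := candidate{"raw", 8 * len(vals)}
	if len(vals) == 0 {
		return best
	}
	for _, c := range []candidate{
		{"FOR", 9 + (len(vals)*forBits(vals)+7)/8}, // assumed: 8-byte min + 1-byte width
		{"RLE", 10 * countRuns(vals)},              // assumed: 8-byte value + 2-byte length
	} {
		if c.size < best.size {
			best = c
		}
	}
	return best
}

func main() {
	a := pick([]int64{200, 504, 200, 504})
	b := pick([]int64{-9223372036854775808, 9223372036854775807})
	fmt.Println(a.name, a.size) // prints: FOR 14
	fmt.Println(b.name, b.size) // prints: raw 16
}
```

The second input is the adversarial case: the range needs all 64 bits, so every codec loses to raw and the picker falls back.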

Delta+FOR, with the actual bytes

Take []int64{1000, 1001, …, 1009} — ten 8-byte integers, 80 bytes raw. Marshal(ints, OptQPack) gives 12 bytes total:

00000000  51 44 46 01 02 e6 07 00  d0 0f 02 0a   |QDF.........|

Header is 5 bytes (flags 0x02 = QPack), so the body is 7 bytes for ten int64s. Codec 0xE6 = Delta+FOR: it stored the first value, the minimum delta, and the residual deltas bit-packed. Since every delta is exactly 1, the residuals collapse to almost nothing.

That's the mechanism behind the headline 512× compression on monotonic timestamp vectors — a clock column is the perfect case: large absolute values, tiny constant deltas.
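The arithmetic behind that is small enough to sketch. The function below is a hypothetical helper, not qdf's encoder, and it ignores delta overflow on extreme inputs; it computes the three things the paragraph above says Delta+FOR stores and shows why a steady clock needs zero residual bits:

```go
package main

import (
	"fmt"
	"math/bits"
)

// deltaFOR reports what a Delta+FOR encoder has to keep for vals:
// the first value, the smallest delta, and the bit width of (delta - minDelta).
func deltaFOR(vals []int64) (first, minDelta int64, width int) {
	first = vals[0]
	if len(vals) < 2 {
		return first, 0, 0
	}
	minDelta = vals[1] - vals[0]
	maxDelta := minDelta
	for i := 2; i < len(vals); i++ {
		d := vals[i] - vals[i-1]
		if d < minDelta {
			minDelta = d
		}
		if d > maxDelta {
			maxDelta = d
		}
	}
	width = bits.Len64(uint64(maxDelta) - uint64(minDelta))
	return first, minDelta, width
}

func main() {
	steady := []int64{1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009}
	fmt.Println(deltaFOR(steady)) // prints: 1000 1 0

	jittery := []int64{1000, 1003, 1004, 1010} // deltas 3, 1, 6
	fmt.Println(deltaFOR(jittery)) // prints: 1000 1 3
}
```

With a width of 0 the residual section is empty, so the cost no longer depends on how many values the column holds; that is where very large ratios on clock columns come from.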

SIMD bit-packing — same wire, faster code

The bit-pack/unpack kernels are hand-written assembly: AVX2 on amd64, NEON on arm64, and they emit byte-identical output to the scalar path. Tests assert scalar ≡ SIMD bit-for-bit. So -tags qdf_simd is purely faster, never a different wire — runtime CPUID gate, scalar fallback on anything without AVX2.

22–53× over scalar at byte-aligned widths
~50 GB/s unpack (memory-bound there, not compute-bound)

If you run OptBalanced/OptCompression over numeric data, this build tag is free money:

go build -tags qdf_simd ./...

Implementation note for the SIMD-curious: on amd64, the decode kernels lean on VPMOVZX widen-loads and VPBROADCASTQ+VPSRLVQ variable-per-lane shifts (a per-offset shift table picks the bit offset for each lane); encode uses VPSHUFB byte-gather and VPSLLVQ+lane-OR. On arm64, several of the instructions the NEON kernels need have no mnemonic in Go's Plan 9 assembler and get hand-encoded via WORD. It's the kind of code where "byte-identical to scalar" is a property you test, not hope for.
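For readers who haven't written one, a scalar reference kernel is the thing such a test compares against. The sketch below is a deliberately slow bit-at-a-time pack/unpack pair; the LSB-first bit order and the function names are assumptions, not qdf's actual layout:

```go
package main

import "fmt"

// pack stores the low `width` bits of each value back to back, LSB first.
func pack(vals []uint32, width uint) []byte {
	out := make([]byte, (uint(len(vals))*width+7)/8)
	var bitPos uint
	for _, v := range vals {
		for b := uint(0); b < width; b++ {
			if v&(1<<b) != 0 {
				out[bitPos/8] |= 1 << (bitPos % 8)
			}
			bitPos++
		}
	}
	return out
}

// unpack reverses pack for n values of the given width.
func unpack(data []byte, n int, width uint) []uint32 {
	out := make([]uint32, n)
	var bitPos uint
	for i := range out {
		for b := uint(0); b < width; b++ {
			if data[bitPos/8]&(1<<(bitPos%8)) != 0 {
				out[i] |= 1 << b
			}
			bitPos++
		}
	}
	return out
}

func main() {
	vals := []uint32{0, 5, 304, 511}
	packed := pack(vals, 9)
	fmt.Println(len(packed), unpack(packed, len(vals), 9))
	// prints: 5 [0 5 304 511]
}
```

A fast kernel is then checked by feeding both implementations the same inputs at every width and comparing the output bytes for equality.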

The four-layer Dense dialect (strings & structure)

Repeated strings and field names are where batch formats bleed. Dense mode stacks four mechanisms so the second occurrence of a value is nearly free. Take []string{"eu-west-1","eu-west-1","eu-west-1"} under OptBalanced — 19 bytes:

00000000  51 44 46 01 03 a3 e0 09  65 75 2d 77 65 73 74 2d  |QDF.....eu-west-|
00000010  31 e8 e8                                          |1..|

bytes	meaning
`51 44 46 01 03`	header, flags `0x03` (Dense | QPack)
`a3`	fixarr, 3 elements
`e0 09 65…31`	1st value: intern declaration — tag + len 9 + `"eu-west-1"`
`e8`	2nd value: one-byte back-reference
`e8`	3rd value: one byte again

First "eu-west-1" costs 11 bytes; each repeat costs 1. That's the whole game on telemetry, where region/service/level repeat across thousands of rows. The four layers producing those one-byte refs:

Intern table — first sight stored, assigned an id; later sights become a varint reference.
Move-to-front — the hot set resolves in 1–2 bytes via a small MRU ring (recent values get the shortest codes).
Markov-0 "same as last" — a value equal to the previous one is a single repeat tag (the e8 above).
Markov-1 pair predictor — if "GET" is usually followed by "/health", the predicted successor collapses too.
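A rough size model for the first and third layers reproduces the numbers in the hexdump. The declaration cost (tag + 1-byte length + body) and the 1-byte repeat are read off the dump above; the 2-byte cost for a reference to an older value is an assumption, and move-to-front and the pair predictor are left out entirely:

```go
package main

import "fmt"

// denseSize estimates the bytes a string column costs with an intern table
// plus a "same as last" repeat tag. Strings are assumed shorter than 256 bytes.
func denseSize(vals []string) int {
	seen := make(map[string]bool)
	size := 0
	last, haveLast := "", false
	for _, v := range vals {
		switch {
		case haveLast && v == last:
			size++ // repeat tag: one byte, as in the dump
		case seen[v]:
			size += 2 // assumed cost of a back-reference to an older value
		default:
			seen[v] = true
			size += 2 + len(v) // declaration: tag + length + body
		}
		last, haveLast = v, true
	}
	return size
}

func main() {
	region := []string{"eu-west-1", "eu-west-1", "eu-west-1"}
	fmt.Println(denseSize(region)) // prints: 13
}
```

Those 13 bytes plus the 5-byte header and the 1-byte array tag give the 19 bytes shown in the dump.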

Floats get Gorilla (lossless XOR coding over math.Float64bits; values are compared as bit patterns, never with ==, so NaN, ±Inf and −0.0 round-trip bit-exactly) and ALP (decimal-mantissa coding for quantized metrics/prices, with an exception list for anything that doesn't round-trip exactly). The opt-in order-0 rANS pass is the final never-larger squeeze for cold storage.
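The XOR step at the heart of Gorilla-style coding is easy to sketch with the standard library. This shows only that step (a full codec also encodes the leading and trailing zero counts of each XOR word), and the function names are hypothetical:

```go
package main

import (
	"fmt"
	"math"
)

// xorEncode replaces each value's bit pattern with its XOR against the
// previous one. Equal neighbours become 0; similar neighbours become words
// with long runs of zero bits, which is what the later stages compress.
func xorEncode(vals []float64) []uint64 {
	out := make([]uint64, len(vals))
	var prev uint64
	for i, v := range vals {
		b := math.Float64bits(v)
		out[i] = b ^ prev
		prev = b
	}
	return out
}

// xorDecode reverses xorEncode.
func xorDecode(words []uint64) []float64 {
	out := make([]float64, len(words))
	var prev uint64
	for i, w := range words {
		prev ^= w
		out[i] = math.Float64frombits(prev)
	}
	return out
}

// roundTripsExactly compares bit patterns, so NaN and -0.0 are checked too.
func roundTripsExactly(vals []float64) bool {
	got := xorDecode(xorEncode(vals))
	for i := range vals {
		if math.Float64bits(got[i]) != math.Float64bits(vals[i]) {
			return false
		}
	}
	return true
}

// awkward returns values that break ==-based comparison.
func awkward() []float64 {
	return []float64{1.5, 1.5, math.NaN(), math.Copysign(0, -1), math.Inf(1), math.Inf(-1)}
}

func main() {
	fmt.Println(xorEncode([]float64{1.5, 1.5})[1]) // prints: 0
	fmt.Println(roundTripsExactly(awkward()))      // prints: true
}
```

Working on bit patterns is what keeps the awkward values safe: NaN != NaN under ==, but its 64 bits compare and XOR like any others.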

The structural win (and the gotcha)

Here's why qdf lands smaller than protobuf on real batches: it dedups and compresses across records. protobuf, msgpack, json and flatbuffers encode each record independently, so a repeated string or a smooth float series re-pays its cost every single row. qdf pays once per batch.

Gotcha #1: that cross-record win needs a batch. On a single small message there's nothing to dedup, so OptBalanced ≈ OptSpeed ≈ msgpack in size — use OptSpeed there and skip the Dense bookkeeping.

Gotcha #2: the Dense wire embeds intern/shape ids that depend on emission order, so two semantically-equal payloads can differ byte-for-byte. If you hash or sign the bytes, encode with OptSpeed.

3. The headline: read less than the whole message

Hand qdf a []struct and it transposes rows into columns — think Parquet, but automatic and still self-describing. Each column then gets the codec that fits it: timestamps go Delta+FOR, an enum-ish level goes dictionary, a run-heavy code goes RLE.

rows ([]Event)              columns (each its own codec)
┌────┬───────┬──────┐       ┌──────────┬────────┬──────┐
│ ts │ level │ code │  →    │ ts ts ts │ level… │ code…│
│ …  │  …    │  …   │       │ Delta+FOR│  dict  │ RLE  │
└────┴───────┴──────┘       └──────────┴────────┴──────┘

With OptColumnIndex the encoder also writes, right after the shape declaration, a fixed-width index: one uint32 byte-length per column. That index is the key — it lets the decoder compute each column's start offset and jump straight past any column it doesn't need, without parsing a byte of it.
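Turning per-column lengths into start offsets is a prefix sum. The sketch below assumes the index is n little-endian uint32 values laid out back to back; the byte order and the helper name are assumptions, since the post only states "one uint32 byte-length per column":

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// columnOffsets reads n uint32 byte-lengths from idx and returns n+1 offsets:
// offs[i] is where column i starts and offs[i+1] is where it ends, both
// relative to the first column byte.
func columnOffsets(idx []byte, n int) ([]int, error) {
	if n < 0 || len(idx) < 4*n {
		return nil, errors.New("short column index")
	}
	offs := make([]int, n+1)
	for i := 0; i < n; i++ {
		offs[i+1] = offs[i] + int(binary.LittleEndian.Uint32(idx[4*i:]))
	}
	return offs, nil
}

func main() {
	// Three columns of 12, 300 and 7 bytes.
	idx := []byte{
		0x0c, 0x00, 0x00, 0x00,
		0x2c, 0x01, 0x00, 0x00,
		0x07, 0x00, 0x00, 0x00,
	}
	offs, err := columnOffsets(idx, 3)
	fmt.Println(offs, err) // prints: [0 12 312 319] <nil>
}
```

With the offsets in hand, reading only the third column means slicing cols[offs[2]:offs[3]] and never looking at the 312 bytes before it.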

Querying the bytes

buf, _ := qdf.Marshal(events, qdf.OptBalanced|qdf.OptColumnIndex)

// "SELECT ts, code WHERE level='ERROR' AND code>=500" — over a []byte.
type Hot struct {
    TS   int64 `qdf:"ts"`
    Code int32 `qdf:"code"`
}
var hot []Hot
_ = qdf.Unmarshal(buf, &hot,
    qdf.Where("level", func(s string) bool { return s == "ERROR" }),
    qdf.Where("code",  func(c int32) bool { return c >= 500 }))

What the decoder actually does, in order:

Read the shape + column index. Now it knows where every column starts.
Filter columns — decode only the columns named in a predicate (level, code). Run each predicate across its whole column to produce a per-row bitmask.
Combine the masks (AND here) into the surviving-row set.
Project — for the columns Hot wants (ts, code), materialize values only at the surviving rows. level was read to filter, then dropped because Hot doesn't contain it. Every other column is skipped via the index — its bytes are never parsed.

The predicate engine: twin bitmasks + SQL three-valued logic

It isn't just AND-of-equals. And, Or, Not compose into a real predicate tree — and the tricky part is nullable columns: in SQL, a comparison against NULL is neither true nor false, it's UNKNOWN. qdf gets this right with twin bitmasks per node: a T mask (rows definitely true) and an F mask (rows definitely false). Anything in neither is UNKNOWN.

Leaf: run the predicate per present row → fills T; F = present &^ T (present-but-not-true). Absent (nil) rows land in neither — UNKNOWN, for free.
AND: T = T₁ & T₂, F = F₁ | F₂ (false if any child is false — even if another is unknown).
OR: T = T₁ | T₂, F = F₁ & F₂.
NOT: swap T and F (unknown stays unknown).

The final result keeps only rows in the root T mask — TRUE, never FALSE, never UNKNOWN — which is exactly SQL WHERE semantics.

A neat optimization: a subtree with no nullable leaves can't produce UNKNOWN, so qdf skips materializing its F mask entirely and treats "not true" as the complement — one fewer pass over the rows.

_ = qdf.Unmarshal(buf, &hot,
    qdf.Or(
        qdf.Where("level", func(s string) bool { return s == "ERROR" }),
        qdf.And(
            qdf.Where("code", func(c int32) bool { return c >= 500 }),
            qdf.Not(qdf.Where("level", func(s string) bool { return s == "DEBUG" })),
        ),
    ))
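The mask rules above fit in a few lines. This is a sketch over a single uint64 (so at most 64 rows) with hypothetical names, not qdf's engine; it follows the leaf/AND/OR/NOT rules exactly as stated:

```go
package main

import "fmt"

// tri holds the twin masks: bit i of T means row i is definitely true,
// bit i of F means definitely false, and neither means UNKNOWN.
type tri struct{ T, F uint64 }

// leaf evaluates pred on present rows only; absent rows stay UNKNOWN.
func leaf(present uint64, n int, pred func(row int) bool) tri {
	var t uint64
	for i := 0; i < n; i++ {
		if present&(1<<i) != 0 && pred(i) {
			t |= 1 << i
		}
	}
	return tri{T: t, F: present &^ t}
}

func and(a, b tri) tri { return tri{T: a.T & b.T, F: a.F | b.F} }
func or(a, b tri) tri  { return tri{T: a.T | b.T, F: a.F & b.F} }
func not(a tri) tri    { return tri{T: a.F, F: a.T} }

// example runs level == "ERROR" AND code >= 500 over four rows, then negates it.
func example() (where, negated tri) {
	// "" and -1 stand in for NULL; the present masks keep them from being read.
	levels := []string{"ERROR", "", "INFO", "ERROR"}
	codes := []int32{500, 503, 200, -1}
	const levelPresent, codePresent = 0b1101, 0b0111 // row 1 / row 3 are NULL

	isError := leaf(levelPresent, 4, func(i int) bool { return levels[i] == "ERROR" })
	is5xx := leaf(codePresent, 4, func(i int) bool { return codes[i] >= 500 })

	where = and(isError, is5xx)
	return where, not(where)
}

func main() {
	w, n := example()
	fmt.Printf("%04b %04b\n", w.T, w.F) // prints: 0001 0100
	fmt.Printf("%04b %04b\n", n.T, n.F) // prints: 0100 0001
}
```

Rows 1 and 3 each have one NULL operand, so they appear in neither mask before or after the NOT. A two-valued engine would have wrongly returned them from the negated query.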

The predicate is called once per row against the native typed value — func(int32) bool, func(string) bool — with zero interface boxing. Pure projection without a filter is just Select("ts","code").

No mainstream Go serializer does this. json, msgpack, protobuf, gob — all decode the whole message before you can read one field. For "store a wide batch, read a few columns or filter rows later," qdf is the only one that reads less than everything.

Concretely, on a wide batch at low selectivity (i7-9750H):

~5× faster than full decode (projection)
~5× less memory than full decode
~2.5× faster than decode-everything-then-filter

When it applies: you need OptColumnIndex at encode time, a []struct batch, and flat-ish fields. The bigger and wider the batch and the more selective the query, the larger the win. It's the columnar-warehouse pattern brought to a plain Go []byte — no database, no schema. (It is not for single messages or streaming — that's the row-by-row half of the design.)

4. Decode: the fastest work is the work you skip

Here's the claim that should change how you think about serializer performance:

Profile any serializer's decode and the truth is the same: it's allocation-bound, not CPU-bound.

Run go test -memprofile on a string-heavy decode and look at -alloc_objects. On qdf's row path it's almost entirely one call: (*Decoder).ReadString — copying string bodies out of the buffer into owned Go strings. Tag walking, bounds checks, type dispatch — rounding error. So the levers that matter aren't clever ALU tricks. They're don't allocate and don't decode.

Lever 1 · Zero-copy decode

var out []Event
_ = qdf.Unmarshal(data, &out, qdf.WithNoCopy()) // strings alias data, no copy

WithNoCopy returns strings and byte slices that point into data instead of copying out. On a string-heavy batch: ~1.7× faster, 7000+ allocations collapse to 3 (a fixed handful that includes the output slice; none are per-string). The decoder is already pooled and its scratch buffers reused, so with aliasing there's essentially nothing left to allocate per value.

The catch is honest and it's in the name. The returned values are valid only while data stays alive and unmodified. The footgun:

func handler(w http.ResponseWriter, r *http.Request) {
    buf := pool.Get(); defer pool.Put(buf) // recycled!
    io.ReadFull(r.Body, buf)
    var msg Msg
    qdf.Unmarshal(buf, &msg, qdf.WithNoCopy())
    queue <- msg // msg.Field aliases buf … which is about to be reused → garbage
}

That's a use-after-free the race detector won't catch (it's not a data race — it's manual memory). So WithNoCopy is opt-in by design: perfect for read-and-discard over a buffer you own (a file, an mmap, a batch you process then drop), wrong for a pooled request body that outlives the call. Works on the reflection path, codegen, and streams.
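If a value does have to outlive the buffer, the standard-library escape hatch is to copy just that value. The sketch below mimics an aliased string with unsafe.String purely to reproduce the hazard (how qdf builds its aliases isn't shown here, so treat aliasString as a stand-in) and uses strings.Clone to detach it:

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// aliasString returns a string sharing memory with b, standing in for what
// a no-copy decode hands back.
func aliasString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(&b[0], len(b))
}

// detach copies s into fresh memory so it survives reuse of the source buffer.
func detach(s string) string { return strings.Clone(s) }

func demo() (aliased, owned string) {
	buf := []byte("ERROR")
	aliased = aliasString(buf)
	owned = detach(aliased) // copy made while buf is still intact
	copy(buf, "DEBUG")      // the pool hands buf to the next request
	return aliased, owned
}

func main() {
	aliased, owned := demo()
	// owned is still "ERROR"; aliased now reflects the overwritten buffer.
	fmt.Println(aliased, owned)
}
```

Cloning the one or two fields you forward keeps most of the allocation win while making the hand-off safe.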

Lever 2 · Decode in struct order

The encoder writes fields in struct-declaration order, so on decode the next wire field is almost always the next struct field. The decoder keeps a cursor and tries the expected field first — one string compare — before falling back to a map lookup. A profile of a wide-struct decode had ~40% of time in mapaccess1_faststr + the hash; the cursor removes that on the common path. The map stays as the fallback, so out-of-order, partial, and unknown fields still decode correctly — you just pay the lookup for the ones that actually arrive out of order.
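The cursor idea can be sketched independently of qdf. The type below is hypothetical; it tries the expected field with one string compare and counts how often it has to fall back to the map:

```go
package main

import "fmt"

// fieldIndex resolves wire keys to struct field positions.
type fieldIndex struct {
	names   []string       // fields in declaration order
	byName  map[string]int // fallback lookup
	cursor  int            // position of the field expected next
	mapHits int            // fallback lookups taken, for illustration
}

func newFieldIndex(names ...string) *fieldIndex {
	f := &fieldIndex{names: names, byName: make(map[string]int, len(names))}
	for i, n := range names {
		f.byName[n] = i
	}
	return f
}

// reset rewinds the cursor at the start of each record.
func (f *fieldIndex) reset() { f.cursor = 0 }

// lookup returns the field position for key, or -1 for an unknown field.
func (f *fieldIndex) lookup(key string) int {
	if f.cursor < len(f.names) && f.names[f.cursor] == key {
		i := f.cursor
		f.cursor++
		return i // fast path: no hashing
	}
	f.mapHits++
	i, ok := f.byName[key]
	if !ok {
		return -1
	}
	f.cursor = i + 1 // resynchronise after an out-of-order key
	return i
}

func main() {
	f := newFieldIndex("ts", "level", "code")
	for _, k := range []string{"ts", "level", "code"} {
		f.lookup(k)
	}
	fmt.Println(f.mapHits) // prints: 0

	f.reset()
	fmt.Println(f.lookup("code"), f.lookup("ts"), f.lookup("nope"), f.mapHits)
	// prints: 2 0 -1 3
}
```

In-order records never touch the map; only keys that arrive out of order or are unknown pay for the hash.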

Lever 3 · Lazy, pooled state

Decoders come from a sync.Pool, and their machinery — the intern table, scratch slices — allocates only on first use. A plain struct decode never touches the intern table, so it never pays for it. (Concretely: moving that table behind a lazily-allocated pointer cut a chunk of per-call overhead, because the codegen path builds a fresh decoder per nested value and was zeroing ~4 KiB of table it never used.)

Lever 4 · The biggest win: don't decode at all

Everything from §3 lands here too. Selective decode skips whole columns via the index and never rebuilds filtered rows. If your read pattern is "a few columns of a big batch," the fastest qdf decode is the one that touches almost none of the bytes. No micro-optimization beats not doing the work.

For the last drop: codegen

//go:generate qdfgen -type Event,Batch .

qdfgen emits concrete methods using only the public API — no reflect at runtime, no descriptor lookup. The generated decoder is a flat key switch (and it threads noCopy, so zero-copy works on generated types too):

func (v *Sample) UnmarshalQDFOpts(src []byte, noCopy bool) (int, error) {
    d := qdf.NewDecoderOnBuf(src)
    if noCopy { d.SetNoCopy(true) }
    n, err := d.ReadMapHeader()
    // …
    for i := 0; i < n; i++ {
        kb, _ := d.ReadStringBytes()
        switch string(kb) { // no alloc: compiler special-cases switch string([]byte)
        case "name": { rv, _ := d.ReadString(); v.Name = rv }
        case "age":  { rv, _ := d.ReadInt();    v.Age  = int(rv) }
        // …
        }
    }
}

On a fixed schema that's up to 8.5× faster decode than encoding/json.

And on encode, AppendMarshal hands you buffer ownership for zero per-call allocation:

out = out[:0]
out, _ = qdf.AppendMarshal(out, v, qdf.OptBalanced) // reuse your own buffer

The mental model: encode allocations are constant (a flat 3, output buffer pooled); decode allocations scale with how much you ask for. So the two levers that matter are alias-instead-of-copy (WithNoCopy) and ask-for-less (selective decode).

5. Benchmarks, and how they're measured

2019 i7-9750H, Go 1.26. Wire sizes are deterministic. Latencies are median of 6 runs; throughput claims use benchstat over ≥10 interleaved runs so a single warm/cold run can't lie. Everything reproducible from the bench/ module — a separate module so competitor deps (protobuf, vmihailenco/msgpack, flatbuffers) stay out of the core, which has zero dependencies:

cd bench
go test -run='^$' -bench Decode -benchmem -count=10 | tee new.txt
benchstat old.txt new.txt

Wire size vs the field (bytes, lower is better)

fixture	json	msgpack	protobuf	qdf balanced	qdf compress
OTLP 4×512	1 027 033	793 192	561 860	240 686	179 181
Logs 1024	245 037	193 476	156 479	89 631	62 149
RTB 1024	559 294	428 404	327 700	258 167	203 360
Events 1024	122 857	84 712	64 978	39 650	39 639
IoT 32×256	469 058	224 534	207 562	158 474	148 177

Smaller than protobuf on every batch. With OptCompression: OTLP −68%, Logs −60%, Events −39%, RTB −38%, IoT −29%. OptBalanced alone already lands between −21% (RTB) and −57% (OTLP). That's because qdf compresses across records and protobuf doesn't. That's the entire gap.

Throughput

workload	result
Decode vs `encoding/json`	4–9× faster across payloads (2–7× vs msgpack)
Numeric/bool slices (QPack)	5× smaller than json, 21× faster encode, 80× faster decode
SIMD bit-unpack (AVX2/NEON)	22–53× over scalar, ~50 GB/s (memory-bound)
~150 MiB realistic payload (Dense)	7.5× faster encode, 8.1× faster decode than json
Encode (Fast, pooled)	~1.1 GB/s, 3 allocs/op — vs ~1000 allocs/op for json & msgpack
Zero-copy decode (string batch)	7002 → 3 allocs, −38% B/op, ~1.7× faster
Codegen decode	up to 8.5× over json on a fixed schema
Selective decode (few columns)	~5× faster & ~5× less memory than full decode

Note the asymmetry: encode is a flat 3 allocations no matter the payload; decode allocations scale with how much you ask for — which is exactly why WithNoCopy and selective decode matter.

6. Which knob, when

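One Options bitmask on the encode side. You never tell Unmarshal how the bytes were encoded: it reads the header and handles whatever it gets. The decode-side options (Where, Select, WithNoCopy) only control what gets read and how it's returned.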

Option	Reach for it when
`OptSpeed`	Hot path, single messages, sub-µs latency. msgpack-shaped. The drop-in `encoding/json` replacement. Also: use it if you hash/sign the bytes.
`OptBalanced`	Default for batches: Dense interning + adaptive numeric codecs. Big wire win, still fast.
`OptBalanced|OptColumnIndex`	Batches you plan to read selectively later: Where/Select over a few columns or rows.
`OptCompression`	Cold storage. Adds Gorilla/ALP + rANS. Smallest wire; encode slower — write-once-read-rarely.
`WithNoCopy()`	Read-mostly over a buffer you own and won't mutate. Near-zero-alloc decode.
`AppendMarshal`	Own the output buffer for zero per-call allocation.
`qdfgen`	Fixed schema, every nanosecond counts — reflection-free generated methods.

The presets are just bundles of bits you'd compose by hand anyway:

const (
    OptSpeed       = 0 // Fast mode, nothing on
    OptBalanced    = OptDense | OptQPack | OptShapeIntern | OptPairPred | OptMTF
    OptCompression = OptBalanced | OptGorillaFloat | OptRANS
)

One axis, left to right: lowest CPU → smallest bytes. And every step is never-larger, so moving right never inflates a buffer.

OptSpeed  ──▶  OptBalanced  ──▶  OptCompression
fastest        −43% vs proto     smallest
≈ msgpack      still fast        slower encode

The same Logs-1024 batch, measured: json 245 KB → msgpack 193 KB → protobuf 156 KB → OptBalanced 90 KB → OptCompression 62 KB.

Two build tags — free performance, off by default

Orthogonal to Options: these change the generated machine code, not the wire. Same bytes, faster processing.

-tags qdf_simd — AVX2 (amd64) / NEON (arm64) bit-pack kernels, byte-identical output, runtime CPUID gate + scalar fallback. 22–53× over scalar. If you run OptBalanced/OptCompression on numeric data, turn it on — it's free.
-tags qdf_reflect2 — swaps reflect.MakeSlice/MakeMapWithSize/New for modern-go/reflect2 unsafe equivalents → smaller decode allocations on map/slice-heavy payloads. The one honesty note: this is the single opt-out from zero-dependency. Worth it if your data is map/slice-dense and you're not on codegen.

go build -tags "qdf_simd,qdf_reflect2" ./...  # combine freely

7. Streaming

enc := qdf.NewStreamEncoder(w, qdf.Dense)
for _, ev := range events { _ = enc.Encode(&ev) }
enc.Close()

dec := qdf.NewStreamDecoder(r)
for {
    var ev Event
    if err := dec.Decode(&ev); err == io.EOF { break } else if err != nil { return err }
}

The header is written once; the Dense intern table is shared across messages, so a 10k-row log pays for each distinct key once (the second message's "region":"eu-west-1" is a back-reference into the first). Each message is length-framed — a uvarint byte-count precedes its body — so a message of any size round-trips, even across a reader that hands you one byte per Read, and io.EOF marks the end cleanly. SetNoCopy works here too; aliases stay valid for the stream's lifetime because the window is never compacted.

QDF hdr   │ len₁ · msg₁ │ len₂ · msg₂ │ … EOF
5B once   │ uvarint+body│ uvarint+body│
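The framing itself needs nothing beyond the standard library. The sketch below is not qdf's stream code: it leaves out the 5-byte stream header, and the helper names and the maxFrame cap are illustrative. It shows uvarint length prefixes surviving a reader that returns one byte per Read:

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"testing/iotest"
)

const maxFrame = 1 << 20 // illustrative cap so a corrupt length can't force a huge allocation

// writeFrame writes a uvarint byte-count followed by the message body.
func writeFrame(w io.Writer, msg []byte) error {
	var hdr [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(hdr[:], uint64(len(msg)))
	if _, err := w.Write(hdr[:n]); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readFrame returns the next message, or io.EOF at a clean frame boundary.
func readFrame(r *bufio.Reader) ([]byte, error) {
	n, err := binary.ReadUvarint(r)
	if err != nil {
		return nil, err
	}
	if n > maxFrame {
		return nil, errors.New("frame too large")
	}
	msg := make([]byte, n)
	if _, err := io.ReadFull(r, msg); err != nil {
		return nil, err
	}
	return msg, nil
}

// roundTrip frames msgs, then reads them back one byte per Read call.
func roundTrip(msgs ...string) ([]string, error) {
	var buf bytes.Buffer
	for _, m := range msgs {
		if err := writeFrame(&buf, []byte(m)); err != nil {
			return nil, err
		}
	}
	r := bufio.NewReader(iotest.OneByteReader(&buf))
	var out []string
	for {
		msg, err := readFrame(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, string(msg))
	}
}

func main() {
	out, err := roundTrip("a", "", "eu-west-1")
	fmt.Println(len(out), out[2], err) // prints: 3 eu-west-1 <nil>
}
```

The length prefix is what lets the reader tell a clean end (EOF before any length byte) from a truncated message (EOF partway through a length or body).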

Streaming and columnar are the two halves of the design: streaming is row-by-row for unbounded feeds; columnar is a complete batch you can query. So the whole-batch features — OptColumnIndex, Where/Select, OptRANS — aren't part of streaming, by design.

8. Where it doesn't win (the honest part)

OptSpeed wire ≈ msgpack — the speed tier skips columnar compression on purpose. Use OptBalanced when you want the bytes back.
The compression tier's encode is slower (Gorilla/ALP cost real CPU). It's a storage play, not a hot path.
protobuf and flatbuffers still win raw single-message decode and single-tiny-message size — generated code and zero-copy field access are hard to beat when there's no batch to amortize over. Different tool for "one small message, decoded whole, hot."

qdf's sweet spot is batches of structured records you want small on the wire and partially readable later: telemetry, logging, metrics, analytics, event sourcing.

Try it

go get github.com/alex60217101990/qdf

Pure Go, zero dependencies — nothing to vendor, no schema compiler in your pipeline. Swap it in where you use encoding/json, flip a batch path to OptBalanced|OptColumnIndex, read back just the columns you need — then go stare at your allocation graph.

Repo: github.com/alex60217101990/qdf
Full API reference: pkg.go.dev/github.com/alex60217101990/qdf

If the query model or the codec picker is useful to you, a ⭐ on the repo helps others find it. And if you find a payload shape where qdf loses that it shouldn't — open an issue with the fixture. That's the most useful bug report there is.
