DEV Community: Aleksandr Yershov

Store embeddings at a fidelity budget, not a bit-count — a lossy vector codec for Go

Aleksandr Yershov — Thu, 02 Jul 2026 14:45:49 +0000

An embedding is a []float32. A store of a million 128-dimensional ones is
512 MB of float32 that will spend most of its life being scanned for nearest
neighbours. But here's the thing about that 512 MB: almost none of it is load-bearing.
Nearest-neighbour search only cares about the geometry of the vectors —
their relative angles — not the exact bits. Store them bit-exact and you're
paying full price for precision your retrieval never uses.

qdf — a schemaless binary serializer for Go — has an opt-in lossy vector codec
built on that observation. Instead of asking you how many bits to keep, it asks
what fidelity you need — "keep cosine similarity ≥ 0.99" — and then spends the
fewest bytes that clears that bar. This post is about the knob, the numbers, and
when you should (and shouldn't) turn it on.

The knob is a fidelity budget, not a bit-width

Every other quantizer I've used makes you pick a representation: int8, 4-bit,
this many centroids. That's backwards — you don't care about bits, you care
about whether retrieval still works. qdf inverts it. You set a budget on the
output quality and the codec picks the bits:

enc := qdf.NewEncoderWith(qdf.OptBalanced | qdf.OptLossyVec)
enc.SetVectorBudget(qdf.MinCosine(0.99)) // or MaxRelError / TargetSNR

Three ways to state the budget:

MinCosine(0.99) — bound the minimum cosine similarity between the original and reconstructed vector. This is the one for embeddings: cosine is what your ANN index compares.
MaxRelError(1e-3) — bound the per-vector relative L2 error. For when the magnitude matters, not just the direction.
TargetSNR(40) — target a signal-to-noise ratio in dB, for signal-ish float columns.

The codec only touches []float32 / []float64 columns of a []struct, and
only slices long enough to amortize its header (short vectors fall back to the
lossless path automatically). Everything else in the struct — the IDs, the
metadata — encodes losslessly as usual.

How it spends the bytes

Under the hood the pipeline is: rotate → quantize → entropy-code, with a
never-larger fallback.

A Hadamard rotation spreads each vector's energy evenly across its dimensions. This is the quiet workhorse — it turns a few large-magnitude components into many similar ones, which makes the quantization error isotropic and, crucially, makes the codec behave the same on "nice" smooth vectors and on adversarial ones (more on that below).
A scalar or E8-lattice quantizer maps the rotated components to a grid sized to hit your budget. The lattice option packs the same fidelity into fewer bits by exploiting that rotated components cluster near a sphere.
A static rANS entropy pass squeezes the residual redundancy out of the quantized symbols.
Never-larger fallback: if the whole lossy body ever comes out bigger than the plain lossless encoding — which can happen on tiny or already-compact columns — qdf ships the lossless bytes instead. Turning the codec on can never inflate your output. (This is the same discipline the rest of the format runs on; see the codec-selection writeup.)

The numbers — reproduce them yourself

Here's a harness you can paste and run. It builds a corpus, sweeps the cosine
budget, and measures the worst cosine actually achieved across every vector —
not the average, the worst, because a floor is only a floor if nothing falls
through it.

enc := qdf.NewEncoderWith(qdf.OptBalanced | qdf.OptLossyVec)
enc.SetVectorBudget(qdf.MinCosine(target))
_ = enc.EncodeValue(docs)
lossy := enc.Bytes()

var back []Doc
_ = qdf.Unmarshal(lossy, &back)
// then: worst = min over i of cosine(docs[i].Emb, back[i].Emb)
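
The "worst = min over i" step in the comment above can be sketched with the standard library alone. The helper names cosine and worstCosine are illustrative and not part of qdf; the sketch assumes the decoded slice has the same length and order as the original.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of a and b, accumulated in float64.
// It returns 0 if the lengths differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		x, y := float64(a[i]), float64(b[i])
		dot += x * y
		na += x * x
		nb += y * y
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / math.Sqrt(na*nb)
}

// worstCosine returns the minimum cosine over all (original, decoded) pairs.
// A count mismatch is treated as a failure and reported as 0.
func worstCosine(orig, decoded [][]float32) float64 {
	if len(orig) != len(decoded) {
		return 0
	}
	worst := 1.0
	for i := range orig {
		if c := cosine(orig[i], decoded[i]); c < worst {
			worst = c
		}
	}
	return worst
}

func main() {
	// Toy data standing in for docs[i].Emb and back[i].Emb.
	orig := [][]float32{{1, 0, 0}, {0, 1, 0}}
	decoded := [][]float32{{0.99, 0.01, 0}, {0.1, 0.9, 0}}
	fmt.Printf("worst cosine: %.4f\n", worstCosine(orig, decoded))
}
```

Accumulating in float64 keeps the measurement itself from adding rounding noise at the fourth decimal place, which is where these floors are compared.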

Run on 2000 vectors, float32, on ubuntu-latest. Two corpora: a smooth
one (sinusoids — the flattering case every quantizer paper uses) and a
random-unit one (Gaussian, L2-normalized — the honest worst case, essentially
incompressible and much closer to what a real embedding model emits).

128 dimensions (lossless baseline: 520 B/vec):

budget	random-unit B/vec	worst cosine	vs lossless
`cos≥0.99`	72.3	0.9955	−86%
`cos≥0.995`	80.3	0.9977	−85%
`cos≥0.999`	98.8	0.9995	−81%

768 dimensions (lossless baseline: 3080 B/vec):

budget	random-unit B/vec	worst cosine	vs lossless
`cos≥0.99`	407	0.9914	−87%
`cos≥0.995`	470	0.9957	−85%
`cos≥0.999`	618	0.9992	−80%

Two things worth calling out, because they're the honest part:

The budget holds. At cos≥0.99 the worst vector in 2000 lands at 0.9955 (128 dimensions) and 0.9914 (768 dimensions), above the floor every time. The knob means what it says.
Random ≈ smooth. On the same run, the smooth corpus compressed to 72.1 B/vec and random-unit to 72.3 — a 0.3% difference. Most quantizers fall apart on incompressible input; the Hadamard rotation is why this one doesn't. If a vector codec only publishes numbers on smooth synthetic data, be suspicious. These are within noise of each other.

At cos≥0.99, 128-dim vectors land around 4.5 bits/dimension versus float32's
32 — a 7× shrink for a cosine hit you'd struggle to measure in recall@10 on most
indexes.
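
As a back-of-envelope check, the bits-per-dimension figure follows from the 128-dimension random-unit row of the table above:

```latex
\[
\frac{72.3\ \text{B} \times 8\ \text{bits/B}}{128\ \text{dims}} \approx 4.52\ \text{bits/dim},
\qquad
\frac{32}{4.52} \approx 7.1\times
\]
```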

When to turn it on — and when not to

Situation	Verdict
Storing/shipping embeddings for ANN retrieval	Yes — cosine is exactly the metric the budget defends.
You can tolerate ~0.99 cosine (most RAG / semantic search)	Yes — start at `MinCosine(0.99)`, tighten if recall dips.
Vectors are the bulk of your payload	Yes — this is where −80…−87% actually moves your bill.
You need bit-exact reconstruction (dedup by hash, checksums, reproducible training)	No — use `OptBalanced`; lossy is opt-in for a reason.
Short vectors (< 32 elems) or vectors are a rounding error in your payload	Skip — the codec falls back to lossless anyway; no harm, no gain.
Downstream compares exact float equality	No — quantization changes the bits by design.

The rule of thumb: if your pipeline already treats embeddings as approximate —
and ANN search does — then storing them bit-exact is precision you're paying for
and throwing away. Pick the cosine floor your retrieval can live with and let
the codec find the bits.

Try it

go get github.com/alex60217101990/qdf
go run github.com/alex60217101990/qdf/examples/embeddings

The runnable examples/embeddings program is the harness above, trimmed. Swap
in your own vectors, sweep the budget, and
watch the worst-case cosine — if you find a corpus where the floor doesn't hold,
that's a bug and there's an issue template for it. Measured beats anecdotal.

Repo: https://github.com/alex60217101990/qdf

How a Go serializer picks the smallest encoding for every column — and never guesses wrong

Aleksandr Yershov — Wed, 01 Jul 2026 18:10:36 +0000

There is no single best way to encode a batch of records. A column of HTTP
status codes wants run-length encoding. A column of monotonically increasing
timestamps wants delta coding. A column of trace IDs wants substring
compression. A column of embeddings wants something else entirely. Pick one
codec for the whole message and you leave most of the win on the floor.

qdf — a schemaless binary serializer for Go — takes the opposite approach: it
transposes a []struct into columns and then chooses a codec per column,
measuring the candidates instead of guessing. And it does it under a rule that
makes the choice safe to turn on blindly: it can never produce a larger column
than the plain encoding. This post is about how that works and why the
never-larger rule is the part that actually matters.

Step 1: transpose to columns

Given a []struct, row-major encoding writes field-by-field, record after
record. That's what json, msgpack, and protobuf do — and it's why a repeated
"region":"eu-west-1" costs its full length in every row.

qdf, under its Dense/columnar path, pivots the batch: all the Status values
together, all the Timestamp values together, all the TraceID values
together. Now each column is a homogeneous array — and homogeneous arrays are
exactly what specialized codecs are good at.
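
A minimal sketch of that pivot, assuming a hypothetical three-field Record type. qdf does this generically from struct tags; the hand-written version below only illustrates the data layout, not the library's implementation.

```go
package main

import "fmt"

// Record is a hypothetical row type used only for this illustration.
type Record struct {
	Status    int
	Timestamp int64
	TraceID   string
}

// Columns holds the same batch pivoted into one slice per field.
type Columns struct {
	Status    []int
	Timestamp []int64
	TraceID   []string
}

// transpose pivots a row-major batch into column-major slices.
func transpose(rows []Record) Columns {
	c := Columns{
		Status:    make([]int, 0, len(rows)),
		Timestamp: make([]int64, 0, len(rows)),
		TraceID:   make([]string, 0, len(rows)),
	}
	for _, r := range rows {
		c.Status = append(c.Status, r.Status)
		c.Timestamp = append(c.Timestamp, r.Timestamp)
		c.TraceID = append(c.TraceID, r.TraceID)
	}
	return c
}

func main() {
	rows := []Record{
		{200, 1000, "a1b2"},
		{200, 1010, "c3d4"},
		{500, 1020, "e5f6"},
	}
	c := transpose(rows)
	fmt.Println(c.Status)    // every status value, contiguous
	fmt.Println(c.Timestamp) // every timestamp, contiguous
	fmt.Println(c.TraceID)   // every trace ID, contiguous
}
```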

Step 2: a menu of codecs per column type

Once you're looking at one column, the codec space opens up:

Integers / durations / counts

FOR (frame of reference) — subtract the column minimum, bit-pack the residuals. Great for bounded ranges (ports, status codes, small counters).
Delta + FOR — encode the first value plus bit-packed deltas against a running predictor. This is the one for monotonic sequences (timestamps, ids).
RLE — one (value, run-length) pair per run. Wins hard on enum-like columns where the same value repeats (log levels, booleans, sparse counters).
Dictionary — a table of distinct values plus a bit-packed index per row.
Patched FOR (PFOR) — FOR with an exception list for the few outliers that would otherwise blow up the bit width.
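
To make three of the integer codecs above concrete, here is a hedged sketch of the transforms only: the FOR bit width, the delta sequence, and the RLE runs. The function names are illustrative, and the bit-packing and wire layout that qdf actually uses are not shown.

```go
package main

import (
	"fmt"
	"math/bits"
)

// forWidth returns the column minimum and the bits needed per residual once
// that minimum is subtracted (frame of reference). Empty input yields (0, 0).
func forWidth(col []int64) (base int64, width int) {
	if len(col) == 0 {
		return 0, 0
	}
	lo, hi := col[0], col[0]
	for _, v := range col[1:] {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	// Subtract in uint64 so a range wider than MaxInt64 does not overflow.
	return lo, bits.Len64(uint64(hi) - uint64(lo))
}

// deltas returns the first value and the successive differences after it.
func deltas(col []int64) (first int64, diffs []int64) {
	if len(col) == 0 {
		return 0, nil
	}
	diffs = make([]int64, 0, len(col)-1)
	for i := 1; i < len(col); i++ {
		diffs = append(diffs, col[i]-col[i-1])
	}
	return col[0], diffs
}

// run is one (value, run-length) pair.
type run struct {
	Value int64
	Len   int
}

// rle collapses consecutive equal values into runs.
func rle(col []int64) []run {
	var out []run
	for _, v := range col {
		if n := len(out); n > 0 && out[n-1].Value == v {
			out[n-1].Len++
			continue
		}
		out = append(out, run{Value: v, Len: 1})
	}
	return out
}

func main() {
	status := []int64{200, 200, 200, 404, 200, 500}
	base, width := forWidth(status)
	fmt.Println("FOR base and width:", base, width)
	fmt.Println("RLE runs:", rle(status))

	ts := []int64{1000, 1010, 1020, 1031}
	first, diffs := deltas(ts)
	_, dw := forWidth(diffs) // Delta + FOR: bit-pack the deltas, not the values
	fmt.Println("delta first, diffs, width:", first, diffs, dw)
}
```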

Floats

Gorilla XOR — XOR each sample against the previous one and store only the differing bits. Built for smooth time-series (sensor readings, gauges).
ALP — for decimal-ish []float64/[]float32 that are secretly fixed-point (prices, quantized values), store the integer mantissa.
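
The XOR step behind the Gorilla codec can be sketched as follows. This shows only the residual computation and how few bits differ between neighbouring samples; the control bits and bitstream of a full Gorilla encoder are omitted, and the function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorResiduals XORs each sample's IEEE-754 bit pattern with the previous
// sample's. The first entry is the raw bit pattern of samples[0].
func xorResiduals(samples []float64) []uint64 {
	out := make([]uint64, len(samples))
	var prev uint64
	for i, s := range samples {
		cur := math.Float64bits(s)
		out[i] = cur ^ prev
		prev = cur
	}
	return out
}

// meaningfulBits counts the bits between the leading and trailing zero runs
// of a residual, which is the part an encoder would actually have to store.
func meaningfulBits(x uint64) int {
	if x == 0 {
		return 0 // identical to the previous sample
	}
	return 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
}

func main() {
	gauge := []float64{21.5, 21.5, 21.75, 22.0}
	for i, r := range xorResiduals(gauge) {
		fmt.Printf("sample %d: residual %016x, meaningful bits %d\n", i, r, meaningfulBits(r))
	}
}
```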

Strings

Dictionary and front-coding for low-cardinality or shared-prefix columns (SIDs, DNs, paths, URLs).
Alphabet packing for high-cardinality values drawn from a small alphabet (hex / base32 / base64 IDs — store each char in ceil(log2|A|) bits).
FSST — a learned table of up to 255 substrings for high-cardinality free text (log lines, URLs), compressing at the byte level.
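
The ceil(log2|A|) arithmetic for alphabet packing is easy to check by hand. The sketch below computes the per-character width for a given alphabet size and packs lowercase hex at 4 bits per character; the names are illustrative, and qdf's actual packing layout is not assumed.

```go
package main

import (
	"fmt"
	"math/bits"
)

// bitsPerChar returns ceil(log2(alphabetSize)), the bits needed per symbol.
func bitsPerChar(alphabetSize int) int {
	if alphabetSize <= 1 {
		return 0
	}
	return bits.Len(uint(alphabetSize - 1))
}

// packedBytes returns the bytes needed to store chars symbols from an
// alphabet of the given size, rounded up to a whole byte.
func packedBytes(chars, alphabetSize int) int {
	return (chars*bitsPerChar(alphabetSize) + 7) / 8
}

// packHex packs a lowercase hex string at 4 bits per character, high nibble
// first. It reports false if s contains a character outside [0-9a-f].
func packHex(s string) ([]byte, bool) {
	out := make([]byte, (len(s)+1)/2)
	for i := 0; i < len(s); i++ {
		var v byte
		switch c := s[i]; {
		case c >= '0' && c <= '9':
			v = c - '0'
		case c >= 'a' && c <= 'f':
			v = c - 'a' + 10
		default:
			return nil, false
		}
		if i%2 == 0 {
			out[i/2] |= v << 4
		} else {
			out[i/2] |= v
		}
	}
	return out, true
}

func main() {
	fmt.Println("hex, base32, base64 bits per char:", bitsPerChar(16), bitsPerChar(32), bitsPerChar(64))
	fmt.Println("32-char hex trace ID packs into bytes:", packedBytes(32, 16))
	packed, ok := packHex("deadbeef")
	fmt.Printf("%x %v\n", packed, ok)
}
```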

Whole body

rANS — a final static order-0 entropy pass that squeezes the residual byte-entropy the structural codecs leave behind.

That's a lot of choices. The interesting question is not "which codecs exist"
— it's "how do you pick, per column, without a config file and without getting
it wrong."

Step 3: probe, then pick the smallest

For each column, qdf runs a cheap bounded probe that predicts the encoded size
of the viable candidates, then emits the smallest. The probe is designed to be
much cheaper than actually encoding every candidate — it estimates from column
statistics (min/max, run structure, distinct count) rather than doing the full
work five times.
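
A hedged sketch of what such a probe could look like for an integer column. The cost model (8-byte plain values, an 8-byte FOR header, 12 bytes per RLE run) is invented for illustration and is not qdf's; a real bounded probe would also cap the distinct-value tracking instead of using an unbounded map.

```go
package main

import (
	"fmt"
	"math/bits"
)

// candidate is one codec with its predicted encoded size.
type candidate struct {
	Name  string
	Bytes int
}

// probe makes one pass over the column and predicts sizes from min/max,
// run count, and distinct count. The byte costs are assumptions.
func probe(col []int64) []candidate {
	n := len(col)
	if n == 0 {
		return []candidate{{"plain", 0}}
	}
	lo, hi := col[0], col[0]
	runs := 1
	distinct := map[int64]struct{}{col[0]: {}}
	for i := 1; i < n; i++ {
		v := col[i]
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
		if v != col[i-1] {
			runs++
		}
		distinct[v] = struct{}{}
	}
	width := bits.Len64(uint64(hi) - uint64(lo))
	idxWidth := bits.Len(uint(len(distinct) - 1))
	return []candidate{
		{"plain", 8 * n},
		{"for", 8 + (n*width+7)/8},
		{"rle", 12 * runs},
		{"dict", 8*len(distinct) + (n*idxWidth+7)/8},
	}
}

// pickSmallest returns the smallest candidate. Plain is listed first and
// only a strictly smaller prediction replaces it, so ties keep plain.
func pickSmallest(cands []candidate) candidate {
	best := cands[0]
	for _, c := range cands[1:] {
		if c.Bytes < best.Bytes {
			best = c
		}
	}
	return best
}

func main() {
	status := make([]int64, 0, 1000)
	for i := 0; i < 900; i++ {
		status = append(status, 200)
	}
	for i := 0; i < 100; i++ {
		status = append(status, 500)
	}
	fmt.Println(probe(status))
	fmt.Println("picked:", pickSmallest(probe(status)))
}
```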

The expensive tiers (Gorilla, FSST, rANS) are gated behind opt-in flags
(OptCompression), because they trade encode CPU for bytes and you don't always
want that trade. The cheap structural codecs (FOR, Delta, RLE, dictionary) run
on the default OptBalanced tier.

Step 4: the never-larger guarantee

Here's the rule that ties it together: for every codec, qdf compares the
candidate encoding against the plain one and emits the compressed form only when
it is strictly smaller. If a "compression" codec would make a column bigger —
which absolutely happens on adversarial or already-incompressible data — qdf
emits the plain encoding instead.

The consequence is the useful part: turning compression on can never inflate
your output. You don't have to reason about whether your data is a good fit.
You don't have to benchmark before flipping the flag. The worst case is "no
better than plain," never "worse than plain." That property is what lets qdf
auto-select aggressively instead of shipping a pile of knobs.

It also composes down to the whole message: the final rANS pass is applied only
when it shrinks the body, so OptCompression is never larger than
OptBalanced, which is never larger than the plain encoding.
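
The rule in this step fits in a few lines. The sketch below uses a toy byte-level RLE as the candidate codec and a hypothetical one-byte tag to record which form was emitted; neither is qdf's wire format.

```go
package main

import "fmt"

// Hypothetical tags recording which form was emitted.
const (
	tagPlain  byte = 0
	tagPacked byte = 1
)

// rleBytes is a toy codec: (count, value) pairs, runs capped at 255.
// On data without repeats it doubles the size, which is the point here.
func rleBytes(in []byte) []byte {
	var out []byte
	for i := 0; i < len(in); {
		j := i
		for j < len(in) && in[j] == in[i] && j-i < 255 {
			j++
		}
		out = append(out, byte(j-i), in[i])
		i = j
	}
	return out
}

// encodeNeverLarger emits the compressed form only when it is strictly
// smaller than the plain bytes; otherwise it emits the plain bytes.
func encodeNeverLarger(plain []byte, compress func([]byte) []byte) (tag byte, body []byte) {
	if packed := compress(plain); len(packed) < len(plain) {
		return tagPacked, packed
	}
	return tagPlain, plain
}

func main() {
	for _, in := range []string{"aaaaaaaaaaaa", "abcdef"} {
		tag, body := encodeNeverLarger([]byte(in), rleBytes)
		fmt.Printf("input %d bytes, tag %d, body %d bytes\n", len(in), tag, len(body))
	}
}
```

Because the comparison happens on the real candidate bytes, the guarantee does not depend on how accurate any earlier size estimate was.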

What it buys you, measured

On real telemetry batches (GitHub Actions ubuntu-latest, Go 1.26), wire size
versus protobuf:

batch	qdf balanced	qdf compression
OTLP traces	−75%	−77%
logs	−72%	−72%
RTB bids	−25%	−39%
events	−39%	−39%
IoT floats	−24%	−29%

The wins track the data: OTLP and logs are string-heavy and repetitive, so
interning + columnar string codecs dominate; RTB and IoT are less repetitive, so
the numeric codecs do the work and the margins are smaller. That's the honest
shape of it — the codec picker is only as good as the redundancy in your data.

The discipline behind the menu

The codec list above is the survivors. The measure-first process that picked
per-column codecs also killed a lot of ideas that looked good on paper:
GPU-offloaded rANS (only wins on multi-MB single bodies — qdf messages are KB),
SIMD-gathered rANS (5× slower than scalar interleaved pre-AVX512),
multicore columnar encode (memory-bandwidth-bound, ~1.0×), and a learned
ScaNN-style vector quantizer (measured under 1pp recall gain). Every codec in
the menu earned its slot on a benchmark, and none of them can make your output
bigger. That's the whole design in one sentence.

Try it

go get github.com/alex60217101990/qdf

data, _ := qdf.Marshal(batch, qdf.OptBalanced) // or OptCompression
var back []Record
_ = qdf.Unmarshal(data, &back)

Runnable examples (telemetry, query-the-bytes, embeddings, streaming,
zero-alloc decode) are in examples/, and the per-codec details live in the repo.

If you find a payload where a codec loses that it shouldn't — there's an issue
template for exactly that. Measured beats anecdotal.

Repo: https://github.com/alex60217101990/qdf

Shrinking AI embeddings on the wire — a lossy vector codec that beats Google's TurboQuant at equal recall

Aleksandr Yershov — Sat, 27 Jun 2026 09:18:40 +0000

A developer's walk-through of qdf's opt-in lossy vector codec: what it does,
why it lands within a hair of the information-theoretic floor, and how it
measures up against Google's TurboQuant on a reproducible benchmark.

The problem nobody budgets for

A single 768-dimensional float32 embedding is 3,072 bytes. That sounds
harmless until you have a few million of them. A 10M-document RAG index is
~30 GB of just vectors — before metadata, before the ANN graph, before
replication. Embeddings are quietly the dominant storage and bandwidth line item
of every vector database and every retrieval pipeline.

Here's the thing: embeddings are not telemetry that must round-trip
bit-for-bit. Nobody cares whether coordinate 412 comes back as 0.0193847 or
0.0193851. What you care about is that nearest-neighbour search returns the
same neighbours — i.e. that cosine similarity is preserved to a few decimal
places. That is the exact regime where lossy quantization is free money, and
it's why qdf ships an opt-in lossy codec for []float32 / []float64
fields (OptLossyVec). Off by default, so no exact workload is ever silently
approximated; one flag flip when you want it.

The part I find genuinely nice as an application developer: you don't bolt on a
second system.

You serialize your []struct{ ID, Text, Emb []float32 } with the same
Marshal/Unmarshal you already use. The scalar and string fields stay
bit-exact; the vector field is batched into one lossy column; the blob is
self-describing, so Unmarshal rebuilds the records with no flag and no side
schema. Metadata store and vector store collapse into one blob with one write
path. (See Example_aiEmbeddingStore in the package docs.)

What the codec actually does

For each float-vector column the encoder runs a four-idea pipeline. Three of the
four are things a CPU serializer can do that a fixed-width GPU codebook cannot —
and that's exactly where the size edge comes from.

Randomized Hadamard rotation — R = (1/√n)·H·D, a seed-driven sign-flip diagonal D composed with the Walsh–Hadamard transform H. It spreads per-coordinate outliers evenly so the data becomes approximately Gaussian — the ideal shape for low-bit quantization — at O(n·log n) cost and with no stored matrix: just a uint64 seed on the wire. This is the same idea Google's TurboQuant uses for KV-cache quantization.
A lattice, not a grid. The scalar quantizer snaps each coordinate to the nearest multiple of a step δ (the Z lattice — a cube). The E8 quantizer groups coordinates into 8-D blocks and snaps each to the nearest point of E8, the densest packing in 8 dimensions. E8's Voronoi cell is rounder than the cube, so it spends fewer bits for the same distortion.
Entropy-code the indices. After the rotation the quantization indices are near-Gaussian — a peaked distribution. A fixed-width code wastes bits on that; an order-0 rANS pass (the same entropy stage the rest of qdf uses) recovers them.
Never-worse. The encoder builds both quantizers and the plain lossless float encoding, and keeps whichever is smallest. OptLossyVec is a hint, never a commitment to inflate — an incompressible or exception-heavy column silently falls back to lossless.

NaN/±Inf are pulled into an exception list before quantization and written
back bit-exactly on decode, so non-finite values survive any budget untouched.
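
The rotation R = (1/√n)·H·D from the first step can be sketched with a fast Walsh–Hadamard transform. This is an illustration under stated assumptions, not qdf's code: the sign diagonal is drawn from math/rand with an int64 seed (the source says a uint64 seed but does not name the generator), and the vector length must be a power of two.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// applySigns multiplies v by the diagonal D of seed-derived ±1 entries.
// Calling it twice with the same seed restores v, since D·D = I.
func applySigns(v []float64, seed int64) {
	rng := rand.New(rand.NewSource(seed))
	for i := range v {
		if rng.Intn(2) == 1 {
			v[i] = -v[i]
		}
	}
}

// fwht applies the unnormalized Walsh–Hadamard transform in place in
// O(n log n). It panics unless len(v) is a power of two.
func fwht(v []float64) {
	n := len(v)
	if n&(n-1) != 0 {
		panic("fwht: length must be a power of two")
	}
	for h := 1; h < n; h *= 2 {
		for i := 0; i < n; i += 2 * h {
			for j := i; j < i+h; j++ {
				a, b := v[j], v[j+h]
				v[j], v[j+h] = a+b, a-b
			}
		}
	}
}

func scale(v []float64, s float64) {
	for i := range v {
		v[i] *= s
	}
}

// rotate applies R = (1/sqrt(n))·H·D in place: signs first, then H.
func rotate(v []float64, seed int64) {
	applySigns(v, seed)
	fwht(v)
	scale(v, 1/math.Sqrt(float64(len(v))))
}

// unrotate applies the inverse D·(1/sqrt(n))·H, using H·H = n·I.
func unrotate(v []float64, seed int64) {
	fwht(v)
	scale(v, 1/math.Sqrt(float64(len(v))))
	applySigns(v, seed)
}

func main() {
	v := []float64{1, 0, 0, 0, 0, 0, 0, 0} // one outlier coordinate
	rotate(v, 42)
	fmt.Println("rotated:", v) // energy spread over all coordinates
	unrotate(v, 42)
	fmt.Println("restored:", v)
}
```

Only the seed has to travel with the data, because both sides regenerate the same sign sequence from it.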

Why this is close to the performance ceiling

This is the claim worth defending, so let me be precise about which ceiling.

Distortion-rate. For a fixed quantizer, the bits you must spend at a target
distortion are bounded below by the source entropy after the optimal transform.
The pipeline attacks every term of that bound:

The Hadamard rotation decorrelates and Gaussianizes the coordinates. Rate-distortion theory says the Gaussian is the worst case for a fixed coder but the best-understood case for an optimal one. Crucially, the rotation also gives every coordinate roughly the same distribution, so a single step δ is near-optimal for every coordinate instead of being dragged around by outliers.
The E8 lattice captures the space-filling (granular) gain. Its normalized second moment is G ≈ 0.0717 versus the scalar cube's 1/12 ≈ 0.0833 — a ~0.65 dB coding gain, which is most of the gain available in 8 dimensions short of an impractically large vector quantizer.
rANS captures the entropy (coding) gain — the bits a fixed-width index leaves on the table once the distribution is peaked.
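
The ~0.65 dB figure for E8 follows directly from the two normalized second moments quoted above:

```latex
\[
10 \log_{10}\!\frac{G(\mathbb{Z})}{G(E_8)}
  = 10 \log_{10}\!\frac{1/12}{0.0717}
  \approx 10 \log_{10}(1.162)
  \approx 0.65\ \text{dB}
\]
```

Under the usual high-resolution approximation, where each extra bit per dimension buys about 6.02 dB, that is on the order of 0.1 bit per dimension at equal distortion.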

Granular gain + entropy gain are the two levers a practical quantizer has.
qdf pulls both. The only thing left on the table is a higher-dimensional
lattice (Leech in 24-D buys a further fraction of a dB) — which was prototyped
and measured-killed: the extra coset bookkeeping cost more than the packing
saved at these rates. That's the signature of being near the practical floor — the next idea
loses.
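
The entropy lever can be measured on toy data. The sketch below quantizes Gaussian samples on a scalar grid and compares the empirical order-0 entropy of the indices with the width a fixed-width code would need. It only estimates the bits an ideal order-0 coder could reach; it does not implement rANS, and the step size and sample count are arbitrary choices.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
	"math/rand"
)

// quantize snaps each coordinate to the nearest multiple of delta and
// returns the integer grid indices (the scalar Z-lattice quantizer).
func quantize(v []float64, delta float64) []int {
	idx := make([]int, len(v))
	for i, x := range v {
		idx[i] = int(math.Round(x / delta))
	}
	return idx
}

// entropyBits returns the empirical order-0 entropy in bits per symbol.
func entropyBits(idx []int) float64 {
	if len(idx) == 0 {
		return 0
	}
	counts := map[int]int{}
	for _, s := range idx {
		counts[s]++
	}
	n := float64(len(idx))
	var h float64
	for _, c := range counts {
		p := float64(c) / n
		h -= p * math.Log2(p)
	}
	return h
}

// fixedBits returns the bits per symbol a fixed-width code needs to cover
// the observed index range.
func fixedBits(idx []int) int {
	if len(idx) == 0 {
		return 0
	}
	lo, hi := idx[0], idx[0]
	for _, s := range idx[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	return bits.Len(uint(hi - lo))
}

func main() {
	rng := rand.New(rand.NewSource(1))
	v := make([]float64, 1<<16)
	for i := range v {
		v[i] = rng.NormFloat64() // stands in for rotated coordinates
	}
	idx := quantize(v, 0.25)
	fmt.Printf("fixed-width: %d bits/symbol, order-0 entropy: %.2f bits/symbol\n",
		fixedBits(idx), entropyBits(idx))
}
```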

Implementation. Separate from the bits-on-the-wire ceiling, the encoder is
allocation-bound, not algorithm-bound, in steady state. Reusing scratch across
calls (the pooled Marshal path) takes a 256×768 batch from 13,855 → 1,308
allocs/op and 21.2 MB → 2.0 MB/op — ~10× each — with byte-identical
output. Profiling past that point shows the encoder is output-bound: there's no
hot loop left to shave.

The benchmark: vs Google's TurboQuant (and naive, and PQ)

All methods compared on the same synthetic Gaussian corpus
(2,000 vectors × 256 dims), each with pre-built, buffer-reusing scratch so the
timing is apples-to-apples. Reproduce:

go run github.com/alex60217101990/qdf/cmd/qdf-vecbench@latest -synthetic -n 2000 -dim 256
# raw rows live in cmd/qdf-vecbench/rd.csv

The money chart is recall-vs-size. A method is better when its curve sits to the
left — fewer bytes at the same recall.

qdf-lossy is Pareto-better than both scalar baselines across the whole useful
recall band. Read off the iso-recall line at recall ≈ 0.90:

Method	bytes / vector	recall@10	notes
qdf-lossy (knob 0.05)	170.4	0.931	smaller and higher recall
naive scalar (5-bit)	176.0	0.901
TurboQuant rotated-scalar (5-bit)	184.0	0.903	rotation, but no entropy/lattice

At ~the same byte budget, qdf delivers higher recall; at ~the same recall, it
spends fewer bytes. docs/LOSSY-VECTOR.md quotes the headline as
≈17–22 % smaller at equal reconstruction quality (−18.8 % vs naive, −22.3 %
vs TurboQuant at the rel ≈ 0.05 operating point), and the win widens to
−21 % at looser budgets.

Notice TurboQuant lands between naive and qdf: the rotation alone helps (it's
why TurboQuant beats naive at high recall), but without the lattice and the
entropy stage it can't reach the qdf curve. That gap is the entropy + granular
gain, made visible.

What about Product Quantization? PQ is on the same chart's data
(rd.csv) and it's a fair question. The honest answer: on this corpus at
comparable quality it doesn't compete. PQ hits tiny sizes (2–16 B/vec) but
recall@10 collapses to 0.02–0.11 at those rates — it needs a trained
codebook and many more subspaces to approach 0.9 recall, which is a different
operating regime (and a separate training step). For a drop-in, training-free,
self-describing serializer codec, the scalar/lattice family is the right
comparison, and qdf wins it.

The honest trade

Smaller wire is not free. qdf does strictly more work per vector — a rotation
and an entropy-decode on the read path, plus a verify-loop on encode — than a
bare scalar quantizer:

Method	enc MB/s	dec MB/s	enc allocs
qdf-lossy	174	526	1
naive scalar	993	4054	0
TurboQuant rotated-scalar	543	949	0

(warm, buffer-reusing; from docs/LOSSY-VECTOR.md)

So if your bottleneck is raw quantization throughput, a naive scalar codec is
faster. qdf's codec is built for the write-once, read-many embedding store
where storage and bandwidth dominate the bill and a few hundred MB/s of encode
is irrelevant next to a 20 % smaller index replicated across a fleet. Pick the
tool for the bottleneck you actually have.

Why this instead of the usual stacks

The recall-vs-size chart settles how small qdf goes. But "use qdf" is an
architecture decision, not just a codec choice, so here's the honest comparison
against the three things you'd reach for otherwise.

A dedicated vector DB (FAISS / pgvector) with Product Quantization. This is the heavyweight answer, and it's the right one once you need a served, indexed, updatable ANN system at scale. But it's two systems — a vector store next to your metadata store — that you keep in sync, and PQ needs a trained codebook (a separate fit step, re-fit when the embedding model changes). qdf is the opposite trade: no training, no second store, no index to rebuild — a single file you Marshal once and Unmarshal anywhere. Use the vector DB when you've outgrown a flat scan; use qdf for the store underneath it, or for the very common case where a brute-force cosine scan over a few hundred thousand vectors is already fast enough.
protobuf / msgpack with raw float32. This is what most pipelines actually ship today, and it's exact — but it stores the full 1,024 bytes per 256-dim vector and copies every field on decode. You get correctness and a familiar format; you pay 6–7× the bytes of qdf-lossy and a per-field allocation on every read. If your vectors genuinely must be bit-exact, this is correct (and so is qdf with the flag off). If they're search vectors, you're paying for exactness nobody consumes.
Roll-your-own scalar quantization (int8 + glue code). The DIY path: quantize to int8, stuff it into msgpack, write a decoder. It works and it's small-ish (~176 B at 5-bit), but now you own a wire format, a dequantizer, and the edge cases (NaN/Inf, varying norms, the "is this column even worth quantizing" decision). qdf gives you the rotation + lattice + entropy stack, the never-worse guarantee, and NaN/Inf survival for free — and it's self-describing, so the reader needs no out-of-band schema.

The one-line version: qdf is the training-free, single-blob, self-describing
option. It won't beat a tuned PQ index on raw bytes, and it won't beat raw
float32 on encode speed — but it's the only one of the four where the metadata
and the vector live in one Marshal/Unmarshal you already know, with a
correctness floor (never-worse, exact-by-default) baked in.

Where it's useful

Situation	Use it?	Budget knob
Embedding store / RAG index (ANN search)	Yes — headline use case	`MinCosine(0.999)`
Bandwidth-bound embedding transfer	Yes	`MinCosine` or `MaxRelError(0.01)`
Model weight / activation tensors	Yes, with care	`MaxRelError` / `TargetSNR`, validate downstream
Exact scientific / financial floats	No — leave the flag off	—
Short vectors (< 32) or scalar float fields	n/a — won't fire	stays lossless automatically

The budget API speaks in the metric you actually reason about:

enc := qdf.NewEncoderWith(qdf.OptBalanced | qdf.OptLossyVec)
enc.SetVectorBudget(qdf.MinCosine(0.999)) // keep cosine similarity >= 0.999
if err := enc.EncodeValue(rows); err != nil {
    log.Fatal(err)
}
data := enc.Bytes()

var out []EmbedRow
_ = qdf.Unmarshal(data, &out) // no flag needed; the 0xFD tag self-describes
// out[i].Emb approximates rows[i].Emb with cosine >= 0.999

MinCosine bounds the dot-product metric ANN relies on; MaxRelError(eps)
bounds per-vector L2 error directly; TargetSNR(db) suits signal-style data.
Looser budget ⇒ smaller and faster — pick the loosest your downstream task
tolerates and verify recall on a held-out query set.

Production best practices

The codec is only half the win. The other half is decoding it without throwing
the size advantage away on allocations — embedding decode is allocation-bound,
not CPU-bound, so where the bytes land matters as much as how few there are.
This section is the part the API docs assume you'll figure out.

Write path: reuse one encoder, batch the column

Two habits make the encode side cheap:

Reuse a *qdf.Encoder across calls. qdf.Marshal allocates a fresh encoder state per call; a long-lived encoder reuses its rotation, coordinate, widen, and rANS scratch. On a 256×768 batch that's the 13,855 → 1,308 allocs/op (21.2 MB → 2.0 MB/op) difference — ~10× — with byte-identical output.

   enc := qdf.NewEncoderWith(qdf.OptBalanced | qdf.OptLossyVec)
   enc.SetVectorBudget(qdf.MinCosine(0.999))
   for _, batch := range batches {
       _ = enc.EncodeValue(batch) // scratch reused across iterations
       write(enc.Bytes())
   }

Marshal the whole []struct, not vector-by-vector. When the element has a []float32/[]float64 field, qdf gathers every row's vector into one count-N column block (wire tag 0xFE) instead of one block per row. That amortizes the block header and the rANS frequency framing across the batch — the per-row form costs ~290 B/vec vs ~176 B/vec batched on a 256-dim corpus. So the headline numbers are the ones you actually get in production, because you marshal the batch. (Needs ≥ 16 rows with the same vector length.)

If you re-encode the same string-column shape repeatedly (same URL space, same log
format alongside the vectors), train an FSSTDict once and reuse it — it skips
the per-batch symbol-table training, ~5× faster encode.

Read path: three ways to decode, pick by buffer ownership

This is the lever most people miss. The default Unmarshal copies each string
field into its own heap allocation — always correct, but a record with seven
string fields pays seven allocations and the GC then scans seven objects. There
are two cheaper paths, and which one is safe depends entirely on who owns the
wire buffer and how long it lives.

WithArena — copy once, packed. Bump-appends every decoded string into one contiguous block per epoch instead of N separate allocations. The strings are byte-identical; only where they live changes. Across a batch the block amortizes to ~0 allocations, the strings sit cache-adjacent, and the GC walks one object instead of N. Measured −26…−35 % decode time on string-heavy corpora (4,856 → 605 allocs/op on an AD-style export). It is safe with a recycled wire buffer — because it copies the strings out, you can hand the buffer straight back to a pool. This is the right default for a server handler or a streaming consumer.

  a := qdf.NewArena()
  for msg := range stream {
      var rows []Doc
      _ = qdf.Unmarshal(msg, &rows, qdf.WithArena(a))
      use(rows)
      a.Reset() // only after every value from the last decode is dead
  }

Reset is a manual use-after-free contract: call it only once everything
decoded since the last reset is dead. If you can't reason about that, drop the
arena and call NewArena() again; never calling Reset is always safe (the block
is plain GC memory).
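
The "copy once, packed" idea can be illustrated in safe Go without any library support. This is a two-pass sketch, not qdf's arena: it copies every field into a single block and hands back substrings that share that one allocation, so the source buffer can be recycled immediately. The name packStrings is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// packStrings copies every field into one contiguous block and returns
// substrings of it. The N strings share a single backing allocation and
// do not alias the input, so the input buffer can be reused afterwards.
func packStrings(fields [][]byte) []string {
	total := 0
	for _, f := range fields {
		total += len(f)
	}
	var sb strings.Builder
	sb.Grow(total) // one allocation for all string bytes
	for _, f := range fields {
		sb.Write(f)
	}
	block := sb.String() // Builder hands over its buffer without copying
	out := make([]string, len(fields))
	off := 0
	for i, f := range fields {
		out[i] = block[off : off+len(f)]
		off += len(f)
	}
	return out
}

func main() {
	wire := []byte("eu-west-1GET/health") // stands in for a pooled wire buffer
	got := packStrings([][]byte{wire[0:9], wire[9:12], wire[12:19]})
	for i := range wire {
		wire[i] = 0 // the pool reuses the buffer
	}
	fmt.Println(got) // still intact: the strings live in their own block
}
```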

WithNoCopy — zero copy. Decoded strings/[]byte alias the input buffer directly: zero copies, zero allocations, ~1.7× faster. The catch is the lifetime — the values are valid only while the input stays alive and unmodified. Use it on owned, long-lived, read-only input. Never on a pooled/recycled buffer (a server request body): the aliased values become silent garbage when the buffer is reused — a use-after-free the race detector won't catch.

The decision is mechanical: recycled buffer → WithArena; owned long-lived
buffer → WithNoCopy; unsure → default copy.

The flagship pattern: an mmap'd, zero-copy embedding store

WithNoCopy's "owned, long-lived, read-only" requirement is exactly what an
mmap'd file is — which makes it the natural backing for a write-once / read-many
embedding index. You marshal the whole corpus into one self-describing .qdf
file once; readers mmap it and Unmarshal with WithNoCopy, serving vectors
straight out of the page cache with no per-read allocation and no copy.

// Writer — once, offline.
enc := qdf.NewEncoderWith(qdf.OptBalanced | qdf.OptLossyVec)
enc.SetVectorBudget(qdf.MinCosine(0.999))
_ = enc.EncodeValue(corpus)        // []Doc{ID, Title, Emb []float32}
_ = os.WriteFile("index.qdf", enc.Bytes(), 0o644)

// Reader — many times, hot.
f, _ := os.Open("index.qdf")
defer f.Close()
fi, _ := f.Stat() // the mapping length comes from the file size
buf, _ := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
defer syscall.Munmap(buf)

var docs []Doc
_ = qdf.Unmarshal(buf, &docs, qdf.WithNoCopy()) // strings alias the mmap; vectors materialize
// docs[i].Emb is the approximated vector; scan / ANN over it directly.

The vector field itself is reconstructed (the lossy decode allocates the output
slice — there's nothing to alias), but the ID/Title metadata and any other
string columns cost zero. The whole index is one file, one mmap, one Unmarshal
— no second store, no schema sidecar. Keep the mmap mapped for as long as you
read docs; Munmap only after they're done.

Choosing and validating the budget

The budget knob is the one parameter that actually matters, so don't guess it:

You reason about…	Knob	Note
ANN recall (cosine / dot-product)	`MinCosine(0.999)`	bounds the metric the index uses — start here for RAG
reconstruction error directly	`MaxRelError(0.01)`	per-vector L2; tighter `eps` ⇒ more bytes
signal-style data (audio, sensor)	`TargetSNR(db)`	dB framing

A looser budget is smaller and faster. The discipline: pick the loosest
budget your downstream task tolerates, then verify recall@k on a held-out query
set — encode the corpus, decode it, and confirm the top-k neighbours of your
eval queries are unchanged (the Example_aiEmbeddingStore test does exactly this
top-1 check). Tighten the budget only if recall actually drops. Because the codec
is never-worse, the failure mode of an over-tight budget is "no smaller than
lossless," not "corrupt" — you lose the size win, not correctness.
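
The recall@k check described above can be sketched as a brute-force comparison between the original and the decoded corpus. The helper names are illustrative, and the sketch assumes L2-normalized vectors so that dot product ranks the same way as cosine.

```go
package main

import (
	"fmt"
	"sort"
)

// dot returns the dot product over the shared prefix of a and b.
func dot(a, b []float32) float64 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var s float64
	for i := 0; i < n; i++ {
		s += float64(a[i]) * float64(b[i])
	}
	return s
}

// topK returns the indices of the k corpus vectors with the highest dot
// product against q. Ties are broken by lower index.
func topK(corpus [][]float32, q []float32, k int) []int {
	ids := make([]int, len(corpus))
	scores := make([]float64, len(corpus))
	for i, v := range corpus {
		ids[i] = i
		scores[i] = dot(v, q)
	}
	sort.SliceStable(ids, func(a, b int) bool { return scores[ids[a]] > scores[ids[b]] })
	if k > len(ids) {
		k = len(ids)
	}
	if k < 0 {
		k = 0
	}
	return ids[:k]
}

// recallAtK returns the fraction of exact top-k neighbours (from orig) that
// also appear in the top-k computed over decoded, averaged over all queries.
func recallAtK(orig, decoded, queries [][]float32, k int) float64 {
	var hits, total int
	for _, q := range queries {
		want := map[int]bool{}
		for _, id := range topK(orig, q, k) {
			want[id] = true
		}
		for _, id := range topK(decoded, q, k) {
			if want[id] {
				hits++
			}
		}
		total += len(want)
	}
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	orig := [][]float32{{1, 0}, {0, 1}, {0.6, 0.8}}
	decoded := [][]float32{{0.99, 0.05}, {0.02, 1}, {0.61, 0.79}}
	queries := [][]float32{{1, 0}, {0, 1}}
	fmt.Printf("recall@2: %.2f\n", recallAtK(orig, decoded, queries, 2))
}
```

Run it with the loosest budget first and tighten only if the number drops below what your task needs.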

A short checklist

[ ] OptLossyVec only on search/embedding vectors — never exact floats.
[ ] Marshal the []struct batch (≥ 16 rows) so the vector column batches.
[ ] Reuse a *qdf.Encoder if you encode in a loop.
[ ] Decode: WithArena for pooled buffers, WithNoCopy for mmap, copy if unsure.
[ ] Verify recall@k on held-out queries; loosen the budget to the floor your task allows.

A note on trust

I deliberately sourced every figure from committed artifacts:

pipeline & wire format → docs/LOSSY-VECTOR.md
decode paths (arena / zero-copy) → docs/ARENA.md
rate-distortion / recall rows → cmd/qdf-vecbench/rd.csv, generated by the qdf-vecbench tool
runnable end-to-end → Example_aiEmbeddingStore in example_lossyvec_test.go
reproduce → go run github.com/alex60217101990/qdf/cmd/qdf-vecbench@latest -synthetic -n 2000 -dim 256

The benchmark is synthetic Gaussian data, which is the friendly case for every
method here; the relative ordering (qdf < TurboQuant < naive at equal recall) is
the structural result and it holds because it comes from the algorithm, not the
corpus. On real embeddings the absolute bytes shift, but the entropy + lattice
gain that puts qdf's curve to the left does not.

Takeaways

Embeddings dominate vector-DB storage and they don't need bit-exactness — lossy quantization is the right tool, and it can live inside your serializer instead of a second system.
qdf's codec pulls both practical quantization levers (E8 granular gain + rANS entropy gain) on top of the TurboQuant-style rotation — which is why it's Pareto-better than rotated-scalar at equal recall, and why the next idea (Leech) measured worse.
It's an honest CPU-for-size trade: lower throughput, smaller wire, near-zero steady-state allocations. Right for write-once / read-many stores.
The size win is only realized if you decode right: WithArena for pooled buffers, WithNoCopy over an mmap'd file for a zero-copy read-many store. Decode is allocation-bound — where the bytes land matters as much as how few.
Versus the alternatives it's the training-free, single-blob, self-describing option: it won't beat a tuned PQ index on raw bytes or raw float32 on encode speed, but it's the only one with metadata + vector in one Marshal/Unmarshal and a never-worse / exact-by-default correctness floor.
One flag, one blob, no schema, never-worse. qdf.OptLossyVec.

qdf: a Go serializer that decodes less, packs harder, and lets you query the bytes

Aleksandr Yershov — Wed, 03 Jun 2026 12:02:53 +0000

TL;DR for the impatient. qdf is a schemaless Go serializer (struct tags, no .proto). On real batches it's up to 68% smaller than protobuf, decodes 4–9× faster than encoding/json, ships hand-written AVX2/NEON bit-packing at ~50 GB/s, and does one thing no other mainstream Go serializer does: it can run SELECT … WHERE … over a []byte and decode only the columns and rows you asked for. Pure Go, zero dependencies. github.com/alex60217101990/qdf

This is the engineering deep-dive, not the marketing page. We're going to look at actual hexdumps, the codec picker's never-larger guarantee, the twin-bitmask three-valued predicate engine, and a profiler-driven argument about why your decode path is slow for a reason you probably haven't measured. If you write Go services that serialize the same five shapes forever — logs, events, metrics, RTB bids, OTLP spans — this is for you.

The problem nobody's format actually solves

Every serializer makes you pick two of three:

	schemaless	small wire	fast / cheap
`encoding/json`	✅	❌	❌ (allocates a mountain)
msgpack	✅	⚠️ (per-record)	⚠️
protobuf / flatbuffers	❌ (`.proto` + codegen)	✅	✅

JSON is universal and schemaless and burns CPU and GC like it's free. msgpack is smaller but you still decode the whole blob to read one field. protobuf and flatbuffers are fast and compact — right up until you're maintaining .proto files and a codegen step for what used to be a plain struct.

qdf is an attempt to refuse the tradeoff: self-describing wire (decode straight into a struct, no schema), protobuf-class sizes on batches, genuinely extreme decode speed, and a columnar mode you can query. Let's see how, byte by byte.

type Event struct {
    TS    int64  `qdf:"ts"`
    Level string `qdf:"level"`
    Code  int32  `qdf:"code"`
}

b, _ := qdf.Marshal(events, qdf.OptBalanced) // []Event -> []byte
var back []Event
_ = qdf.Unmarshal(b, &back)

Struct tags name fields, exactly like json:. No registry, no generated types to keep in sync. The decoder figures out mode, codecs and compression from the wire itself — you never pass options to Unmarshal.

1. The wire format in one look

A qdf buffer is a 5-byte header + a tagged body. That's the whole envelope.

51  44  46    01    XX       [ tagged body … ]
'Q' 'D' 'F'   ver   flags    bytes 5 … N

The flags byte is a tiny bitmap telling the decoder which dialect the body speaks, so it can fast-path or reject before parsing a single value:

FlagDense (0x01) — body uses the Dense intern dialect (back-reference tags).
FlagQPack (0x02) — body may carry the QPack numeric/bool codec tags.
FlagRANS (0x04) — body is rANS-compressed; decompress first.
FlagColIndex (0x08) — a columnar payload carries a per-column length index (this is what makes selective decode an O(1) skip).

The base tag space is msgpack-shaped — fixint, fixstr, fixarr, typed scalars, str/bin/arr/map in 8/16/32 widths, negfixint. On top of that sit the Dense back-reference tags and the QPack codec tags. That base layer is why a Fast-mode qdf buffer is about as small as msgpack and just as quick; the extra tags are where qdf pulls ahead on batches.

An actual buffer, byte for byte

Encode one &Event{TS:7, Level:"ERR", Code:500} with OptSpeed → 29 bytes, every one accounted for:

51 44 46 01 00              QDF, ver 1, flags 0x00 (Fast)
d5 03                       map, 3 fields
82 74 73 07                 "ts"  -> fixint 7
85 6c 65 76 65 6c 83 45 52 52   "level" -> fixstr "ERR"
84 63 6f 64 65 c4 f4 01     "code" -> uint16 0x01F4 (500)

Two details that tell you how the encoder thinks:

It picked the narrowest tag that holds the value. 500 went out as a 2-byte uint16, not a 4-byte int32. The picker always reaches for the smallest tag, per value.
There's no schema anywhere. The keys ts/level/code are in the bytes. That's the cost of being schemaless on a single message — and exactly what Dense mode erases on a batch.

Flip to OptBalanced on a slice of these and the repeated keys (ts/level/code) and repeated values ("ERR") collapse to 1-byte back-references after first sight. Which brings us to the encoder.

2. Encode: it measures, then packs

qdf doesn't pick one scheme and pray. The encode pipeline:

value → typeDesc cache → columnar transpose → per-column codec picker
      → Dense intern → rANS (opt-in) → []byte

Reflection runs once per type, ever. The first call for a type builds a type descriptor — a flat array of encode/decode closures over unsafe field offsets — and caches it in a sync.Map. Every later call touches only those closures: no reflect.Value churn, no per-field type switch on the hot path.

The codec picker and the never-larger rule

For every numeric/bool slice the encoder runs a cheap bounded probe and emits the smallest of a family. The comparison includes the raw form, so if nothing wins it falls back — turning compression on can never inflate a slice. This "never-larger by construction" property is the whole reason you can flip OptBalanced on blindly.

codec	idea	wins on
FOR	store `value − min`, bit-pack to width of `max−min`	bounded ranges (HTTP codes 200–504 → ~10 bits, not 32)
Delta+FOR	FOR over consecutive differences	monotonic-ish columns: timestamps, IDs, offsets
RLE	`(value, run-length)` pairs	long runs: status, enum, sparse flags
Dictionary	distinct table + bit-packed indices (`ceil(log2 d)` bits/row)	low cardinality, incl. string columns (level, region)
Patched FOR	FOR + an exception list for outliers	mostly-narrow columns with a few spikes
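To make the never-larger rule concrete, here is a minimal sketch of a picker that sizes a few candidates and keeps the smallest, with the raw form always in the running. It is not qdf's picker: the function names and the byte-cost estimates (4-byte base for FOR, 8 bytes per RLE run) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/bits"
)

// forSize estimates FOR: an assumed 4-byte base plus (value-min) residuals
// bit-packed at the width of (max-min).
func forSize(vals []uint32) int {
	if len(vals) == 0 {
		return 0
	}
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	width := bits.Len32(hi - lo)
	return 4 + (width*len(vals)+7)/8
}

// rleSize estimates RLE at an assumed 4-byte value + 4-byte run length per run.
func rleSize(vals []uint32) int {
	runs := 0
	for i, v := range vals {
		if i == 0 || v != vals[i-1] {
			runs++
		}
	}
	return 8 * runs
}

// pick returns the smallest candidate. Raw is the starting point, so the
// chosen size can never exceed the uncompressed size.
func pick(vals []uint32) (string, int) {
	name, size := "raw", 4*len(vals)
	if s := forSize(vals); s < size {
		name, size = "FOR", s
	}
	if s := rleSize(vals); s < size {
		name, size = "RLE", s
	}
	return name, size
}

func main() {
	ports := []uint32{8000, 8003, 8015, 8001}
	name, size := pick(ports)
	fmt.Println(name, size)
}
```

The important line is the first one in `pick`: because the comparison starts from the raw size, a column that no codec helps simply stays raw.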

Delta+FOR, with the actual bytes

Take []int64{1000, 1001, …, 1009} — ten 8-byte integers, 80 bytes raw. Marshal(ints, OptQPack) gives 12 bytes total:

00000000  51 44 46 01 02 e6 07 00  d0 0f 02 0a   |QDF.........|

Header is 5 bytes (flags 0x02 = QPack), so the body is 7 bytes for ten int64s. Codec 0xE6 = Delta+FOR: it stored the first value, the minimum delta, and the residual deltas bit-packed. Since every delta is exactly 1, the residuals collapse to almost nothing.

That's the mechanism behind the headline 512× compression on monotonic timestamp vectors — a clock column is the perfect case: large absolute values, tiny constant deltas.
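Why the residuals "collapse to almost nothing" is easy to see in a sketch. The code below only computes the three quantities the paragraph names (first value, minimum delta, residual bit width); the struct and function names are hypothetical, it ignores int64 overflow, and it makes no claim about qdf's actual byte layout.

```go
package main

import (
	"fmt"
	"math/bits"
)

// deltaFOR summarises a Delta+FOR encoding of an int64 column.
type deltaFOR struct {
	First    int64 // first value, stored verbatim
	MinDelta int64 // smallest consecutive difference
	Width    int   // bits per (delta - MinDelta) residual
	N        int   // number of values
}

// encodeDeltaFOR computes the summary. Overflow of the subtraction is not
// handled in this sketch.
func encodeDeltaFOR(vals []int64) deltaFOR {
	out := deltaFOR{N: len(vals)}
	if len(vals) == 0 {
		return out
	}
	out.First = vals[0]
	if len(vals) == 1 {
		return out
	}
	minD, maxD := vals[1]-vals[0], vals[1]-vals[0]
	for i := 2; i < len(vals); i++ {
		d := vals[i] - vals[i-1]
		if d < minD {
			minD = d
		}
		if d > maxD {
			maxD = d
		}
	}
	out.MinDelta = minD
	out.Width = bits.Len64(uint64(maxD - minD))
	return out
}

// packedBytes is the space the bit-packed residuals alone would need.
func (d deltaFOR) packedBytes() int {
	if d.N < 2 {
		return 0
	}
	return (d.Width*(d.N-1) + 7) / 8
}

func main() {
	ints := make([]int64, 10)
	for i := range ints {
		ints[i] = 1000 + int64(i)
	}
	d := encodeDeltaFOR(ints)
	fmt.Printf("%+v residual bytes=%d\n", d, d.packedBytes())
}
```

For 1000..1009 every delta equals the minimum, so the residual width is zero bits and the residual payload is empty. Add a little jitter and the width grows only with the spread of the deltas, not with the size of the timestamps.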

SIMD bit-packing — same wire, faster code

The bit-pack/unpack kernels are hand-written assembly: AVX2 on amd64, NEON on arm64, and they emit byte-identical output to the scalar path. Tests assert scalar ≡ SIMD bit-for-bit. So -tags qdf_simd is purely faster, never a different wire — runtime CPUID gate, scalar fallback on anything without AVX2.

22–53× over scalar at byte-aligned widths
~50 GB/s unpack (memory-bound there, not compute-bound)

If you run OptBalanced/OptCompression over numeric data, this build tag is free money:

go build -tags qdf_simd ./...

Implementation note for the SIMD-curious: the decode kernels lean on VPMOVZX widen-loads and VPBROADCASTQ+VPSRLVQ variable-per-lane shifts (a per-offset shift table picks the bit offset for each lane); encode uses VPSHUFB byte-gather and VPSLLVQ+lane-OR. On arm64, several of those have no direct Plan9 mnemonic and get hand-encoded via WORD. It's the kind of code where "byte-identical to scalar" is a property you test, not hope for.

The four-layer Dense dialect (strings & structure)

Repeated strings and field names are where batch formats bleed. Dense mode stacks four mechanisms so the second occurrence of a value is nearly free. Take []string{"eu-west-1","eu-west-1","eu-west-1"} under OptBalanced — 19 bytes:

00000000  51 44 46 01 03 a3 e0 09  65 75 2d 77 65 73 74 2d  |QDF.....eu-west-|
00000010  31 e8 e8                                          |1..|

bytes	meaning
`51 44 46 01 03`	header, flags `0x03` (Dense | QPack)
`a3`	fixarr, 3 elements
`e0 09 65…31`	1st value: intern declaration — tag + len 9 + `"eu-west-1"`
`e8`	2nd value: one-byte back-reference
`e8`	3rd value: one byte again

First "eu-west-1" costs 11 bytes; each repeat costs 1. That's the whole game on telemetry, where region/service/level repeat across thousands of rows. The four layers producing those one-byte refs:

Intern table — first sight stored, assigned an id; later sights become a varint reference.
Move-to-front — the hot set resolves in 1–2 bytes via a small MRU ring (recent values get the shortest codes).
Markov-0 "same as last" — a value equal to the previous one is a single repeat tag (the e8 above).
Markov-1 pair predictor — if "GET" is usually followed by "/health", the predicted successor collapses too.
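Two of those layers, the intern table and the "same as last" tag, are enough to reproduce the byte counts in the dump. The sketch below is illustrative only: the token names are made up, the declaration cost (tag + length byte + body) and repeat cost (1 byte) follow the dump above, and the 2-byte cost for a non-adjacent reference is an assumption.

```go
package main

import "fmt"

// token is one emitted unit in this sketch, with its assumed byte cost.
type token struct {
	Kind string // "decl", "ref", or "same"
	Cost int
}

// internTokens walks a string column and decides, per value, whether it is
// a first-sight declaration, a repeat of the previous value, or a reference
// to an earlier, non-adjacent value.
func internTokens(vals []string) []token {
	ids := map[string]int{}
	out := make([]token, 0, len(vals))
	for i, v := range vals {
		_, seen := ids[v]
		switch {
		case i > 0 && v == vals[i-1]:
			out = append(out, token{"same", 1}) // Markov-0 repeat tag
		case seen:
			out = append(out, token{"ref", 2}) // assumed: tag + small varint id
		default:
			ids[v] = len(ids)
			out = append(out, token{"decl", 2 + len(v)}) // tag + len + body
		}
	}
	return out
}

func main() {
	total := 0
	for _, tk := range internTokens([]string{"eu-west-1", "eu-west-1", "eu-west-1"}) {
		fmt.Println(tk.Kind, tk.Cost)
		total += tk.Cost
	}
	fmt.Println("total:", total)
}
```

Under these costs the three regions come to 11 + 1 + 1 bytes, matching the value bytes in the hexdump once the header and the array tag are set aside.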

Floats get Gorilla (lossless XOR coding over math.Float64bits; values are compared by bit pattern, never with ==, so NaN, ±Inf and −0.0 round-trip bit-exact) and ALP (decimal-mantissa coding for quantized metrics/prices, with an exception list for anything that doesn't round-trip exactly). The opt-in order-0 rANS pass is the final never-larger squeeze for cold storage.
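The reason a lossless float codec works on bit patterns rather than on float comparison can be shown with the standard library alone. This is a sketch of the XOR idea behind Gorilla-style coding, not qdf's implementation; `xorResidual` and `sameBits` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorResidual XORs two float64 bit patterns and reports how many leading
// and trailing zero bits the result has. Smooth series give long zero runs,
// which is what an XOR coder exploits.
func xorResidual(prev, cur float64) (x uint64, lead, trail int) {
	x = math.Float64bits(prev) ^ math.Float64bits(cur)
	return x, bits.LeadingZeros64(x), bits.TrailingZeros64(x)
}

// sameBits compares bit patterns, the only equality that is safe for
// NaN and signed zero.
func sameBits(a, b float64) bool {
	return math.Float64bits(a) == math.Float64bits(b)
}

func main() {
	negZero := math.Copysign(0, -1)
	nan := math.NaN()

	// == says -0.0 equals 0.0, but the bit patterns differ.
	fmt.Println(negZero == 0, sameBits(negZero, 0))
	// == says NaN differs from itself, but the bit pattern is identical.
	fmt.Println(nan == nan, sameBits(nan, nan))

	x, lead, trail := xorResidual(1.0, 2.0)
	fmt.Printf("%#016x lead=%d trail=%d\n", x, lead, trail)
}
```

A codec that used == to detect "unchanged" would silently turn −0.0 into 0.0 and would never see a repeated NaN as a repeat; comparing `Float64bits` avoids both.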

The structural win (and the gotcha)

Here's why qdf lands smaller than protobuf on real batches: it dedups and compresses across records. protobuf, msgpack, json and flatbuffers encode each record independently, so a repeated string or a smooth float series re-pays its cost every single row. qdf pays once per batch.

Gotcha #1: that cross-record win needs a batch. On a single small message there's nothing to dedup, so OptBalanced ≈ OptSpeed ≈ msgpack in size — use OptSpeed there and skip the Dense bookkeeping.

Gotcha #2: the Dense wire embeds intern/shape ids that depend on emission order, so two semantically-equal payloads can differ byte-for-byte. If you hash or sign the bytes, encode with OptSpeed.

3. The headline: read less than the whole message

Hand qdf a []struct and it transposes rows into columns — think Parquet, but automatic and still self-describing. Each column then gets the codec that fits it: timestamps go Delta+FOR, an enum-ish level goes dictionary, a run-heavy code goes RLE.

rows ([]Event)              columns (each its own codec)
┌────┬───────┬──────┐       ┌──────────┬────────┬──────┐
│ ts │ level │ code │  →    │ ts ts ts │ level… │ code…│
│ …  │  …    │  …   │       │ Delta+FOR│  dict  │ RLE  │
└────┴───────┴──────┘       └──────────┴────────┴──────┘

With OptColumnIndex the encoder also writes, right after the shape declaration, a fixed-width index: one uint32 byte-length per column. That index is the key — it lets the decoder compute each column's start offset and jump straight past any column it doesn't need, without parsing a byte of it.
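The skip itself is just a prefix sum over the lengths. Here is a minimal sketch, assuming the index is n little-endian uint32 values sitting directly in front of the column bytes; the endianness, the placement, and the `columnSpans` name are assumptions for illustration, not the documented wire format.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// columnSpans reads n uint32 lengths from the front of buf and returns the
// [start, end) byte range of every column in the remainder.
func columnSpans(buf []byte, n int) ([][2]int, error) {
	if n < 0 || len(buf) < 4*n {
		return nil, errors.New("column index truncated")
	}
	spans := make([][2]int, n)
	off := 4 * n // column data starts right after the index
	for i := 0; i < n; i++ {
		l := int(binary.LittleEndian.Uint32(buf[4*i:]))
		if l > len(buf)-off {
			return nil, errors.New("column overruns buffer")
		}
		spans[i] = [2]int{off, off + l}
		off += l
	}
	return spans, nil
}

func main() {
	var buf []byte
	cols := []string{"aaaa", "bb", "cccccc"}
	for _, c := range cols {
		buf = binary.LittleEndian.AppendUint32(buf, uint32(len(c)))
	}
	for _, c := range cols {
		buf = append(buf, c...)
	}

	spans, err := columnSpans(buf, len(cols))
	if err != nil {
		panic(err)
	}
	// Read only the third column; the first two are never inspected.
	fmt.Printf("%s\n", buf[spans[2][0]:spans[2][1]])
}
```

Only the index bytes are read to locate a column; the bodies of skipped columns are never touched.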

Querying the bytes

buf, _ := qdf.Marshal(events, qdf.OptBalanced|qdf.OptColumnIndex)

// "SELECT ts, code WHERE level='ERROR' AND code>=500" — over a []byte.
type Hot struct {
    TS   int64 `qdf:"ts"`
    Code int32 `qdf:"code"`
}
var hot []Hot
_ = qdf.Unmarshal(buf, &hot,
    qdf.Where("level", func(s string) bool { return s == "ERROR" }),
    qdf.Where("code",  func(c int32) bool { return c >= 500 }))

What the decoder actually does, in order:

Read the shape + column index. Now it knows where every column starts.
Filter columns — decode only the columns named in a predicate (level, code). Run each predicate across its whole column to produce a per-row bitmask.
Combine the masks (AND here) into the surviving-row set.
Project — for the columns Hot wants (ts, code), materialize values only at the surviving rows. level was read to filter, then dropped because Hot doesn't contain it. Every other column is skipped via the index — its bytes are never parsed.
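The filter-then-project order can be sketched over plain Go slices. This stands in for already-decoded columns and says nothing about how qdf stores or decodes them; `columns`, `hot`, `mask` and `selectHot` are hypothetical names, and all columns are assumed to have the same length.

```go
package main

import "fmt"

// columns is a transposed batch: one slice per field.
type columns struct {
	TS    []int64
	Level []string
	Code  []int32
}

// hot is the projection target: it has no Level field.
type hot struct {
	TS   int64
	Code int32
}

// mask runs pred across a whole column and returns one bool per row.
func mask[T any](col []T, pred func(T) bool) []bool {
	out := make([]bool, len(col))
	for i, v := range col {
		out[i] = pred(v)
	}
	return out
}

// selectHot filters on Level and Code, then materializes TS and Code only
// for the rows that survive both predicates.
func selectHot(c columns) []hot {
	lv := mask(c.Level, func(s string) bool { return s == "ERROR" })
	cd := mask(c.Code, func(v int32) bool { return v >= 500 })
	var out []hot
	for i := range lv {
		if lv[i] && cd[i] {
			out = append(out, hot{TS: c.TS[i], Code: c.Code[i]})
		}
	}
	return out
}

func main() {
	c := columns{
		TS:    []int64{1, 2, 3},
		Level: []string{"INFO", "ERROR", "ERROR"},
		Code:  []int32{200, 404, 503},
	}
	fmt.Println(selectHot(c))
}
```

Level is consulted for the mask and then never copied into the result, which is the "read to filter, then dropped" step from the list.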

The predicate engine: twin bitmasks + SQL three-valued logic

It isn't just AND-of-equals. And, Or, Not compose into a real predicate tree — and the tricky part is nullable columns: in SQL, a comparison against NULL is neither true nor false, it's UNKNOWN. qdf gets this right with twin bitmasks per node: a T mask (rows definitely true) and an F mask (rows definitely false). Anything in neither is UNKNOWN.

Leaf: run the predicate per present row → fills T; F = present &^ T (present-but-not-true). Absent (nil) rows land in neither — UNKNOWN, for free.
AND: T = T₁ & T₂, F = F₁ | F₂ (false if any child is false — even if another is unknown).
OR: T = T₁ | T₂, F = F₁ & F₂.
NOT: swap T and F (unknown stays unknown).

The final result keeps only rows in the root T mask — TRUE, never FALSE, never UNKNOWN — which is exactly SQL WHERE semantics.
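The four rules fit in a few lines if each mask is a single uint64. This is a sketch of the logic only, limited to 64 rows, with hypothetical names (`tri`, `leaf`, `and`, `or`, `not`); a nil pointer plays the role of an absent value.

```go
package main

import "fmt"

// tri holds the twin masks for up to 64 rows. Bit i set in T means row i is
// definitely true, in F definitely false; set in neither means UNKNOWN.
type tri struct{ T, F uint64 }

// leaf evaluates pred on present rows only. Absent rows land in neither mask.
func leaf(vals []*int32, pred func(int32) bool) tri {
	var present, t uint64
	for i, v := range vals {
		if v == nil {
			continue
		}
		present |= 1 << uint(i)
		if pred(*v) {
			t |= 1 << uint(i)
		}
	}
	return tri{T: t, F: present &^ t}
}

func and(a, b tri) tri { return tri{T: a.T & b.T, F: a.F | b.F} }
func or(a, b tri) tri  { return tri{T: a.T | b.T, F: a.F & b.F} }
func not(a tri) tri    { return tri{T: a.F, F: a.T} }

func main() {
	a, c := int32(500), int32(200)
	codes := []*int32{&a, nil, &c} // row 1 is NULL

	p := leaf(codes, func(v int32) bool { return v >= 500 })
	fmt.Printf("p:          T=%03b F=%03b\n", p.T, p.F)

	// "p OR NOT p" is still UNKNOWN for the NULL row, as in SQL.
	q := or(p, not(p))
	fmt.Printf("p OR NOT p: T=%03b F=%03b\n", q.T, q.F)
}
```

The NULL row never shows up in a T mask, not even for `p OR NOT p`, so it can never reach the result set.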

A neat optimization: a subtree with no nullable leaves can't produce UNKNOWN, so qdf skips materializing its F mask entirely and treats "not true" as the complement — one fewer pass over the rows.

_ = qdf.Unmarshal(buf, &hot,
    qdf.Or(
        qdf.Where("level", func(s string) bool { return s == "ERROR" }),
        qdf.And(
            qdf.Where("code", func(c int32) bool { return c >= 500 }),
            qdf.Not(qdf.Where("level", func(s string) bool { return s == "DEBUG" })),
        ),
    ))

The predicate is called once per row against the native typed value — func(int32) bool, func(string) bool — with zero interface boxing. Pure projection without a filter is just Select("ts","code").

No mainstream Go serializer does this. json, msgpack, protobuf, gob — all decode the whole message before you can read one field. For "store a wide batch, read a few columns or filter rows later," qdf is the only one that reads less than everything.

Concretely, on a wide batch at low selectivity (i7-9750H):

~5× faster than full decode (projection)
~5× less memory than full decode
~2.5× faster than decode-everything-then-filter

When it applies: you need OptColumnIndex at encode time, a []struct batch, and flat-ish fields. The bigger and wider the batch and the more selective the query, the larger the win. It's the columnar-warehouse pattern brought to a plain Go []byte — no database, no schema. (It is not for single messages or streaming — that's the row-by-row half of the design.)

4. Decode: the fastest work is the work you skip

Here's the claim that should change how you think about serializer performance:

Profile any serializer's decode and the truth is the same: it's allocation-bound, not CPU-bound.

Run go test -memprofile on a string-heavy decode and look at -alloc_objects. On qdf's row path it's almost entirely one call: (*Decoder).ReadString — copying string bodies out of the buffer into owned Go strings. Tag walking, bounds checks, type dispatch — rounding error. So the levers that matter aren't clever ALU tricks. They're don't allocate and don't decode.

Lever 1 · Zero-copy decode

var out []Event
_ = qdf.Unmarshal(data, &out, qdf.WithNoCopy()) // strings alias data, no copy

WithNoCopy returns strings and byte slices that point into data instead of copying out. On a string-heavy batch: ~1.7× faster, 7000+ allocations collapse to 3 (the per-string copies are gone; what's left is essentially the output slice). The decoder is already pooled and its scratch buffers reused, so with aliasing there's essentially nothing left to allocate per value.

The catch is honest and it's in the name. The returned values are valid only while data stays alive and unmodified. The footgun:

func handler(w http.ResponseWriter, r *http.Request) {
    buf := pool.Get(); defer pool.Put(buf) // recycled!
    io.ReadFull(r.Body, buf)
    var msg Msg
    qdf.Unmarshal(buf, &msg, qdf.WithNoCopy())
    queue <- msg // msg.Field aliases buf … which is about to be reused → garbage
}

That's a use-after-free the race detector won't catch (it's not a data race — it's manual memory). So WithNoCopy is opt-in by design: perfect for read-and-discard over a buffer you own (a file, an mmap, a batch you process then drop), wrong for a pooled request body that outlives the call. Works on the reflection path, codegen, and streams.
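The hazard and the fix can be reproduced without qdf at all, using plain byte-slice aliasing. In this sketch `view` stands in for a field returned by a zero-copy decode and `detach` for the copy you need before the value outlives the buffer; both names are hypothetical.

```go
package main

import (
	"bytes"
	"fmt"
)

// view returns a field that aliases buf, the way a zero-copy decode would.
// The three-index slice caps capacity so appends cannot write into buf.
func view(buf []byte, lo, hi int) []byte { return buf[lo:hi:hi] }

// detach copies an aliased field so it survives reuse of the source buffer.
func detach(field []byte) []byte { return bytes.Clone(field) }

func main() {
	buf := []byte("level=ERR")
	aliased := view(buf, 6, 9)
	owned := detach(aliased)

	// The pool hands buf to the next request, which overwrites it.
	copy(buf, "level=OK!")

	fmt.Printf("aliased=%q owned=%q\n", aliased, owned)
	// prints: aliased="OK!" owned="ERR"
}
```

For string fields the equivalent detach is `strings.Clone`. If a decoded value has to cross a channel or outlive the handler, either decode without WithNoCopy or detach the fields you keep.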

Lever 2 · Decode in struct order

The encoder writes fields in struct-declaration order, so on decode the next wire field is almost always the next struct field. The decoder keeps a cursor and tries the expected field first — one string compare — before falling back to a map lookup. A profile of a wide-struct decode had ~40% of time in mapaccess1_faststr + the hash; the cursor removes that on the common path. The map stays as the fallback, so out-of-order, partial, and unknown fields still decode correctly — you just pay the lookup for the ones that actually arrive out of order.
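A cursor-first lookup is small enough to sketch in full. This illustrates the technique as described, not qdf's decoder: `fieldIndex` and its counters are hypothetical, and resetting the cursor after a map hit is one possible policy among several.

```go
package main

import "fmt"

// fieldIndex resolves wire keys to struct field positions. It tries the
// expected next field first and falls back to a map on a mismatch.
type fieldIndex struct {
	names        []string       // fields in declaration order
	byName       map[string]int // fallback for out-of-order keys
	cursor       int            // position expected next
	hits, misses int            // counters, for illustration only
}

func newFieldIndex(names ...string) *fieldIndex {
	m := make(map[string]int, len(names))
	for i, n := range names {
		m[n] = i
	}
	return &fieldIndex{names: names, byName: m}
}

// lookup returns the field position for key, or -1 if the key is unknown.
func (f *fieldIndex) lookup(key string) int {
	if f.cursor < len(f.names) && f.names[f.cursor] == key {
		f.hits++ // common path: one string compare, no hashing
		f.cursor++
		return f.cursor - 1
	}
	f.misses++
	i, ok := f.byName[key]
	if !ok {
		return -1
	}
	f.cursor = i + 1 // resume expecting the field after this one
	return i
}

func main() {
	idx := newFieldIndex("ts", "level", "code")
	for _, k := range []string{"ts", "level", "code"} {
		fmt.Println(k, "->", idx.lookup(k))
	}
	fmt.Println("hits:", idx.hits, "misses:", idx.misses)
}
```

When keys arrive in declaration order every lookup is a hit; only keys that actually arrive out of order pay for the map.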

Lever 3 · Lazy, pooled state

Decoders come from a sync.Pool, and their machinery — the intern table, scratch slices — allocates only on first use. A plain struct decode never touches the intern table, so it never pays for it. (Concretely: moving that table behind a lazily-allocated pointer cut a chunk of per-call overhead, because the codegen path builds a fresh decoder per nested value and was zeroing ~4 KiB of table it never used.)
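The pattern is generic Go, so a sketch with only the standard library shows the shape. Nothing here mirrors qdf's internal types: `decoder`, `remember`, `reset` and `withDecoder` are hypothetical, and the capacity of 64 is arbitrary.

```go
package main

import (
	"fmt"
	"sync"
)

// decoder carries state that many decodes never need.
type decoder struct {
	intern  []string // nil until a message actually interns something
	scratch []byte
}

// remember allocates the intern table on first use only.
func (d *decoder) remember(s string) int {
	if d.intern == nil {
		d.intern = make([]string, 0, 64)
	}
	d.intern = append(d.intern, s)
	return len(d.intern) - 1
}

// reset keeps the backing arrays so a pooled decoder reuses them.
func (d *decoder) reset() {
	d.intern = d.intern[:0]
	d.scratch = d.scratch[:0]
}

var pool = sync.Pool{New: func() any { return new(decoder) }}

// withDecoder borrows a decoder, runs fn, and returns it to the pool.
func withDecoder(fn func(*decoder)) {
	d := pool.Get().(*decoder)
	defer func() {
		d.reset()
		pool.Put(d)
	}()
	fn(d)
}

func main() {
	withDecoder(func(d *decoder) {
		fmt.Println("plain decode, table allocated:", d.intern != nil)
	})
	withDecoder(func(d *decoder) {
		id := d.remember("eu-west-1")
		fmt.Println("interned id:", id, "capacity:", cap(d.intern))
	})
}
```

A decode that never calls `remember` never pays for the table, and one that does pays once and then reuses the backing array across pooled calls.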

Lever 4 · The biggest win: don't decode at all

Everything from §3 lands here too. Selective decode skips whole columns via the index and never rebuilds filtered rows. If your read pattern is "a few columns of a big batch," the fastest qdf decode is the one that touches almost none of the bytes. No micro-optimization beats not doing the work.

For the last drop: codegen

//go:generate qdfgen -type Event,Batch .

qdfgen emits concrete methods using only the public API — no reflect at runtime, no descriptor lookup. The generated decoder is a flat key switch (and it threads noCopy, so zero-copy works on generated types too):

func (v *Sample) UnmarshalQDFOpts(src []byte, noCopy bool) (int, error) {
    d := qdf.NewDecoderOnBuf(src)
    if noCopy { d.SetNoCopy(true) }
    n, err := d.ReadMapHeader()
    // …
    for i := 0; i < n; i++ {
        kb, _ := d.ReadStringBytes()
        switch string(kb) { // no alloc: compiler special-cases switch string([]byte)
        case "name": { rv, _ := d.ReadString(); v.Name = rv }
        case "age":  { rv, _ := d.ReadInt();    v.Age  = int(rv) }
        // …
        }
    }
}

On a fixed schema that's up to 8.5× faster decode than encoding/json.

And on encode, AppendMarshal hands you buffer ownership for zero per-call allocation:

out = out[:0]
out, _ = qdf.AppendMarshal(out, v, qdf.OptBalanced) // reuse your own buffer

The mental model: encode allocations are constant (a flat 3, output buffer pooled); decode allocations scale with how much you ask for. So the two levers that matter are alias-instead-of-copy (WithNoCopy) and ask-for-less (selective decode).

5. Benchmarks, and how they're measured

2019 i7-9750H, Go 1.26. Wire sizes are deterministic. Latencies are median of 6 runs; throughput claims use benchstat over ≥10 interleaved runs so a single warm/cold run can't lie. Everything reproducible from the bench/ module — a separate module so competitor deps (protobuf, vmihailenco/msgpack, flatbuffers) stay out of the core, which has zero dependencies:

cd bench
go test -run='^$' -bench Decode -benchmem -count=10 | tee new.txt
benchstat old.txt new.txt

Wire size vs the field (bytes, lower is better)

fixture	json	msgpack	protobuf	qdf balanced	qdf compress
OTLP 4×512	1 027 033	793 192	561 860	240 686	179 181
Logs 1024	245 037	193 476	156 479	89 631	62 149
RTB 1024	559 294	428 404	327 700	258 167	203 360
Events 1024	122 857	84 712	64 978	39 650	39 639
IoT 32×256	469 058	224 534	207 562	158 474	148 177

Smaller than protobuf on every batch. With OptCompression (the qdf compress column): OTLP −68%, Logs −60%, Events −39%, RTB −38%, IoT −29%. OptBalanced alone is already −57% on OTLP and −43% on Logs. The gap exists because qdf compresses across records and protobuf doesn't.

Throughput

workload	result
Decode vs `encoding/json`	4–9× faster across payloads (2–7× vs msgpack)
Numeric/bool slices (QPack)	5× smaller than json, 21× faster encode, 80× faster decode
SIMD bit-unpack (AVX2/NEON)	22–53× over scalar, ~50 GB/s (memory-bound)
~150 MiB realistic payload (Dense)	7.5× faster encode, 8.1× faster decode than json
Encode (Fast, pooled)	~1.1 GB/s, 3 allocs/op — vs ~1000 allocs/op for json & msgpack
Zero-copy decode (string batch)	7002 → 3 allocs, −38% B/op, ~1.7× faster
Codegen decode	up to 8.5× over json on a fixed schema
Selective decode (few columns)	~5× faster & ~5× less memory than full decode

Note the asymmetry: encode is a flat 3 allocations no matter the payload; decode allocations scale with how much you ask for — which is exactly why WithNoCopy and selective decode matter.

6. Which knob, when

One Options bitmask on the encode side. You never pass options to Unmarshal — it reads the header and handles whatever it gets.

Option	Reach for it when
`OptSpeed`	Hot path, single messages, sub-µs latency. msgpack-shaped. The drop-in `encoding/json` replacement. Also: use it if you hash/sign the bytes.
`OptBalanced`	Default for batches: Dense interning + adaptive numeric codecs. Big wire win, still fast.
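`OptBalanced | OptColumnIndex`	Batches you'll filter or project later with `Where` / `Select`. Adds the per-column length index that lets selective decode skip unneeded columns.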
`OptCompression`	Cold storage. Adds Gorilla/ALP + rANS. Smallest wire; encode slower — write-once-read-rarely.
`WithNoCopy()`	Read-mostly over a buffer you own and won't mutate. Near-zero-alloc decode.
`AppendMarshal`	Own the output buffer for zero per-call allocation.
`qdfgen`	Fixed schema, every nanosecond counts — reflection-free generated methods.

The presets are just bundles of bits you'd compose by hand anyway:

const (
    OptSpeed       = 0 // Fast mode, nothing on
    OptBalanced    = OptDense | OptQPack | OptShapeIntern | OptPairPred | OptMTF
    OptCompression = OptBalanced | OptGorillaFloat | OptRANS
)

One axis, left to right: lowest CPU → smallest bytes. And every step is never-larger, so moving right never inflates a buffer.

OptSpeed  ──▶  OptBalanced  ──▶  OptCompression
fastest        −43% vs proto     −60% vs proto
≈ msgpack      still fast        smallest, slower encode

The same Logs-1024 batch, measured: json 245 KB → msgpack 193 KB → protobuf 156 KB → OptBalanced 90 KB → OptCompression 62 KB.

Two build tags — free performance, off by default

Orthogonal to Options: these change the generated machine code, not the wire. Same bytes, faster processing.

-tags qdf_simd — AVX2 (amd64) / NEON (arm64) bit-pack kernels, byte-identical output, runtime CPUID gate + scalar fallback. 22–53× over scalar. If you run OptBalanced/OptCompression on numeric data, turn it on — it's free.
-tags qdf_reflect2 — swaps reflect.MakeSlice/MakeMapWithSize/New for modern-go/reflect2 unsafe equivalents → smaller decode allocations on map/slice-heavy payloads. The one honesty note: this is the single opt-out from zero-dependency. Worth it if your data is map/slice-dense and you're not on codegen.

go build -tags "qdf_simd qdf_reflect2" ./...   # combine freely

7. Streaming

enc := qdf.NewStreamEncoder(w, qdf.Dense)
for _, ev := range events { _ = enc.Encode(&ev) }
enc.Close()

dec := qdf.NewStreamDecoder(r)
for {
    var ev Event
    if err := dec.Decode(&ev); err == io.EOF { break } else if err != nil { return err }
}

The header is written once; the Dense intern table is shared across messages, so a 10k-row log pays for each distinct key once (the second message's "region":"eu-west-1" is a back-reference into the first). Each message is length-framed — a uvarint byte-count precedes its body — so a message of any size round-trips, even across a reader that hands you one byte per Read, and io.EOF marks the end cleanly. SetNoCopy works here too; aliases stay valid for the stream's lifetime because the window is never compacted.

QDF hdr   │ len₁ · msg₁ │ len₂ · msg₂ │ … EOF
5B once   │ uvarint+body│ uvarint+body│
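The framing in that diagram is straightforward to sketch with encoding/binary. This covers only the length-prefixed messages, not the 5-byte stream header or any qdf payload; `writeFrame`, `readFrame`, `roundTrip` and the `maxFrame` limit are hypothetical names introduced for the example.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
	"testing/iotest"
)

const maxFrame = 1 << 20 // assumed sanity limit for this sketch

// writeFrame writes a uvarint byte count followed by the message body.
func writeFrame(w io.Writer, msg []byte) error {
	hdr := binary.AppendUvarint(nil, uint64(len(msg)))
	if _, err := w.Write(hdr); err != nil {
		return err
	}
	_, err := w.Write(msg)
	return err
}

// readFrame returns the next message, or io.EOF at a clean end of stream.
func readFrame(r *bufio.Reader) ([]byte, error) {
	n, err := binary.ReadUvarint(r) // io.EOF only if no bytes were read
	if err != nil {
		return nil, err
	}
	if n > maxFrame {
		return nil, errors.New("frame too large")
	}
	msg := make([]byte, n)
	if _, err := io.ReadFull(r, msg); err != nil {
		if err == io.EOF {
			err = io.ErrUnexpectedEOF // length promised a body that is missing
		}
		return nil, err
	}
	return msg, nil
}

// roundTrip frames msgs, then reads them back through a reader that
// returns a single byte per Read call.
func roundTrip(msgs ...string) ([]string, error) {
	var buf bytes.Buffer
	for _, m := range msgs {
		if err := writeFrame(&buf, []byte(m)); err != nil {
			return nil, err
		}
	}
	r := bufio.NewReader(iotest.OneByteReader(&buf))
	var out []string
	for {
		msg, err := readFrame(r)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return nil, err
		}
		out = append(out, string(msg))
	}
}

func main() {
	out, err := roundTrip("first", "", "a longer third message")
	fmt.Println(len(out), err)
}
```

Because the reader always knows how many bytes the next body needs, it does not matter how the transport chops the stream up, and a clean io.EOF can only occur on a frame boundary.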

Streaming and columnar are the two halves of the design: streaming is row-by-row for unbounded feeds; columnar is a complete batch you can query. So the whole-batch features — OptColumnIndex, Where/Select, OptRANS — aren't part of streaming, by design.

8. Where it doesn't win (the honest part)

OptSpeed wire ≈ msgpack — the speed tier skips columnar compression on purpose. Use OptBalanced when you want the bytes back.
The compression tier's encode is slower (Gorilla/ALP cost real CPU). It's a storage play, not a hot path.
protobuf and flatbuffers still win raw single-message decode and single-tiny-message size — generated code and zero-copy field access are hard to beat when there's no batch to amortize over. Different tool for "one small message, decoded whole, hot."

qdf's sweet spot is batches of structured records you want small on the wire and partially readable later: telemetry, logging, metrics, analytics, event sourcing.

Try it

go get github.com/alex60217101990/qdf

Pure Go, zero dependencies — nothing to vendor, no schema compiler in your pipeline. Swap it in where you use encoding/json, flip a batch path to OptBalanced|OptColumnIndex, read back just the columns you need — then go stare at your allocation graph.

Repo: github.com/alex60217101990/qdf
Full API reference: pkg.go.dev/github.com/alex60217101990/qdf

If the query model or the codec picker is useful to you, a ⭐ on the repo helps others find it. And if you find a payload shape where qdf loses that it shouldn't — open an issue with the fixture. That's the most useful bug report there is.