<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aleksandr Yershov</title>
    <description>The latest articles on DEV Community by Aleksandr Yershov (@alex_602).</description>
    <link>https://dev.to/alex_602</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3966198%2F7057f0fe-2444-4505-be34-25d0b7615bdf.png</url>
      <title>DEV Community: Aleksandr Yershov</title>
      <link>https://dev.to/alex_602</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alex_602"/>
    <language>en</language>
    <item>
      <title>qdf: a Go serializer that decodes less, packs harder, and lets you query the bytes</title>
      <dc:creator>Aleksandr Yershov</dc:creator>
      <pubDate>Wed, 03 Jun 2026 12:02:53 +0000</pubDate>
      <link>https://dev.to/alex_602/qdf-a-go-serializer-that-decodes-less-packs-harder-and-lets-you-query-the-bytes-2a39</link>
      <guid>https://dev.to/alex_602/qdf-a-go-serializer-that-decodes-less-packs-harder-and-lets-you-query-the-bytes-2a39</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR for the impatient.&lt;/strong&gt; &lt;code&gt;qdf&lt;/code&gt; is a schemaless Go serializer (struct tags, no &lt;code&gt;.proto&lt;/code&gt;). On real batches it's up to &lt;strong&gt;68% smaller than protobuf&lt;/strong&gt;, decodes &lt;strong&gt;4–9× faster than &lt;code&gt;encoding/json&lt;/code&gt;&lt;/strong&gt;, ships hand-written &lt;strong&gt;AVX2/NEON&lt;/strong&gt; bit-packing at ~50 GB/s, and does one thing no other mainstream Go serializer does: it can run &lt;code&gt;SELECT … WHERE …&lt;/code&gt; over a &lt;code&gt;[]byte&lt;/code&gt; and &lt;strong&gt;decode only the columns and rows you asked for&lt;/strong&gt;. Pure Go, zero dependencies. &lt;a href="https://github.com/alex60217101990/qdf" rel="noopener noreferrer"&gt;&lt;code&gt;github.com/alex60217101990/qdf&lt;/code&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is the engineering deep-dive, not the marketing page. We're going to look at actual hexdumps, the codec picker's never-larger guarantee, the twin-bitmask three-valued predicate engine, and a profiler-driven argument about why your decode path is slow for a reason you probably haven't measured. If you write Go services that serialize the same five shapes forever — logs, events, metrics, RTB bids, OTLP spans — this is for you.&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody's format actually solves
&lt;/h2&gt;

&lt;p&gt;Every mainstream serializer makes you give up at least one of three properties:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;schemaless&lt;/th&gt;
&lt;th&gt;small wire&lt;/th&gt;
&lt;th&gt;fast / cheap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;encoding/json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌ (allocates a mountain)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;msgpack&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️ (per-record)&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;protobuf / flatbuffers&lt;/td&gt;
&lt;td&gt;❌ (&lt;code&gt;.proto&lt;/code&gt; + codegen)&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;JSON is universal and schemaless and burns CPU and GC like it's free. msgpack is smaller but you still decode the &lt;em&gt;whole&lt;/em&gt; blob to read one field. protobuf and flatbuffers are fast and compact — right up until you're maintaining &lt;code&gt;.proto&lt;/code&gt; files and a codegen step for what used to be a plain struct.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;qdf&lt;/code&gt; is an attempt to refuse the tradeoff: &lt;strong&gt;self-describing wire&lt;/strong&gt; (decode straight into a struct, no schema), &lt;strong&gt;protobuf-class sizes on batches&lt;/strong&gt;, genuinely extreme decode speed, &lt;strong&gt;and&lt;/strong&gt; a columnar mode you can query. Let's see how, byte by byte.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TS&lt;/span&gt;    &lt;span class="kt"&gt;int64&lt;/span&gt;  &lt;span class="s"&gt;`qdf:"ts"`&lt;/span&gt;
    &lt;span class="n"&gt;Level&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt; &lt;span class="s"&gt;`qdf:"level"`&lt;/span&gt;
    &lt;span class="n"&gt;Code&lt;/span&gt;  &lt;span class="kt"&gt;int32&lt;/span&gt;  &lt;span class="s"&gt;`qdf:"code"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OptBalanced&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// []Event -&amp;gt; []byte&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;back&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;back&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Struct tags name fields, exactly like &lt;code&gt;json:&lt;/code&gt;. No registry, no generated types to keep in sync. &lt;strong&gt;The decoder figures out mode, codecs and compression from the wire itself&lt;/strong&gt;, so you never pass encoding options to &lt;code&gt;Unmarshal&lt;/code&gt; (query predicates such as &lt;code&gt;qdf.Where&lt;/code&gt; are a separate, optional argument).&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The wire format in one look
&lt;/h2&gt;

&lt;p&gt;A qdf buffer is a &lt;strong&gt;5-byte header + a tagged body&lt;/strong&gt;. That's the whole envelope.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;51 44 46   01    XX        [ tagged body … ]
'Q''D''F'  ver   flags     bytes 5 … N
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumzq102nhrqcv9068jis.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fumzq102nhrqcv9068jis.png" alt="qdf wire format: 5-byte header + tagged body" width="799" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The flags byte is a tiny bitmap telling the decoder which &lt;em&gt;dialect&lt;/em&gt; the body speaks, so it can fast-path or reject before parsing a single value:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;FlagDense&lt;/code&gt; (&lt;code&gt;0x01&lt;/code&gt;) — body uses the Dense intern dialect (back-reference tags).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FlagQPack&lt;/code&gt; (&lt;code&gt;0x02&lt;/code&gt;) — body may carry the QPack numeric/bool codec tags.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FlagRANS&lt;/code&gt; (&lt;code&gt;0x04&lt;/code&gt;) — body is rANS-compressed; decompress first.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;FlagColIndex&lt;/code&gt; (&lt;code&gt;0x08&lt;/code&gt;) — a columnar payload carries a per-column length index (this is what makes selective decode an O(1) skip).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;base tag space is msgpack-shaped&lt;/strong&gt; — fixint, fixstr, fixarr, typed scalars, str/bin/arr/map in 8/16/32 widths, negfixint. On top of that sit the Dense back-reference tags and the QPack codec tags. That base layer is why a Fast-mode qdf buffer is about as small as msgpack and just as quick; the extra tags are where qdf pulls ahead on batches.&lt;/p&gt;

&lt;h3&gt;
  
  
  An actual buffer, byte for byte
&lt;/h3&gt;

&lt;p&gt;Encode one &lt;code&gt;&amp;amp;Event{TS:7, Level:"ERR", Code:500}&lt;/code&gt; with &lt;code&gt;OptSpeed&lt;/code&gt; → &lt;strong&gt;29 bytes&lt;/strong&gt;, every one accounted for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;51 44 46 01 00              QDF, ver 1, flags 0x00 (Fast)
d5 03                       map, 3 fields
82 74 73 07                 "ts"  -&amp;gt; fixint 7
85 6c 65 76 65 6c 83 45 52 52   "level" -&amp;gt; fixstr "ERR"
84 63 6f 64 65 c4 f4 01     "code" -&amp;gt; uint16 0x01F4 (500)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details that tell you how the encoder thinks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;It picked the narrowest tag that holds the value.&lt;/strong&gt; &lt;code&gt;500&lt;/code&gt; went out as a 2-byte &lt;code&gt;uint16&lt;/code&gt;, not a 4-byte &lt;code&gt;int32&lt;/code&gt;. The picker always reaches for the smallest tag, per value.&lt;/li&gt;
&lt;li&gt;There's &lt;strong&gt;no schema anywhere&lt;/strong&gt;. The keys &lt;code&gt;ts&lt;/code&gt;/&lt;code&gt;level&lt;/code&gt;/&lt;code&gt;code&lt;/code&gt; are in the bytes. That's the cost of being schemaless on a single message — and exactly what Dense mode erases on a batch.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Flip to &lt;code&gt;OptBalanced&lt;/code&gt; on a slice of these and the repeated keys (&lt;code&gt;ts&lt;/code&gt;/&lt;code&gt;level&lt;/code&gt;/&lt;code&gt;code&lt;/code&gt;) and repeated values (&lt;code&gt;"ERR"&lt;/code&gt;) collapse to 1-byte back-references after first sight. Which brings us to the encoder.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Encode: it measures, then packs
&lt;/h2&gt;

&lt;p&gt;qdf doesn't pick one scheme and pray. The encode pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value → typeDesc cache → columnar transpose → per-column codec picker
      → Dense intern → rANS (opt-in) → []byte
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte15b5lmq5qaeaobvsd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fte15b5lmq5qaeaobvsd6.png" alt="qdf encode pipeline stages" width="800" height="1948"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reflection runs once per type, ever.&lt;/strong&gt; The first call for a type builds a &lt;em&gt;type descriptor&lt;/em&gt; — a flat array of encode/decode closures over &lt;code&gt;unsafe&lt;/code&gt; field offsets — and caches it in a &lt;code&gt;sync.Map&lt;/code&gt;. Every later call touches only those closures: no &lt;code&gt;reflect.Value&lt;/code&gt; churn, no per-field type switch on the hot path.&lt;/p&gt;

&lt;h3&gt;
  
  
  The codec picker and the never-larger rule
&lt;/h3&gt;

&lt;p&gt;For every numeric/bool slice the encoder runs a &lt;strong&gt;cheap bounded probe&lt;/strong&gt; and emits the smallest encoding from the codec family in the table below. The comparison &lt;strong&gt;includes the raw form&lt;/strong&gt;, so if no codec wins the encoder falls back to raw: &lt;em&gt;turning compression on can never inflate a slice.&lt;/em&gt; This "never-larger by construction" property is the whole reason you can flip &lt;code&gt;OptBalanced&lt;/code&gt; on blindly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;codec&lt;/th&gt;
&lt;th&gt;idea&lt;/th&gt;
&lt;th&gt;wins on&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FOR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;store &lt;code&gt;value − min&lt;/code&gt;, bit-pack to width of &lt;code&gt;max−min&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;bounded ranges (HTTP codes 200–504 span 304, so 9 bits per value, not 32)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Delta+FOR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FOR over consecutive differences&lt;/td&gt;
&lt;td&gt;monotonic-ish columns: timestamps, IDs, offsets&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RLE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;(value, run-length)&lt;/code&gt; pairs&lt;/td&gt;
&lt;td&gt;long runs: status, enum, sparse flags&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dictionary&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;distinct table + bit-packed indices (&lt;code&gt;ceil(log2 d)&lt;/code&gt; bits/row)&lt;/td&gt;
&lt;td&gt;low cardinality, incl. string columns (level, region)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Patched FOR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;FOR + an exception list for outliers&lt;/td&gt;
&lt;td&gt;mostly-narrow columns with a few spikes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
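
&lt;p&gt;To make the FOR row and the never-larger rule concrete, here is a rough sketch. It assumes a made-up FOR layout (4-byte minimum, 1-byte width, then packed residuals); qdf's real framing is not specified in this article, and &lt;code&gt;forBits&lt;/code&gt; and &lt;code&gt;pickSize&lt;/code&gt; are illustrative names, not library functions.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/bits"
)

// forBits returns the per-value bit width a frame-of-reference encoding
// needs: enough bits to hold max-min. Assumes vals is non-empty.
func forBits(vals []int32) int {
	lo, hi := vals[0], vals[0]
	for _, v := range vals {
		if lo > v {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	// Widen to int64 first so the subtraction cannot overflow.
	return bits.Len64(uint64(int64(hi) - int64(lo)))
}

// pickSize compares the hypothetical FOR encoding against the raw form and
// returns whichever is smaller, so the result can never exceed raw.
func pickSize(vals []int32) (codec string, size int) {
	raw := 4 * len(vals)
	// Assumed layout: 4-byte min + 1-byte width + bit-packed residuals.
	packed := 5 + (forBits(vals)*len(vals)+7)/8
	if raw > packed {
		return "FOR", packed
	}
	return "raw", raw
}

func main() {
	codes := []int32{200, 404, 500, 504, 200, 301, 200, 200}
	fmt.Println(forBits(codes))
	fmt.Println(pickSize(codes))
	// A two-value slice spanning the whole int32 range: raw is smaller.
	fmt.Println(pickSize([]int32{-2147483648, 2147483647}))
}
```

&lt;p&gt;Because raw is always one of the candidates, the worst case is "no gain", never "bigger".&lt;/p&gt;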

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzynszovh6hu1y2edwlr8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzynszovh6hu1y2edwlr8.png" alt="QPack codec family and the never-larger picker" width="800" height="549"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Delta+FOR, with the actual bytes
&lt;/h4&gt;

&lt;p&gt;Take &lt;code&gt;[]int64{1000, 1001, …, 1009}&lt;/code&gt; — ten 8-byte integers, &lt;strong&gt;80 bytes raw&lt;/strong&gt;. &lt;code&gt;Marshal(ints, OptQPack)&lt;/code&gt; gives &lt;strong&gt;12 bytes total&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000000  51 44 46 01 02 e6 07 00  d0 0f 02 0a   |QDF.........|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Header is 5 bytes (&lt;code&gt;flags 0x02&lt;/code&gt; = QPack), so the body is &lt;strong&gt;7 bytes for ten int64s&lt;/strong&gt;. Codec &lt;code&gt;0xE6&lt;/code&gt; = Delta+FOR: it stored the first value, the minimum delta, and the residual deltas bit-packed. Since every delta is exactly &lt;code&gt;1&lt;/code&gt;, the residuals collapse to almost nothing.&lt;/p&gt;
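
&lt;p&gt;The arithmetic behind that can be sketched in a few lines. This only computes the three quantities the paragraph names (first value, minimum delta, residual width); it does not reproduce the &lt;code&gt;0xE6&lt;/code&gt; byte layout, and &lt;code&gt;deltaFOR&lt;/code&gt; is a hypothetical helper, not part of qdf.&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math/bits"
)

// deltaFOR summarizes what a Delta+FOR encoder has to store: the first
// value, the minimum delta, and the bit width of each residual
// (delta minus minimum delta). Assumes at least two values and no overflow.
func deltaFOR(vals []int64) (first, minDelta int64, width int) {
	deltas := make([]int64, 0, len(vals)-1)
	for i, v := range vals[1:] {
		// v is vals[i+1], so vals[i] is its predecessor.
		deltas = append(deltas, v-vals[i])
	}
	lo, hi := deltas[0], deltas[0]
	for _, d := range deltas {
		if lo > d {
			lo = d
		}
		if d > hi {
			hi = d
		}
	}
	return vals[0], lo, bits.Len64(uint64(hi - lo))
}

func main() {
	ints := []int64{1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009}
	first, minDelta, width := deltaFOR(ints)
	fmt.Println(first, minDelta, width)
}
```

&lt;p&gt;For a constant-step column every residual is zero, so the residual width is zero bits and only the small fixed fields remain.&lt;/p&gt;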

&lt;p&gt;That's the mechanism behind the headline &lt;strong&gt;512× compression on monotonic timestamp vectors&lt;/strong&gt; — a clock column is the perfect case: large absolute values, tiny constant deltas.&lt;/p&gt;

&lt;h3&gt;
  
  
  SIMD bit-packing — same wire, faster code
&lt;/h3&gt;

&lt;p&gt;The bit-pack/unpack kernels are &lt;strong&gt;hand-written assembly: AVX2 on amd64, NEON on arm64&lt;/strong&gt;, and they emit &lt;strong&gt;byte-identical output to the scalar path&lt;/strong&gt;. Tests assert &lt;code&gt;scalar ≡ SIMD&lt;/code&gt; bit-for-bit. So &lt;code&gt;-tags qdf_simd&lt;/code&gt; changes only speed, never the wire. On amd64 a runtime CPUID check gates the AVX2 kernels, and any CPU without AVX2 falls back to the scalar path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;22–53× over scalar&lt;/strong&gt; at byte-aligned widths&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~50 GB/s&lt;/strong&gt; unpack (memory-bound there, not compute-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you run &lt;code&gt;OptBalanced&lt;/code&gt;/&lt;code&gt;OptCompression&lt;/code&gt; over numeric data, this build tag is free money:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go build &lt;span class="nt"&gt;-tags&lt;/span&gt; qdf_simd ./...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Implementation note for the SIMD-curious: the decode kernels lean on &lt;code&gt;VPMOVZX&lt;/code&gt; widen-loads and &lt;code&gt;VPBROADCASTQ&lt;/code&gt;+&lt;code&gt;VPSRLVQ&lt;/code&gt; variable-per-lane shifts (a per-offset shift table picks the bit offset for each lane); encode uses &lt;code&gt;VPSHUFB&lt;/code&gt; byte-gather and &lt;code&gt;VPSLLVQ&lt;/code&gt;+lane-OR. On arm64, several of the equivalent NEON instructions have no mnemonic in Go's Plan 9-style assembler and get hand-encoded via &lt;code&gt;WORD&lt;/code&gt;. It's the kind of code where "byte-identical to scalar" is a property you &lt;em&gt;test&lt;/em&gt;, not hope for.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3x0n8y5lf2kyb57s4gc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu3x0n8y5lf2kyb57s4gc.png" alt="qdf_simd build tag: AVX2 (amd64) and NEON (arm64) kernels per operation, with a pure-Go scalar fallback" width="800" height="855"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The four-layer Dense dialect (strings &amp;amp; structure)
&lt;/h3&gt;

&lt;p&gt;Repeated strings and field names are where batch formats bleed. Dense mode stacks four mechanisms so the &lt;em&gt;second&lt;/em&gt; occurrence of a value is nearly free. Take &lt;code&gt;[]string{"eu-west-1","eu-west-1","eu-west-1"}&lt;/code&gt; under &lt;code&gt;OptBalanced&lt;/code&gt; — &lt;strong&gt;19 bytes&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;00000000  51 44 46 01 03 a3 e0 09  65 75 2d 77 65 73 74 2d  |QDF.....eu-west-|
00000010  31 e8 e8                                          |1..|
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;bytes&lt;/th&gt;
&lt;th&gt;meaning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;51 44 46 01 03&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;header, flags &lt;code&gt;0x03&lt;/code&gt; (Dense | QPack)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;a3&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;fixarr, 3 elements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;e0 09 65…31&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1st value: intern declaration — tag + len 9 + &lt;code&gt;"eu-west-1"&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;e8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2nd value: one-byte back-reference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;e8&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3rd value: one byte again&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbylfelh6i5o27qpbawax.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbylfelh6i5o27qpbawax.png" alt="Dense interning: first sight stored, repeats become 1-byte back-references" width="800" height="1545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The first &lt;code&gt;"eu-west-1"&lt;/code&gt; costs 11 bytes (tag, length byte, 9 bytes of text); each repeat costs &lt;strong&gt;1&lt;/strong&gt;. That's the whole game on telemetry, where &lt;code&gt;region&lt;/code&gt;/&lt;code&gt;service&lt;/code&gt;/&lt;code&gt;level&lt;/code&gt; repeat across thousands of rows. The four layers producing those one-byte refs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Intern table&lt;/strong&gt; — first sight stored, assigned an id; later sights become a varint reference.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Move-to-front&lt;/strong&gt; — the hot set resolves in 1–2 bytes via a small MRU ring (recent values get the shortest codes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markov-0 "same as last"&lt;/strong&gt; — a value equal to the previous one is a single repeat tag (the &lt;code&gt;e8&lt;/code&gt; above).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Markov-1 pair predictor&lt;/strong&gt; — if &lt;code&gt;"GET"&lt;/code&gt; is usually followed by &lt;code&gt;"/health"&lt;/code&gt;, the predicted successor collapses too.&lt;/li&gt;
&lt;/ol&gt;
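
&lt;p&gt;A toy version of two of those layers, the intern table and the "same as last" repeat, can look like the sketch below. It emits symbolic tokens instead of real tag bytes, leaves out move-to-front and the pair predictor entirely, and &lt;code&gt;internTokens&lt;/code&gt; is an invented name for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// internTokens replaces each string with a symbolic token:
//   "decl:X" on first sight (the value is stored and given the next id),
//   "same"   when the value equals the previous one,
//   "ref:N"  when it was seen before but is not the previous value.
func internTokens(vals []string) []string {
	ids := map[string]int{}
	out := make([]string, 0, len(vals))
	for i, v := range vals {
		id, seen := ids[v]
		switch {
		case !seen:
			next := len(ids)
			ids[v] = next
			out = append(out, "decl:"+v)
		case vals[i-1] == v:
			// Safe index: a value can only be "seen" from the second row on.
			out = append(out, "same")
		default:
			out = append(out, fmt.Sprintf("ref:%d", id))
		}
	}
	return out
}

func main() {
	fmt.Println(internTokens([]string{"eu-west-1", "eu-west-1", "eu-west-1"}))
	fmt.Println(internTokens([]string{"GET", "POST", "GET"}))
}
```

&lt;p&gt;The first call mirrors the hexdump above: one declaration followed by two repeat tokens.&lt;/p&gt;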

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jr2psjulj4j0fp7j4l2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4jr2psjulj4j0fp7j4l2.png" alt="The four Dense reference predictors, tried in order: Markov-0 repeat, Markov-1 pair, MTF rank, raw state-ref" width="800" height="2467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Floats get &lt;strong&gt;Gorilla&lt;/strong&gt; (lossless XOR coding over &lt;code&gt;math.Float64bits&lt;/code&gt;; it compares bit patterns rather than using &lt;code&gt;==&lt;/code&gt;, so &lt;code&gt;NaN&lt;/code&gt;/&lt;code&gt;±Inf&lt;/code&gt;/&lt;code&gt;−0.0&lt;/code&gt; round-trip bit-exactly) and &lt;strong&gt;ALP&lt;/strong&gt; (decimal-mantissa for quantized metrics/prices, with an exception list for anything that doesn't round-trip exactly). The opt-in order-0 &lt;strong&gt;rANS&lt;/strong&gt; pass is the final never-larger squeeze for cold storage.&lt;/p&gt;

&lt;h3&gt;
  
  
  The structural win (and the gotcha)
&lt;/h3&gt;

&lt;p&gt;Here's why qdf lands smaller than protobuf on real batches: &lt;strong&gt;it dedups and compresses &lt;em&gt;across&lt;/em&gt; records.&lt;/strong&gt; protobuf, msgpack, JSON and flatbuffers encode each record independently, so a repeated string or a smooth float series re-pays its cost every single row. qdf pays once per batch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gotcha #1:&lt;/strong&gt; that cross-record win &lt;em&gt;needs a batch&lt;/em&gt;. On a single small message there's nothing to dedup, so &lt;code&gt;OptBalanced ≈ OptSpeed ≈ msgpack&lt;/code&gt; in size — use &lt;code&gt;OptSpeed&lt;/code&gt; there and skip the Dense bookkeeping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gotcha #2:&lt;/strong&gt; the Dense wire embeds intern/shape ids that depend on &lt;strong&gt;emission order&lt;/strong&gt;, so two semantically-equal payloads can differ byte-for-byte. If you hash or sign the bytes, encode with &lt;code&gt;OptSpeed&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The headline: read less than the whole message
&lt;/h2&gt;

&lt;p&gt;Hand qdf a &lt;code&gt;[]struct&lt;/code&gt; and it &lt;strong&gt;transposes&lt;/strong&gt; rows into columns — think Parquet, but automatic and still self-describing. Each column then gets the codec that fits it: timestamps go Delta+FOR, an enum-ish &lt;code&gt;level&lt;/code&gt; goes dictionary, a run-heavy &lt;code&gt;code&lt;/code&gt; goes RLE.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rows ([]Event)              columns (each its own codec)
┌────┬───────┬──────┐       ┌──────────┬────────┬──────┐
│ ts │ level │ code │  →    │ ts ts ts │ level… │ code…│
│ …  │  …    │  …   │       │ Delta+FOR│  dict  │ RLE  │
└────┴───────┴──────┘       └──────────┴────────┴──────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i2pqt2lmefqfuxmum9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i2pqt2lmefqfuxmum9e.png" alt="Transpose: []struct rows become per-column codecs plus a length index" width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;OptColumnIndex&lt;/code&gt; the encoder also writes, right after the shape declaration, a &lt;strong&gt;fixed-width index: one &lt;code&gt;uint32&lt;/code&gt; byte-length per column.&lt;/strong&gt; That index is the key — it lets the decoder compute each column's start offset and &lt;strong&gt;jump straight past any column it doesn't need&lt;/strong&gt;, without parsing a byte of it.&lt;/p&gt;
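
&lt;p&gt;Why a length index is enough to skip columns: start offsets are just a running sum of the lengths. The sketch below assumes the column bodies sit back to back right after the index, in index order; that layout and the helper names &lt;code&gt;columnOffsets&lt;/code&gt; and &lt;code&gt;column&lt;/code&gt; are assumptions for illustration.&lt;/p&gt;

```go
package main

import "fmt"

// columnOffsets turns per-column byte lengths into start offsets,
// measured from the first byte of the first column body.
func columnOffsets(lengths []uint32) []uint64 {
	offs := make([]uint64, len(lengths))
	var pos uint64
	for i, n := range lengths {
		offs[i] = pos
		pos += uint64(n)
	}
	return offs
}

// column slices out column i without looking at any other column's bytes.
func column(body []byte, lengths []uint32, i int) []byte {
	start := columnOffsets(lengths)[i]
	return body[start : start+uint64(lengths[i])]
}

func main() {
	lengths := []uint32{40, 12, 7} // hypothetical ts, level, code columns
	body := make([]byte, 59)       // stand-in for the encoded column bodies
	fmt.Println(columnOffsets(lengths))
	fmt.Println(len(column(body, lengths, 2)))
}
```

&lt;p&gt;Reaching the third column costs two additions, regardless of how large or how heavily compressed the first two are.&lt;/p&gt;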

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas9p58a3l73ojqdocyv3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fas9p58a3l73ojqdocyv3.png" alt="tagColStruct body layout: row count, shape decl, optional column-length index, then column bodies" width="800" height="2102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Querying the bytes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Marshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OptBalanced&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OptColumnIndex&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c"&gt;// "SELECT ts, code WHERE level='ERROR' AND code&amp;gt;=500" — over a []byte.&lt;/span&gt;
&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;Hot&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;TS&lt;/span&gt;   &lt;span class="kt"&gt;int64&lt;/span&gt; &lt;span class="s"&gt;`qdf:"ts"`&lt;/span&gt;
    &lt;span class="n"&gt;Code&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt; &lt;span class="s"&gt;`qdf:"code"`&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;hot&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Hot&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;hot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"ERROR"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="p"&gt;}))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the decoder actually does, in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Read the shape + column index.&lt;/strong&gt; Now it knows where every column starts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter columns&lt;/strong&gt; — decode &lt;em&gt;only&lt;/em&gt; the columns named in a predicate (&lt;code&gt;level&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;). Run each predicate across its whole column to produce a per-row bitmask.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Combine the masks&lt;/strong&gt; (AND here) into the surviving-row set.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project&lt;/strong&gt; — for the columns &lt;code&gt;Hot&lt;/code&gt; wants (&lt;code&gt;ts&lt;/code&gt;, &lt;code&gt;code&lt;/code&gt;), materialize values only at the surviving rows. &lt;code&gt;level&lt;/code&gt; was read to filter, then &lt;strong&gt;dropped&lt;/strong&gt; because &lt;code&gt;Hot&lt;/code&gt; doesn't contain it. Every other column is &lt;strong&gt;skipped via the index&lt;/strong&gt; — its bytes are never parsed.&lt;/li&gt;
&lt;/ol&gt;
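
&lt;p&gt;The filter-then-project order can be imitated on plain slices. In this sketch the columns are already decoded Go slices, so it only shows the control flow (predicate pass first, materialize survivors second), not the skipping of encoded bytes; &lt;code&gt;columns&lt;/code&gt;, &lt;code&gt;hot&lt;/code&gt; and &lt;code&gt;selectHot&lt;/code&gt; are illustrative names.&lt;/p&gt;

```go
package main

import "fmt"

// columns is a stand-in for a transposed batch: one slice per field.
type columns struct {
	ts    []int64
	level []string
	code  []int32
}

// hot is the projection: it carries ts and code but not level.
type hot struct {
	TS   int64
	Code int32
}

// selectHot mimics SELECT ts, code WHERE level='ERROR' AND code>=500.
func selectHot(c columns) []hot {
	// Start with every row alive, then let each predicate clear rows.
	keep := make([]bool, len(c.ts))
	for i := range keep {
		keep[i] = true
	}
	for i, s := range c.level {
		if s != "ERROR" {
			keep[i] = false
		}
	}
	for i, v := range c.code {
		if 500 > v {
			keep[i] = false
		}
	}
	// Project: build output rows only where the combined mask survived.
	var out []hot
	for i, ok := range keep {
		if ok {
			out = append(out, hot{TS: c.ts[i], Code: c.code[i]})
		}
	}
	return out
}

func main() {
	c := columns{
		ts:    []int64{1, 2, 3, 4},
		level: []string{"ERROR", "INFO", "ERROR", "ERROR"},
		code:  []int32{500, 503, 404, 502},
	}
	fmt.Println(selectHot(c))
}
```

&lt;p&gt;Note that &lt;code&gt;level&lt;/code&gt; is read by a predicate but never copied into the output, matching step 4.&lt;/p&gt;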

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ag0d9ljup4ukr9k52c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ag0d9ljup4ukr9k52c6.png" alt="Selective decode: read shape + index, one forward pass, decode only predicate/projected columns, scatter matched rows" width="800" height="1947"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The predicate engine: twin bitmasks + SQL three-valued logic
&lt;/h3&gt;

&lt;p&gt;It isn't just AND-of-equals. &lt;code&gt;And&lt;/code&gt;, &lt;code&gt;Or&lt;/code&gt;, &lt;code&gt;Not&lt;/code&gt; compose into a real predicate tree — and the tricky part is &lt;strong&gt;nullable columns&lt;/strong&gt;: in SQL, a comparison against &lt;code&gt;NULL&lt;/code&gt; is neither true nor false, it's &lt;code&gt;UNKNOWN&lt;/code&gt;. qdf gets this right with &lt;strong&gt;twin bitmasks per node&lt;/strong&gt;: a &lt;code&gt;T&lt;/code&gt; mask (rows definitely true) and an &lt;code&gt;F&lt;/code&gt; mask (rows definitely false). Anything in neither is &lt;code&gt;UNKNOWN&lt;/code&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Leaf:&lt;/strong&gt; run the predicate per present row → fills &lt;code&gt;T&lt;/code&gt;; &lt;code&gt;F = present &amp;amp;^ T&lt;/code&gt; (present-but-not-true). Absent (&lt;code&gt;nil&lt;/code&gt;) rows land in neither — &lt;code&gt;UNKNOWN&lt;/code&gt;, for free.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AND:&lt;/strong&gt; &lt;code&gt;T = T₁ &amp;amp; T₂&lt;/code&gt;, &lt;code&gt;F = F₁ | F₂&lt;/code&gt; (false if any child is false — even if another is unknown).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OR:&lt;/strong&gt; &lt;code&gt;T = T₁ | T₂&lt;/code&gt;, &lt;code&gt;F = F₁ &amp;amp; F₂&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NOT:&lt;/strong&gt; swap &lt;code&gt;T&lt;/code&gt; and &lt;code&gt;F&lt;/code&gt; (unknown stays unknown).&lt;/li&gt;
&lt;/ul&gt;
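&lt;p&gt;The four rules above fit in a few lines of Go. This is a minimal sketch, not qdf's implementation: the &lt;code&gt;tri&lt;/code&gt; type and the &lt;code&gt;leaf&lt;/code&gt;, &lt;code&gt;and&lt;/code&gt;, &lt;code&gt;or&lt;/code&gt;, &lt;code&gt;not&lt;/code&gt; helpers are hypothetical names, and a single &lt;code&gt;uint64&lt;/code&gt; stands in for a bitmask of up to 64 rows.&lt;/p&gt;

```go
package main

import "fmt"

// tri is a hypothetical node result: T marks rows that are definitely true,
// F marks rows that are definitely false. A row in neither mask is UNKNOWN.
// One uint64 covers up to 64 rows; a real engine would use a slice of words.
type tri struct{ T, F uint64 }

// leaf runs pred on every present row. Absent (nil) rows set neither bit.
func leaf(vals []int32, present uint64, pred func(int32) bool) tri {
	var t uint64
	for i, v := range vals {
		bit := uint64(1) << uint(i)
		if present&bit != 0 && pred(v) {
			t |= bit
		}
	}
	return tri{T: t, F: present &^ t} // present but not true
}

func and(a, b tri) tri { return tri{T: a.T & b.T, F: a.F | b.F} }
func or(a, b tri) tri  { return tri{T: a.T | b.T, F: a.F & b.F} }
func not(a tri) tri    { return tri{T: a.F, F: a.T} }

func main() {
	// Four rows. Row 2 is NULL in "code" (bit 2 of the present mask is 0).
	code := []int32{200, 503, 0, 404}
	codePresent := uint64(0b1011)
	level := []int32{1, 3, 3, 0} // hypothetical numeric levels, all present
	levelPresent := uint64(0b1111)

	isServerErr := leaf(code, codePresent, func(c int32) bool { return c >= 500 })
	isLevel3 := leaf(level, levelPresent, func(l int32) bool { return l == 3 })

	both := and(isServerErr, isLevel3)
	fmt.Printf("AND: T=%04b F=%04b\n", both.T, both.F) // AND: T=0010 F=1001

	neg := not(both)
	fmt.Printf("NOT: T=%04b F=%04b\n", neg.T, neg.F) // NOT: T=1001 F=0010
}
```

&lt;p&gt;Row 2 has a NULL &lt;code&gt;code&lt;/code&gt;, so it lands in neither mask after the AND and stays in neither after the NOT. A &lt;code&gt;WHERE&lt;/code&gt; that keeps only the root &lt;code&gt;T&lt;/code&gt; mask drops it either way.&lt;/p&gt;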

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby036vzamombgljl6usa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fby036vzamombgljl6usa.png" alt="Twin-bitmask predicate tree evaluating AND/OR/NOT with SQL three-valued logic" width="799" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The final result keeps only rows in the &lt;strong&gt;root &lt;code&gt;T&lt;/code&gt; mask&lt;/strong&gt; — TRUE, never FALSE, never UNKNOWN — which is exactly SQL &lt;code&gt;WHERE&lt;/code&gt; semantics.&lt;/p&gt;

&lt;p&gt;A neat optimization: a subtree with &lt;strong&gt;no nullable leaves can't produce UNKNOWN&lt;/strong&gt;, so qdf skips materializing its &lt;code&gt;F&lt;/code&gt; mask entirely and treats "not true" as the complement — one fewer pass over the rows.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;hot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"ERROR"&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
        &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;And&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="kt"&gt;int32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt; &lt;span class="p"&gt;}),&lt;/span&gt;
            &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Not&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"DEBUG"&lt;/span&gt; &lt;span class="p"&gt;})),&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The predicate is called &lt;strong&gt;once per present row against the native typed value&lt;/strong&gt; (&lt;code&gt;func(int32) bool&lt;/code&gt;, &lt;code&gt;func(string) bool&lt;/code&gt;) with &lt;strong&gt;zero interface boxing&lt;/strong&gt;. Pure projection without a filter is just &lt;code&gt;Select("ts","code")&lt;/code&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No mainstream Go serializer does this.&lt;/strong&gt; json, msgpack, protobuf, gob — all decode the whole message before you can read one field. For "store a wide batch, read a few columns or filter rows later," qdf is the only one that reads &lt;em&gt;less than everything&lt;/em&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Concretely, on a wide batch where the query keeps only a small fraction of the rows (i7-9750H):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~5× faster&lt;/strong&gt; than full decode (projection)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~5× less memory&lt;/strong&gt; than full decode&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~2.5× faster&lt;/strong&gt; than decode-everything-then-filter&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When it applies: you need &lt;code&gt;OptColumnIndex&lt;/code&gt; at encode time, a &lt;code&gt;[]struct&lt;/code&gt; batch, and flat-ish fields. The bigger and wider the batch and the more selective the query, the larger the win. It's the columnar-warehouse pattern brought to a plain Go &lt;code&gt;[]byte&lt;/code&gt; — no database, no schema. (It is &lt;em&gt;not&lt;/em&gt; for single messages or streaming — that's the row-by-row half of the design.)&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Decode: the fastest work is the work you skip
&lt;/h2&gt;

&lt;p&gt;Here's the claim that should change how you think about serializer performance:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Profile any serializer's decode and the truth is the same: it's allocation-bound, not CPU-bound.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Run &lt;code&gt;go test -memprofile&lt;/code&gt; on a string-heavy decode and view the profile by allocation count (&lt;code&gt;go tool pprof -sample_index=alloc_objects&lt;/code&gt;). On qdf's row path it's almost entirely &lt;strong&gt;one call: &lt;code&gt;(*Decoder).ReadString&lt;/code&gt;&lt;/strong&gt;, which copies string bodies out of the buffer into owned Go strings. Tag walking, bounds checks, and type dispatch are rounding error. So the levers that matter aren't clever ALU tricks. They're &lt;strong&gt;don't allocate&lt;/strong&gt; and &lt;strong&gt;don't decode&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lever 1 · Zero-copy decode
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;Event&lt;/span&gt;
&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithNoCopy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c"&gt;// strings alias data, no copy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;WithNoCopy&lt;/code&gt; returns strings and byte slices that point &lt;em&gt;into&lt;/em&gt; &lt;code&gt;data&lt;/code&gt; instead of copying out. On a string-heavy batch: &lt;strong&gt;~1.7× faster, 7000+ allocations collapse to 3&lt;/strong&gt; (what remains is little more than the output slice). The decoder is already pooled and its scratch buffers reused, so with aliasing there's essentially nothing left to allocate per value.&lt;/p&gt;

&lt;p&gt;The catch is honest and it's in the name. &lt;strong&gt;The returned values are valid only while &lt;code&gt;data&lt;/code&gt; stays alive and unmodified.&lt;/strong&gt; The footgun:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ResponseWriter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;buf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="k"&gt;defer&lt;/span&gt; &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// recycled!&lt;/span&gt;
    &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadFull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="n"&gt;Msg&lt;/span&gt;
    &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Unmarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;WithNoCopy&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;-&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="c"&gt;// msg.Field aliases buf … which is about to be reused → garbage&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a &lt;strong&gt;use-after-free the race detector won't catch&lt;/strong&gt; (it's not a data race, it's manual memory management). So &lt;code&gt;WithNoCopy&lt;/code&gt; is opt-in by design: perfect for read-and-discard over a buffer you own (a file, an mmap, a batch you process then drop), wrong when the decoded values outlive the buffer, as with a pooled request body that is recycled when the handler returns. It works on the reflection path, codegen, and streams.&lt;/p&gt;
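&lt;p&gt;The hazard can be reproduced without qdf. A minimal sketch, assuming a hypothetical &lt;code&gt;alias&lt;/code&gt; helper that builds a string over a byte slice the way a zero-copy decoder would, and a hypothetical &lt;code&gt;own&lt;/code&gt; helper that copies a value before it leaves the buffer's lifetime:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strings"
	"unsafe"
)

// alias returns a string that shares memory with b. It is a stand-in for
// what a zero-copy decoder hands out, not qdf code.
func alias(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// own returns a private copy that no longer depends on the source buffer.
func own(s string) string {
	return strings.Clone(s)
}

func main() {
	buf := []byte("ERROR") // imagine a pooled request buffer

	aliased := alias(buf) // valid only while buf is untouched
	owned := own(aliased) // safe to send to a queue or store

	copy(buf, "DEBUG") // the pool hands the buffer to another request

	fmt.Println(aliased) // reads whatever bytes the buffer holds now
	fmt.Println(owned)   // still the value that was decoded
}
```

&lt;p&gt;Copying only the values that escape keeps the zero-copy fast path for everything that is read and then discarded.&lt;/p&gt;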

&lt;h3&gt;
  
  
  Lever 2 · Decode in struct order
&lt;/h3&gt;

&lt;p&gt;The encoder writes fields in struct-declaration order, so on decode the next wire field is almost always the next struct field. The decoder keeps a &lt;strong&gt;cursor and tries the expected field first — one string compare — before falling back to a map lookup.&lt;/strong&gt; A profile of a wide-struct decode had &lt;strong&gt;~40% of time in &lt;code&gt;mapaccess1_faststr&lt;/code&gt; + the hash&lt;/strong&gt;; the cursor removes that on the common path. The map stays as the fallback, so out-of-order, partial, and unknown fields still decode correctly — you just pay the lookup for the ones that actually arrive out of order.&lt;/p&gt;
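&lt;p&gt;The cursor-then-map lookup is easy to picture in isolation. The sketch below is an illustration under assumptions, not qdf's decoder: &lt;code&gt;fieldTable&lt;/code&gt;, &lt;code&gt;lookup&lt;/code&gt;, and the &lt;code&gt;misses&lt;/code&gt; counter are hypothetical names.&lt;/p&gt;

```go
package main

import "fmt"

// fieldTable is a hypothetical descriptor: field names in declaration order
// plus a name-to-index map that is used only as a fallback.
type fieldTable struct {
	names  []string
	byName map[string]int
	cursor int // index of the field expected next
	misses int // lookups that fell back to the map (for the demo)
}

func newFieldTable(names ...string) *fieldTable {
	ft := &fieldTable{names: names, byName: make(map[string]int, len(names))}
	for i, n := range names {
		ft.byName[n] = i
	}
	return ft
}

// lookup returns the struct field index for a wire key, or -1 if unknown.
func (ft *fieldTable) lookup(key string) int {
	if ft.cursor < len(ft.names) && ft.names[ft.cursor] == key {
		i := ft.cursor
		ft.cursor++ // fast path: one string compare, no hashing
		return i
	}
	ft.misses++
	i, ok := ft.byName[key]
	if !ok {
		return -1 // unknown field: the caller skips its value
	}
	ft.cursor = i + 1 // resynchronise after an out-of-order field
	return i
}

func main() {
	ft := newFieldTable("ts", "level", "code", "msg")
	for _, k := range []string{"ts", "level", "code", "msg"} {
		ft.lookup(k)
	}
	fmt.Println("in-order misses:", ft.misses) // in-order misses: 0

	ft.cursor, ft.misses = 0, 0
	for _, k := range []string{"ts", "code", "extra", "msg"} {
		fmt.Println(k, "->", ft.lookup(k))
	}
	fmt.Println("out-of-order misses:", ft.misses) // out-of-order misses: 2
}
```

&lt;p&gt;Only the skipped-ahead key and the unknown key pay for the map lookup. Every key that arrives in declaration order costs one string comparison.&lt;/p&gt;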

&lt;h3&gt;
  
  
  Lever 3 · Lazy, pooled state
&lt;/h3&gt;

&lt;p&gt;Decoders come from a &lt;code&gt;sync.Pool&lt;/code&gt;, and their machinery — the intern table, scratch slices — allocates only on &lt;strong&gt;first use&lt;/strong&gt;. A plain struct decode never touches the intern table, so it never pays for it. (Concretely: moving that table behind a lazily-allocated pointer cut a chunk of per-call overhead, because the codegen path builds a fresh decoder per nested value and was zeroing ~4 KiB of table it never used.)&lt;/p&gt;
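&lt;p&gt;A rough sketch of the pattern, using only the standard library. The &lt;code&gt;decoder&lt;/code&gt; and &lt;code&gt;internTable&lt;/code&gt; types and the &lt;code&gt;getDecoder&lt;/code&gt;/&lt;code&gt;putDecoder&lt;/code&gt; helpers are hypothetical and only illustrate the idea of keeping optional state behind a pointer that is filled on first use:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"sync"
)

// internTable stands in for a large piece of optional decoder state
// (256 string headers, about 4 KiB on 64-bit platforms).
type internTable struct{ slots [256]string }

// decoder is a hypothetical pooled decoder. The table lives behind a pointer,
// so creating or resetting a decoder does not allocate or zero it.
type decoder struct {
	buf    []byte
	intern *internTable // nil until first use
}

var decoderPool = sync.Pool{New: func() any { return new(decoder) }}

func getDecoder(buf []byte) *decoder {
	d := decoderPool.Get().(*decoder)
	d.buf = buf
	return d
}

func putDecoder(d *decoder) {
	d.buf = nil // drop the reference so the pool does not pin caller data
	decoderPool.Put(d)
}

// table allocates the intern table the first time it is needed.
func (d *decoder) table() *internTable {
	if d.intern == nil {
		d.intern = new(internTable)
	}
	return d.intern
}

func main() {
	d := getDecoder([]byte("payload"))
	fmt.Println("table allocated before use:", d.intern != nil)
	d.table().slots[0] = "level"
	fmt.Println("table allocated after use:", d.intern != nil)
	putDecoder(d)
}
```

&lt;p&gt;A decode that never calls &lt;code&gt;table()&lt;/code&gt; never pays for the table, and a pooled decoder that did allocate it keeps it for the next caller.&lt;/p&gt;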

&lt;h3&gt;
  
  
  Lever 4 · The biggest win: don't decode at all
&lt;/h3&gt;

&lt;p&gt;Everything from §3 lands here too. Selective decode skips whole columns via the index and never rebuilds filtered rows. If your read pattern is "a few columns of a big batch," the fastest qdf decode is the one that touches almost none of the bytes. No micro-optimization beats not doing the work.&lt;/p&gt;

&lt;h3&gt;
  
  
  For the last drop: codegen
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;//go:generate qdfgen -type Event,Batch .&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;qdfgen&lt;/code&gt; emits concrete methods using &lt;strong&gt;only the public API&lt;/strong&gt; — no reflect at runtime, no descriptor lookup. The generated decoder is a flat key switch (and it threads &lt;code&gt;noCopy&lt;/code&gt;, so zero-copy works on generated types too):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;UnmarshalQDFOpts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;noCopy&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewDecoderOnBuf&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;noCopy&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SetNoCopy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadMapHeader&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="c"&gt;// …&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadStringBytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="c"&gt;// no alloc: compiler special-cases switch string([]byte)&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"name"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;rv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadString&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rv&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="s"&gt;"age"&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;rv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ReadInt&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;    &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Age&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c"&gt;// …&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On a fixed schema that's &lt;strong&gt;up to 8.5× faster decode than &lt;code&gt;encoding/json&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And on encode, &lt;code&gt;AppendMarshal&lt;/code&gt; hands you buffer ownership for zero per-call allocation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="m"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;AppendMarshal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OptBalanced&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c"&gt;// reuse your own buffer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
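&lt;p&gt;The append-style signature is the same idea as &lt;code&gt;strconv.AppendInt&lt;/code&gt; in the standard library: the caller supplies the destination and gets the extended slice back. A small sketch of why that avoids per-call allocation, with a hypothetical &lt;code&gt;appendRecord&lt;/code&gt; encoder standing in for a real one:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"strconv"
)

// appendRecord is a hypothetical append-style encoder: it writes into dst and
// returns the extended slice, so the caller decides where the bytes live.
func appendRecord(dst []byte, id int64, name string) []byte {
	dst = strconv.AppendInt(dst, id, 10)
	dst = append(dst, ':')
	return append(dst, name...)
}

func main() {
	out := make([]byte, 0, 64) // allocated once, reused for every record
	for i, name := range []string{"alpha", "beta"} {
		out = appendRecord(out[:0], int64(i), name)
		fmt.Printf("%s (cap %d)\n", out, cap(out))
	}
}
```

&lt;p&gt;As long as the buffer's capacity covers the largest record, every call writes into the same backing array.&lt;/p&gt;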



&lt;p&gt;&lt;strong&gt;The mental model:&lt;/strong&gt; encode allocations are constant (a flat 3, output buffer pooled); decode allocations scale with how much you ask for. So the two levers that matter are &lt;em&gt;alias-instead-of-copy&lt;/em&gt; (&lt;code&gt;WithNoCopy&lt;/code&gt;) and &lt;em&gt;ask-for-less&lt;/em&gt; (selective decode).&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Benchmarks, and how they're measured
&lt;/h2&gt;

&lt;p&gt;2019 i7-9750H, Go 1.26. Wire sizes are deterministic. Latencies are median of 6 runs; throughput claims use &lt;code&gt;benchstat&lt;/code&gt; over ≥10 interleaved runs so a single warm/cold run can't lie. Everything reproducible from the &lt;code&gt;bench/&lt;/code&gt; module — a &lt;em&gt;separate&lt;/em&gt; module so competitor deps (protobuf, &lt;code&gt;vmihailenco/msgpack&lt;/code&gt;, flatbuffers) stay out of the core, which has &lt;strong&gt;zero dependencies&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;bench
go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-run&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'^$'&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt; Decode &lt;span class="nt"&gt;-benchmem&lt;/span&gt; &lt;span class="nt"&gt;-count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10 | &lt;span class="nb"&gt;tee &lt;/span&gt;new.txt
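&lt;span class="c"&gt;# old.txt: output of the same command on the baseline being compared&lt;/span&gt;
benchstat old.txt new.txt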
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Wire size vs the field (bytes, lower is better)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;fixture&lt;/th&gt;
&lt;th&gt;json&lt;/th&gt;
&lt;th&gt;msgpack&lt;/th&gt;
&lt;th&gt;protobuf&lt;/th&gt;
&lt;th&gt;qdf balanced&lt;/th&gt;
&lt;th&gt;qdf compress&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OTLP 4×512&lt;/td&gt;
&lt;td&gt;1 027 033&lt;/td&gt;
&lt;td&gt;793 192&lt;/td&gt;
&lt;td&gt;561 860&lt;/td&gt;
&lt;td&gt;240 686&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;179 181&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Logs 1024&lt;/td&gt;
&lt;td&gt;245 037&lt;/td&gt;
&lt;td&gt;193 476&lt;/td&gt;
&lt;td&gt;156 479&lt;/td&gt;
&lt;td&gt;89 631&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;62 149&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTB 1024&lt;/td&gt;
&lt;td&gt;559 294&lt;/td&gt;
&lt;td&gt;428 404&lt;/td&gt;
&lt;td&gt;327 700&lt;/td&gt;
&lt;td&gt;258 167&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;203 360&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events 1024&lt;/td&gt;
&lt;td&gt;122 857&lt;/td&gt;
&lt;td&gt;84 712&lt;/td&gt;
&lt;td&gt;64 978&lt;/td&gt;
&lt;td&gt;39 650&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;39 639&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;IoT 32×256&lt;/td&gt;
&lt;td&gt;469 058&lt;/td&gt;
&lt;td&gt;224 534&lt;/td&gt;
&lt;td&gt;207 562&lt;/td&gt;
&lt;td&gt;158 474&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;148 177&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;With &lt;code&gt;qdf compress&lt;/code&gt;, the payload is smaller than protobuf on &lt;strong&gt;every&lt;/strong&gt; batch: OTLP −68%, Logs −60%, Events −39%, RTB −38%, IoT −29%. The whole gap comes from qdf compressing across records, which protobuf doesn't do.&lt;/p&gt;
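&lt;p&gt;The percentages compare the &lt;code&gt;qdf compress&lt;/code&gt; column with the &lt;code&gt;protobuf&lt;/code&gt; column of the table. A short check of that arithmetic, using a hypothetical &lt;code&gt;reduction&lt;/code&gt; helper:&lt;/p&gt;

```go
package main

import (
	"fmt"
	"math"
)

// reduction returns how much smaller got is than base, as a rounded percentage.
func reduction(base, got int) int {
	return int(math.Round(100 * (1 - float64(got)/float64(base))))
}

func main() {
	// Byte counts copied from the wire-size table.
	rows := []struct {
		name               string
		protobuf, compress int
	}{
		{"OTLP 4x512", 561860, 179181},
		{"Logs 1024", 156479, 62149},
		{"RTB 1024", 327700, 203360},
		{"Events 1024", 64978, 39639},
		{"IoT 32x256", 207562, 148177},
	}
	for _, r := range rows {
		fmt.Printf("%-12s -%d%%\n", r.name, reduction(r.protobuf, r.compress))
	}
}
```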

&lt;h3&gt;
  
  
  Throughput
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;workload&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Decode vs &lt;code&gt;encoding/json&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;4–9× faster&lt;/strong&gt; across payloads (2–7× vs msgpack)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Numeric/bool slices (QPack)&lt;/td&gt;
&lt;td&gt;5× smaller than json, &lt;strong&gt;21× faster encode, 80× faster decode&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SIMD bit-unpack (AVX2/NEON)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;22–53× over scalar&lt;/strong&gt;, ~50 GB/s (memory-bound)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;~150 MiB realistic payload (Dense)&lt;/td&gt;
&lt;td&gt;7.5× faster encode, &lt;strong&gt;8.1× faster decode&lt;/strong&gt; than json&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Encode (Fast, pooled)&lt;/td&gt;
&lt;td&gt;~1.1 GB/s, &lt;strong&gt;3 allocs/op&lt;/strong&gt; — vs ~1000 allocs/op for json &amp;amp; msgpack&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Zero-copy decode (string batch)&lt;/td&gt;
&lt;td&gt;7002 → &lt;strong&gt;3 allocs&lt;/strong&gt;, −38% B/op, ~1.7× faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codegen decode&lt;/td&gt;
&lt;td&gt;up to &lt;strong&gt;8.5× over json&lt;/strong&gt; on a fixed schema&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Selective decode (few columns)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;~5× faster &amp;amp; ~5× less memory&lt;/strong&gt; than full decode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm49rplojbqzb9akrvg3u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm49rplojbqzb9akrvg3u.png" alt="Where qdf's wins come from, grouped by what each saves: CPU time, memory/allocs, wire size" width="799" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Note the asymmetry: encode is a flat 3 allocations no matter the payload; decode allocations scale with how much you ask for — which is exactly why &lt;code&gt;WithNoCopy&lt;/code&gt; and selective decode matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Which knob, when
&lt;/h2&gt;

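&lt;p&gt;One &lt;code&gt;Options&lt;/code&gt; bitmask on the &lt;strong&gt;encode&lt;/strong&gt; side. You never pass format options to &lt;code&gt;Unmarshal&lt;/code&gt;: it reads the header and handles whatever it gets. The only things you hand it are decode-side choices such as &lt;code&gt;WithNoCopy()&lt;/code&gt; or a &lt;code&gt;Select&lt;/code&gt;/&lt;code&gt;Where&lt;/code&gt; query.&lt;/p&gt;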

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Option&lt;/th&gt;
&lt;th&gt;Reach for it when&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OptSpeed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Hot path, single messages, sub-µs latency. msgpack-shaped. The drop-in &lt;code&gt;encoding/json&lt;/code&gt; replacement. Also: use it if you hash/sign the bytes.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OptBalanced&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Default for batches:&lt;/strong&gt; Dense interning + adaptive numeric codecs. Big wire win, still fast.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
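&lt;td&gt;&lt;code&gt;OptBalanced | OptColumnIndex&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Batches you plan to query later. The column index is required for selective decode (&lt;code&gt;Select&lt;/code&gt;/&lt;code&gt;Where&lt;/code&gt;).&lt;/td&gt;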
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;OptCompression&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Cold storage. Adds Gorilla/ALP + rANS. Smallest wire; encode slower — write-once-read-rarely.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;WithNoCopy()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Read-mostly over a buffer you own and won't mutate. Near-zero-alloc decode.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;AppendMarshal&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Own the output buffer for zero per-call allocation.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;qdfgen&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fixed schema, every nanosecond counts — reflection-free generated methods.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The presets are just bundles of bits you'd compose by hand anyway:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;OptSpeed&lt;/span&gt;       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt; &lt;span class="c"&gt;// Fast mode, nothing on&lt;/span&gt;
    &lt;span class="n"&gt;OptBalanced&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OptDense&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptQPack&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptShapeIntern&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptPairPred&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptMTF&lt;/span&gt;
    &lt;span class="n"&gt;OptCompression&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OptBalanced&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptGorillaFloat&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;OptRANS&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One axis, left to right: &lt;strong&gt;lowest CPU → smallest bytes&lt;/strong&gt;. And every step is &lt;em&gt;never-larger&lt;/em&gt;, so moving right never inflates a buffer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OptSpeed  ──▶  OptBalanced  ──▶  OptCompression
fastest        −60% vs proto     smallest
≈ msgpack      still fast        slower encode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyufdhqo9omjro6jo3fub.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyufdhqo9omjro6jo3fub.png" alt="Options axis: presets, the bits each turns on, and the CPU-vs-size tradeoff" width="799" height="466"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The same Logs-1024 batch, measured: json &lt;strong&gt;245 KB&lt;/strong&gt; → msgpack 193 KB → protobuf 156 KB → &lt;code&gt;OptBalanced&lt;/code&gt; &lt;strong&gt;90 KB&lt;/strong&gt; → &lt;code&gt;OptCompression&lt;/code&gt; &lt;strong&gt;62 KB&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two build tags — free performance, off by default
&lt;/h3&gt;

&lt;p&gt;Orthogonal to &lt;code&gt;Options&lt;/code&gt;: these change the generated machine code, not the wire. Same bytes, faster processing.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-tags qdf_simd&lt;/code&gt; — AVX2 (amd64) / NEON (arm64) bit-pack kernels, byte-identical output, runtime CPUID gate + scalar fallback. &lt;strong&gt;22–53× over scalar.&lt;/strong&gt; If you run &lt;code&gt;OptBalanced&lt;/code&gt;/&lt;code&gt;OptCompression&lt;/code&gt; on numeric data, turn it on — it's free.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-tags qdf_reflect2&lt;/code&gt; — swaps &lt;code&gt;reflect.MakeSlice&lt;/code&gt;/&lt;code&gt;MakeMapWithSize&lt;/code&gt;/&lt;code&gt;New&lt;/code&gt; for &lt;code&gt;modern-go/reflect2&lt;/code&gt; unsafe equivalents → smaller decode allocations on map/slice-heavy payloads. The one honesty note: this is the &lt;strong&gt;single opt-out from zero-dependency&lt;/strong&gt;. Worth it if your data is map/slice-dense and you're not on codegen.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go build &lt;span class="nt"&gt;-tags&lt;/span&gt; &lt;span class="s2"&gt;"qdf_simd qdf_reflect2"&lt;/span&gt; ./... // combine freely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Streaming
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;enc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewStreamEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Dense&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;enc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;dec&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;qdf&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewStreamDecoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;ev&lt;/span&gt; &lt;span class="n"&gt;Event&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;dec&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;ev&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EOF&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The header is written once, and the &lt;strong&gt;Dense intern table is shared across messages&lt;/strong&gt;, so a 10k-row log pays for each distinct key once (the second message's &lt;code&gt;"region":"eu-west-1"&lt;/code&gt; is a back-reference into the first). &lt;strong&gt;Each message is length-framed&lt;/strong&gt;: a uvarint byte count precedes its body, so a message of &lt;em&gt;any&lt;/em&gt; size round-trips, even across a reader that hands you one byte per &lt;code&gt;Read&lt;/code&gt;, and &lt;code&gt;io.EOF&lt;/code&gt; marks the end cleanly. &lt;code&gt;SetNoCopy&lt;/code&gt; works here too; aliases stay valid for the stream's lifetime because the window is never compacted.&lt;/p&gt;
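
&lt;p&gt;To make the shared intern table concrete, here is a minimal sketch of the general idea. It is not qdf's encoder or wire format: &lt;code&gt;internTable&lt;/code&gt;, &lt;code&gt;ref&lt;/code&gt; and &lt;code&gt;describe&lt;/code&gt; are hypothetical names, and the sketch only prints what a writer would emit (a literal the first time a string is seen, a back-reference afterwards) when one table outlives a single message.&lt;/p&gt;

```go
package main

import "fmt"

// internTable is a hypothetical stand-in for a stream-wide string table:
// each distinct string gets a small integer ID the first time it appears.
type internTable map[string]int

// ref returns the ID for s and reports whether s was already known.
func (t internTable) ref(s string) (int, bool) {
	if id, ok := t[s]; ok {
		return id, true
	}
	id := len(t)
	t[s] = id
	return id, false
}

// describe renders what a writer would emit for s: the full literal on
// first sight, a short back-reference on every later occurrence.
func describe(t internTable, s string) string {
	id, seen := t.ref(s)
	if seen {
		return fmt.Sprintf("ref #%d", id)
	}
	return fmt.Sprintf("literal %q (becomes #%d)", s, id)
}

func main() {
	// One table for the whole stream, not one per message.
	shared := internTable{}
	messages := [][]string{
		{"region", "eu-west-1"},
		{"region", "eu-west-1"},
		{"region", "us-east-1"},
	}
	for i, msg := range messages {
		for _, s := range msg {
			fmt.Printf("msg %d: %s\n", i+1, describe(shared, s))
		}
	}
}
```

&lt;p&gt;Only the first message pays for the literals; the second is two references, and the third adds a single new literal.&lt;/p&gt;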

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QDF hdr   │ len₁ · msg₁ │ len₂ · msg₂ │ … EOF
5B once   │ uvarint+body│ uvarint+body│
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F694axxa24ch36lo3x6n5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F694axxa24ch36lo3x6n5.png" alt="Stream framing: 5-byte header once, then each message length-delimited by a uvarint byte-count" width="776" height="2552"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Streaming and columnar are the two halves of the design: &lt;strong&gt;streaming is row-by-row for unbounded feeds; columnar is a complete batch you can query.&lt;/strong&gt; So the whole-batch features — &lt;code&gt;OptColumnIndex&lt;/code&gt;, &lt;code&gt;Where&lt;/code&gt;/&lt;code&gt;Select&lt;/code&gt;, &lt;code&gt;OptRANS&lt;/code&gt; — aren't part of streaming, by design.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Where it doesn't win (the honest part)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;OptSpeed&lt;/code&gt; wire size is roughly on par with msgpack: the speed tier skips columnar compression on purpose. Use &lt;code&gt;OptBalanced&lt;/code&gt; when you want the bytes back.&lt;/li&gt;
&lt;li&gt;The compression tier's encode is slower (Gorilla/ALP cost real CPU). It's a storage play, not a hot path.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;protobuf and flatbuffers still win raw single-message decode and single-tiny-message size&lt;/strong&gt; — generated code and zero-copy field access are hard to beat when there's no batch to amortize over. Different tool for "one small message, decoded whole, hot."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;qdf's sweet spot is &lt;strong&gt;batches of structured records&lt;/strong&gt; you want small on the wire and partially readable later: telemetry, logging, metrics, analytics, event sourcing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go get github.com/alex60217101990/qdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pure Go, zero dependencies — nothing to vendor, no schema compiler in your pipeline. Swap it in where you use &lt;code&gt;encoding/json&lt;/code&gt;, flip a batch path to &lt;code&gt;OptBalanced|OptColumnIndex&lt;/code&gt;, read back just the columns you need — then go stare at your allocation graph.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/alex60217101990/qdf" rel="noopener noreferrer"&gt;github.com/alex60217101990/qdf&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Full API reference: &lt;a href="https://pkg.go.dev/github.com/alex60217101990/qdf@v0.0.1" rel="noopener noreferrer"&gt;pkg.go.dev/github.com/alex60217101990/qdf&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the query model or the codec picker is useful to you, a ⭐ on the repo helps others find it. And if you find a payload shape where qdf loses that it shouldn't — open an issue with the fixture. That's the most useful bug report there is.&lt;/p&gt;

</description>
      <category>go</category>
      <category>opensource</category>
      <category>serialization</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
