<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yu Qian Yang</title>
    <description>The latest articles on DEV Community by Yu Qian Yang (@yu_qianyang_db00999789f2).</description>
    <link>https://dev.to/yu_qianyang_db00999789f2</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878715%2Fd15d7afc-e5d1-40ad-a32f-da57f1977247.png</url>
      <title>DEV Community: Yu Qian Yang</title>
      <link>https://dev.to/yu_qianyang_db00999789f2</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yu_qianyang_db00999789f2"/>
    <language>en</language>
    <item>
      <title>How I Built a Runtime Adaptive Compression System That Selects the Best Algorithm Without Scanning All Data</title>
      <dc:creator>Yu Qian Yang</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:31:07 +0000</pubDate>
      <link>https://dev.to/yu_qianyang_db00999789f2/how-i-built-a-runtime-adaptive-compression-system-that-selects-the-best-algorithm-without-scanning-j30</link>
      <guid>https://dev.to/yu_qianyang_db00999789f2/how-i-built-a-runtime-adaptive-compression-system-that-selects-the-best-algorithm-without-scanning-j30</guid>
      <description>&lt;p&gt;When I was working on a columnar time-series database built on Greenplum, we faced a&lt;br&gt;
problem that every time-series storage engineer eventually hits:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;No single compression algorithm works best for all data.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A sensor reading temperature every second looks nothing like a financial tick stream.&lt;br&gt;
A column of integer IDs compresses completely differently from a column of floating-point&lt;br&gt;
measurements. And in the real world, the same column can change character over time —&lt;br&gt;
steady signal for hours, then suddenly a burst of noise.&lt;/p&gt;

&lt;p&gt;The naive answer is: let the user pick an algorithm. But that puts the burden on the user,&lt;br&gt;
and in practice they almost always pick wrong — or just leave the default.&lt;/p&gt;

&lt;p&gt;So I designed &lt;strong&gt;automode&lt;/strong&gt;: a runtime analyzer that samples incoming data, detects its&lt;br&gt;
statistical character, and selects the best encoding chain automatically.&lt;/p&gt;

&lt;p&gt;Here's how it works.&lt;/p&gt;


&lt;h2&gt;The Encoding Chain&lt;/h2&gt;

&lt;p&gt;Before talking about the analyzer, I need to briefly explain what it's selecting between.&lt;/p&gt;

&lt;p&gt;We built a set of encoding algorithms, each optimized for a specific data pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Run-Length Encoding&lt;/strong&gt; — all values are identical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta-of-Delta&lt;/strong&gt; — slowly changing values (timestamps, smooth sensors)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta + ZigZag&lt;/strong&gt; — values oscillating around a baseline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bit Packing&lt;/strong&gt; — small integers that can be bit-packed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gorilla&lt;/strong&gt; — floating-point values with small XOR between adjacent values&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The insight that made automode possible: these algorithms operate on fundamentally&lt;br&gt;
different &lt;em&gt;statistical features&lt;/em&gt; of data. If I can identify the dominant feature of a&lt;br&gt;
data stream, I can map it directly to the right algorithm.&lt;/p&gt;
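&lt;p&gt;To make one link of that chain concrete, here is a minimal Python sketch of the Delta + ZigZag step (the production system is not Python; the function names are invented for illustration):&lt;/p&gt;

```python
def zigzag(n):
    # Interleave signed values into unsigned ones:
    # 0, -1, 1, -2, 2, ... becomes 0, 1, 2, 3, 4, ...
    return 2 * n if n >= 0 else -2 * n - 1

def delta_zigzag_encode(values):
    # First-order deltas, then zigzag, so a stream oscillating
    # around a baseline turns into small non-negative integers
    # that a later bit-packing stage can store compactly.
    out, prev = [], 0
    for v in values:
        out.append(zigzag(v - prev))
        prev = v
    return out
```

&lt;p&gt;Each of the other encodings follows the same shape: a cheap transform that turns one specific statistical pattern into small integers.&lt;/p&gt;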


&lt;h2&gt;The Design Problem&lt;/h2&gt;

&lt;p&gt;The obvious approach — scan all the data and compute statistics — doesn't work in a&lt;br&gt;
storage engine. The whole point of compression is to be fast. Scanning every value&lt;br&gt;
before encoding defeats the purpose.&lt;/p&gt;

&lt;p&gt;I needed a way to characterize a data stream with &lt;strong&gt;minimal overhead&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The solution: &lt;strong&gt;sample windows&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of scanning the full column, I divide the input stream into fixed-size windows&lt;br&gt;
and analyze only a small number of items from each window. Within each window, the&lt;br&gt;
sample start position is randomized to avoid being fooled by local patterns.&lt;/p&gt;

&lt;p&gt;Why randomize? Consider this sequence:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[1 1 1 1 1 1 1 1 1 1 1 1 2 3 2 3 2 3 2 3 2 3 2 3]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you always sample from the start, you'd classify this as "all equal" and pick&lt;br&gt;
Run-Length Encoding. But if you look at the full stream, it's clearly an alternating&lt;br&gt;
pattern — RLE would be terrible. Random sampling within the window gives a much more&lt;br&gt;
representative picture.&lt;/p&gt;
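&lt;p&gt;The sampling scheme can be sketched in a few lines of Python (window and sample sizes here are illustrative, not the tuned production defaults):&lt;/p&gt;

```python
import random

def sample_windows(values, window=1024, samples=16):
    # Pick a small contiguous run from each fixed-size window,
    # starting at a random offset inside the window so a local
    # pattern at the window head cannot dominate the picture.
    picked = []
    for base in range(0, len(values), window):
        span = values[base:base + window]
        if samples >= len(span):          # short tail window: take it all
            picked.append(span)
            continue
        start = random.randrange(len(span) - samples + 1)
        picked.append(span[start:start + samples])
    return picked
```

&lt;p&gt;The cost is proportional to the number of windows times the sample size, not to the column length, which is what keeps the analyzer's overhead minimal.&lt;/p&gt;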




&lt;h2&gt;The Five Features&lt;/h2&gt;

&lt;p&gt;Each sampled window is analyzed for five statistical features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All Equal&lt;/strong&gt; — all sampled values are identical&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Values bitwise small&lt;/strong&gt; — the values themselves fit in few bits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1st-order delta small&lt;/strong&gt; — differences between adjacent values are small&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2nd-order delta small&lt;/strong&gt; — differences between differences are small (delta-of-delta)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Float XOR small&lt;/strong&gt; — XOR of adjacent float values is small (Gorilla pattern)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For each window, the analyzer computes 0th-, 1st-, and 2nd-order differences, plus&lt;br&gt;
XOR pairs for float data. "Small" is defined bitwise: a value is small if its&lt;br&gt;
significant bit length is below a configurable threshold. This makes the check&lt;br&gt;
robust against occasional spikes — a single outlier doesn't ruin the&lt;br&gt;
classification.&lt;/p&gt;
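&lt;p&gt;A sketch of the per-window feature extraction, assuming an illustrative bit threshold and spike tolerance (the real thresholds are configurable):&lt;/p&gt;

```python
import struct

def bits(n):
    # Significant bit length of the magnitude (0 for zero).
    return abs(n).bit_length()

def mostly_small(xs, threshold=8, tolerate=0.1):
    # "Small" is bitwise: an item qualifies if its significant bit
    # length is under the threshold. A small fraction of spikes is
    # forgiven so one outlier cannot ruin the classification.
    if not xs:
        return False
    hits = sum(1 for x in xs if threshold > bits(x))
    return hits >= len(xs) * (1.0 - tolerate)

def float_xor(a, b):
    # XOR of the raw IEEE 754 bit patterns — the Gorilla feature.
    ua = struct.unpack('>Q', struct.pack('>d', a))[0]
    ub = struct.unpack('>Q', struct.pack('>d', b))[0]
    return ua ^ ub

def window_features(sample):
    d1 = [b - a for a, b in zip(sample, sample[1:])]   # 1st-order deltas
    d2 = [b - a for a, b in zip(d1, d1[1:])]           # 2nd-order deltas
    return {
        'all_equal': all(x == sample[0] for x in sample),
        'value_small': mostly_small(sample),
        'delta1_small': mostly_small(d1),
        'delta2_small': mostly_small(d2),
    }
```

&lt;p&gt;For float columns, the same idea applies to &lt;code&gt;float_xor&lt;/code&gt; of adjacent pairs instead of arithmetic deltas.&lt;/p&gt;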




&lt;h2&gt;Strong vs. Weak Features&lt;/h2&gt;

&lt;p&gt;Not all features are equal. I introduced a &lt;strong&gt;Strong/Weak classification&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Strongest&lt;/strong&gt;: All Equal → Run-Length Encoding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong&lt;/strong&gt;: 2nd-order delta small → Delta-of-Delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strong&lt;/strong&gt;: Float XOR small → Gorilla&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weak&lt;/strong&gt;: Values bitwise small → Bit Packing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Weak&lt;/strong&gt;: 1st-order delta small → Delta + ZigZag&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why does this matter? Consider a case where Delta-of-Delta and Bit Packing score&lt;br&gt;
similarly. Delta-of-Delta is fundamentally better for smoothly varying time-series&lt;br&gt;
data, even with a slightly lower score. The weight system ensures strong methods&lt;br&gt;
are preferred when they're competitive.&lt;/p&gt;
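&lt;p&gt;One way to realize that preference is a weighted vote across sampled windows. The weights below are purely illustrative — they just encode the Strongest / Strong / Weak ordering, not the tuned production values:&lt;/p&gt;

```python
# Illustrative strength weights per feature.
WEIGHT = {
    'all_equal': 3.0,        # strongest
    'delta2_small': 2.0,     # strong
    'float_xor_small': 2.0,  # strong
    'value_small': 1.0,      # weak
    'delta1_small': 1.0,     # weak
}

ALGORITHM = {
    'all_equal': 'rle',
    'delta2_small': 'delta_of_delta',
    'float_xor_small': 'gorilla',
    'value_small': 'bit_packing',
    'delta1_small': 'delta_zigzag',
}

def choose(feature_counts):
    # feature_counts: how many sampled windows exhibited each feature.
    # Scaling by the strength weight lets a strong method win whenever
    # it is merely competitive with a weak one.
    best, best_score = None, 0.0
    for feat, count in feature_counts.items():
        score = count * WEIGHT.get(feat, 0.0)
        if score > best_score:
            best, best_score = feat, score
    return ALGORITHM[best] if best else 'general'
```

&lt;p&gt;So a column where 4 of 10 windows look delta-of-delta-small and 6 look bitwise-small still routes to Delta-of-Delta.&lt;/p&gt;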




&lt;h2&gt;One Edge Case Worth Noting&lt;/h2&gt;

&lt;p&gt;The alternating pattern &lt;code&gt;[A B A B A B]&lt;/code&gt; is a trap.&lt;/p&gt;

&lt;p&gt;It looks like "values bitwise small", but Bit Packing performs poorly on alternating&lt;br&gt;
sequences. A general-purpose compressor actually handles this pattern better.&lt;/p&gt;

&lt;p&gt;I added an explicit check: if odd-indexed and even-indexed items within the sample&lt;br&gt;
are each internally equal, it's an alternating pattern — skip the Bit Packing&lt;br&gt;
classification and fall back to general compression.&lt;/p&gt;
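&lt;p&gt;The check itself is simple — split the sample by index parity and test each half for internal equality (a Python sketch, not the production code):&lt;/p&gt;

```python
def is_alternating(sample):
    # [A B A B ...]: even-indexed items all equal, odd-indexed items
    # all equal, and the two groups differ from each other.
    if 4 > len(sample):
        return False
    even, odd = sample[0::2], sample[1::2]
    return (all(x == even[0] for x in even)
            and all(x == odd[0] for x in odd)
            and even[0] != odd[0])
```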

&lt;p&gt;This kind of edge case only appears when you test against real production data. In our&lt;br&gt;
case, it showed up in sensor data from a deployment at a large financial institution.&lt;/p&gt;




&lt;h2&gt;Feature → Algorithm Mapping&lt;/h2&gt;

&lt;p&gt;Once features are aggregated across all sample windows, the dominant feature maps to&lt;br&gt;
an algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;All Equal&lt;/strong&gt; → Run-Length Encoding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2nd-order delta small&lt;/strong&gt; → Delta-of-Delta&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Values bitwise small&lt;/strong&gt; → Bit Packing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;1st-order delta small&lt;/strong&gt; → Delta + ZigZag&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Float XOR small&lt;/strong&gt; → Gorilla&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No clear pattern&lt;/strong&gt; → General Compression (fallback)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For floating-point columns, the encoding chain adds a float-to-integer conversion pass&lt;br&gt;
first — converting floats that happen to be integers (like &lt;code&gt;1.0&lt;/code&gt;, &lt;code&gt;2.0&lt;/code&gt;) into actual&lt;br&gt;
integers before passing to the integer encoding path. This is worth a separate article.&lt;/p&gt;
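&lt;p&gt;Part 2 covers the real bit-level IEEE 754 detection; as a naive stand-in, the convertibility check amounts to this (sketch only — the production version avoids per-value float arithmetic):&lt;/p&gt;

```python
import math

def convertible_to_int(sample):
    # True when every sampled float carries an exact integer value
    # (like 1.0 or 2.0), so the column can be routed through the
    # integer encoding path after a float-to-int conversion pass.
    return all(math.isfinite(x) and float(x).is_integer() for x in sample)
```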




&lt;h2&gt;Two Modes&lt;/h2&gt;

&lt;p&gt;The system supports two optimization priorities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Compression-rate mode&lt;/strong&gt;: maximize storage efficiency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed mode&lt;/strong&gt;: maximize decompression speed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, compression-rate mode is used for cold storage and analytics workloads,&lt;br&gt;
and speed mode for hot data that's frequently queried. The same feature detection runs&lt;br&gt;
in both cases — only the final algorithm selection weights differ.&lt;/p&gt;




&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;The system is deployed in production, including a large-scale installation&lt;br&gt;
at a major financial institution managing thousands of PBs of time-series data. In&lt;br&gt;
testing, automode consistently matched or outperformed manually tuned single-algorithm&lt;br&gt;
compression across diverse real-world datasets.&lt;/p&gt;

&lt;p&gt;The key insight that made it work: &lt;strong&gt;you don't need to scan all the data to understand&lt;br&gt;
its character&lt;/strong&gt;. A few dozen samples, taken randomly from fixed-size windows, are&lt;br&gt;
enough to make a reliable classification — as long as you handle the edge cases.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;This article is part of a series on compression engineering in time-series databases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Part 2&lt;/strong&gt;: Float-to-integer encoding — how I used IEEE 754 bit manipulation to detect
convertibility with near-zero overhead&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 3&lt;/strong&gt;: Chained encoding — a unified float compression pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Part 4&lt;/strong&gt;: An improved floating-point compression algorithm based on ELF&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm currently available for freelance work on backend systems, storage engineering,&lt;br&gt;
and systems integration. Feel free to reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>performance</category>
      <category>cpp</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
