<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MS Dev</title>
    <description>The latest articles on DEV Community by MS Dev (@ms-dev).</description>
    <link>https://dev.to/ms-dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3072986%2F7e1b6cd0-f54f-465d-b5dd-c6feb343b092.png</url>
      <title>DEV Community: MS Dev</title>
      <link>https://dev.to/ms-dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ms-dev"/>
    <language>en</language>
    <item>
      <title>Optimizing ClickHouse for Financial Time Series</title>
      <dc:creator>MS Dev</dc:creator>
      <pubDate>Mon, 21 Apr 2025 22:41:10 +0000</pubDate>
      <link>https://dev.to/ms-dev/optimizing-clickhouse-for-financial-time-series-39d6</link>
      <guid>https://dev.to/ms-dev/optimizing-clickhouse-for-financial-time-series-39d6</guid>
      <description>&lt;p&gt;Our product, &lt;a href="https://marketlens.app" rel="noopener noreferrer"&gt;https://marketlens.app&lt;/a&gt;, is a market visualization tool that receives data from exchanges’ WebSockets. Top exchanges produce gigabytes of data daily. For instance, Binance ETH/USDT perpetual futures recorded 7 million trades in a single day on February 3, 2025.&lt;/p&gt;

&lt;p&gt;To illustrate the volume of trades, we queried the top three trading days by trade count:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;trades&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Trade Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;binance:futures:linear:ETHUSDT&lt;/td&gt;
&lt;td&gt;2021-05-19&lt;/td&gt;
&lt;td&gt;8,001,640&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bybit:futures:linear:ETHUSDT&lt;/td&gt;
&lt;td&gt;2025-02-03&lt;/td&gt;
&lt;td&gt;7,229,811&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;bybit:futures:linear:BTCUSDT&lt;/td&gt;
&lt;td&gt;2025-02-03&lt;/td&gt;
&lt;td&gt;6,713,708&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Orderbook updates can significantly outnumber trades. For example, Binance BTC/USDT perpetual futures had 267 million orderbook updates on April 7, 2025, despite Binance providing only aggregated updates every 100 milliseconds.&lt;/p&gt;

&lt;p&gt;Here are the top three days by orderbook update volume:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;length&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prices&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;num_of_updates&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;orderbooks&lt;/span&gt; &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;symbol&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;toDate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;num_of_updates&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt; &lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Number of Updates&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;binance:futures:linear:BTCUSDT&lt;/td&gt;
&lt;td&gt;2025-04-07&lt;/td&gt;
&lt;td&gt;267,488,693&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;binance:futures:linear:BTCUSDT&lt;/td&gt;
&lt;td&gt;2025-03-04&lt;/td&gt;
&lt;td&gt;246,896,146&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;binance:futures:linear:BTCUSDT&lt;/td&gt;
&lt;td&gt;2025-03-07&lt;/td&gt;
&lt;td&gt;241,906,906&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We have collected approximately 900 GB of data over the past several months for top markets like BTC and ETH on Binance, and we have optimized storage as much as possible. We would like to share our findings with you.&lt;/p&gt;

&lt;p&gt;Both trades and orderbook updates can be naturally sorted by timestamps, which determines the order for columns and codecs. Let’s assume this sort order for our analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade IDs
&lt;/h2&gt;

&lt;p&gt;Trade IDs are used by exchanges to uniquely identify each trade. Most exchanges use sequential IDs, while some, like Bybit Futures, use UUIDs.&lt;/p&gt;

&lt;p&gt;Sequential IDs are straightforward, and the &lt;code&gt;DoubleDelta&lt;/code&gt; codec performs best. The &lt;code&gt;Delta&lt;/code&gt; codec transforms a sequence like &lt;code&gt;[2, 3, 4, 5, 6]&lt;/code&gt; into &lt;code&gt;[1, 1, 1, 1, 1]&lt;/code&gt;, and &lt;code&gt;DoubleDelta&lt;/code&gt; goes further, producing &lt;code&gt;[0, 0, 0, 0, 0]&lt;/code&gt;. Applying &lt;code&gt;ZSTD&lt;/code&gt; further improves the compression ratio, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pqgob983ng06x0tagd1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pqgob983ng06x0tagd1.png" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UUIDs, on the other hand, are designed to be unique and random, making them effectively incompressible. Compression methods like &lt;code&gt;NONE&lt;/code&gt;, &lt;code&gt;LZ4&lt;/code&gt;, and &lt;code&gt;ZSTD&lt;/code&gt; yield a compression ratio of 1. Fortunately, Bybit Futures is the only exchange we encountered that relies on UUIDs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade Timestamps
&lt;/h2&gt;

&lt;p&gt;Binance provides a stream that aggregates trades with the same prices, while Bybit pushes all unique trades separately. Both claim to push data in real-time. Most exchanges provide trade timestamps with millisecond precision, with only Coinbase providing microseconds. Therefore, our datatype is &lt;code&gt;DateTime64(6, 'UTC')&lt;/code&gt;, which stores microseconds since epoch start (1970-01-01 00:00:00 UTC) as &lt;code&gt;Int64&lt;/code&gt;. It seems that trade timestamps will be quite random, but let’s see what we get.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhyxxnfrrno2g8btq5vr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhyxxnfrrno2g8btq5vr.png" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ZSTD&lt;/code&gt; effectively eliminates repetitive parts in sequential timestamps. We didn’t expect anything from &lt;code&gt;T64&lt;/code&gt; and were surprised that it improved plain &lt;code&gt;LZ4&lt;/code&gt; compression, however, worsened plain &lt;code&gt;ZSTD&lt;/code&gt;. &lt;code&gt;Delta&lt;/code&gt; improved compression for both &lt;code&gt;LZ4&lt;/code&gt; and &lt;code&gt;ZSTD&lt;/code&gt; probably just by simplifying work for them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade Prices
&lt;/h2&gt;

&lt;p&gt;Prices are decimal values. Binance and Bybit provide them as JSON strings, while Deribit uses the JSON number type, which is effectively a string as well, just without quotes. Precision varies significantly. The smallest precision we encountered was &lt;a href="https://www.mexc.com/exchange/ZENIX_USDT" rel="noopener noreferrer"&gt;ZENIX/USDT on Mexc&lt;/a&gt; with 25(!) decimal places; most exchanges are more reasonable, with no more than 9 decimal places used. We considered several options.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Float64&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;The binary internal representation of floats leads to precision loss for decimal numbers; however, under certain constraints, precision loss is recoverable back to exact initial decimal values.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frztzsec23ee8qs9fg4ux.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frztzsec23ee8qs9fg4ux.png" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Trade price changes are random but changes are likely just several ticks, so the &lt;code&gt;Delta&lt;/code&gt; codec gives a clear improvement. &lt;code&gt;Gorilla&lt;/code&gt; is a codec specifically designed for floating-point numbers and their binary representation. It calculates XOR between adjacent values and writes it in compact binary form. Here we see a little improvement over plain &lt;code&gt;LZ4&lt;/code&gt; compression, but no improvement over plain &lt;code&gt;ZSTD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;We tried “dumb” combinations like &lt;code&gt;Gorilla + Delta&lt;/code&gt; and &lt;code&gt;Delta + Gorilla&lt;/code&gt; and we got:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Gorilla + Delta + LZ4&lt;/code&gt;: 1.89&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Gorilla + Delta + ZSTD&lt;/code&gt;: 1.88&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Delta + Gorilla + LZ4&lt;/code&gt;: 2.08&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Delta + Gorilla + ZSTD&lt;/code&gt;: 3.12&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;Gorilla&lt;/code&gt; and &lt;code&gt;Delta&lt;/code&gt; duplicate each other’s purpose and reasonably give bad results.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Decimal(38, 19)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This format uses 38 digits, with 19 allocated to the fractional part, and is stored as &lt;code&gt;UInt128&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg62wplrlgjein2wc4zo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg62wplrlgjein2wc4zo2.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;GCD&lt;/code&gt; is an interesting codec. It calculates the greatest common divisor for values inside blocks (usually from 64KB to 1MB), then divides each value by it and stores the GCD value and values after divisions. It looks perfect for decimal prices that have implicit or explicit ticks. &lt;code&gt;Decimal(38, 19)&lt;/code&gt; is a 128-bit number which is 2x of &lt;code&gt;Float64&lt;/code&gt;, so 18.0 compression ratio above corresponds to 9.0 for &lt;code&gt;Float64&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Unfortunately, the &lt;code&gt;Delta&lt;/code&gt; codec cannot be applied to 16-byte numbers. It’s possible to specify &lt;code&gt;Delta(8)&lt;/code&gt;, which limits differences to 8 bytes, but if a difference overflows there will be no error and result values will be meaningless.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;UInt64&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Exchanges’ specifications for markets provide minimal price ticks. We could just store price as an integer number of minimal ticks. However, we later discovered price ticks can change from time to time. Also, Bitfinex and Hyperliquid do not use a minimal tick concept. Instead, they specify a number of significant digits for a price, making minimal ticks become dynamic.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Tuple(UInt64, Int8)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;A pair of &lt;code&gt;(x, p)&lt;/code&gt; so that an initial price is represented as &lt;code&gt;x * 10 ^ p&lt;/code&gt;. &lt;code&gt;p&lt;/code&gt; is chosen to minimize the number of non-zero digits.&lt;/p&gt;

&lt;p&gt;ClickHouse does not allow specifying individual codecs for tuple components. Tuple components are stored as separate streams and compressed individually. The precision component has very low cardinality, so codecs have minimal impact because their values can be easily dictionary encoded by both &lt;code&gt;LZ4&lt;/code&gt; and &lt;code&gt;ZSTD&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi41kczq91xa6pb0thhfp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi41kczq91xa6pb0thhfp.png" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Winner
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;Float64&lt;/code&gt; takes 8 bytes, &lt;code&gt;Decimal(38, 19)&lt;/code&gt; - 16, and &lt;code&gt;Tuple(UInt64, Int8)&lt;/code&gt; - 9. Let’s compare storage sizes in bytes of the best results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt4u0pcqvye5bnl54rxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flt4u0pcqvye5bnl54rxj.png" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that &lt;code&gt;Tuple(UInt64, Int8)&lt;/code&gt; improves a bit over &lt;code&gt;Decimal(38, 19)&lt;/code&gt;. However, it’s quite good; probably it could benefit from the &lt;code&gt;Delta&lt;/code&gt; codec if &lt;code&gt;Delta(16)&lt;/code&gt; were possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  Trade Sizes
&lt;/h2&gt;

&lt;p&gt;Trade sizes are similar to prices. Most exchanges provide a minimal tick size that can change from time to time. Bitfinex has fixed precision of 8 decimal places. Hyperliquid relies on the same maximum significant digits concept as for prices.&lt;/p&gt;

&lt;p&gt;However, trade sizes differ in their expected values. Prices evolve over time: one trade ticks up, another trade ticks down, and can sit in a specific range for some time. Sizes, on the other hand, appear to be more random. Let’s compare distributions of prices and sizes for Binance Perpetual Futures ETH/USDT trades on May 19, 2021. The Y-axis is the number of trades.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r1sr9f6wpe2gxfmotyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6r1sr9f6wpe2gxfmotyy.png" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And let’s take a look at a more calm day like yesterday, April 19, 2025.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcm8q3lkbtxmg6jq3i3k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcm8q3lkbtxmg6jq3i3k.png" width="800" height="342"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Unfortunately, compressing sizes is going to be more difficult. We already know that &lt;code&gt;Float64&lt;/code&gt; isn’t that compressible, so let’s explore &lt;code&gt;Decimal(38, 19)&lt;/code&gt; and &lt;code&gt;(Int64, Int8)&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Decimal(38, 19)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9uxr87zzqge5zagkfnl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy9uxr87zzqge5zagkfnl.png" width="800" height="205"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s surprising that &lt;code&gt;GCD&lt;/code&gt; almost doesn’t improve compression. Apparently, the reason is that we use signed sizes, where positive values indicate buys and negative - sells. &lt;code&gt;GCD&lt;/code&gt; doesn’t work when integers have mixed signs. So, we switched to a tuple &lt;code&gt;Tuple(Decimal(38, 19), Bool)&lt;/code&gt; and retested.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qk782558vojni2t9aj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4qk782558vojni2t9aj.png" width="800" height="114"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s much better!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;Tuple(Int64, Int8)&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54d0kujen9ckj56ehw7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F54d0kujen9ckj56ehw7c.png" width="800" height="297"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Delta&lt;/code&gt; worsens results, which is usually the case when series data are random. &lt;code&gt;T64&lt;/code&gt; improves results a bit.&lt;/p&gt;

&lt;h3&gt;
  
  
  Winner
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcprd2rn6s9cgquzwxwoa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcprd2rn6s9cgquzwxwoa.png" width="800" height="160"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We think that &lt;code&gt;Decimal + GCD&lt;/code&gt; performed better because it was able to find better common divisors than our code that converts decimals into &lt;code&gt;(x, p)&lt;/code&gt; pairs.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;ZSTD&lt;/code&gt; provides better compression than &lt;code&gt;LZ4&lt;/code&gt;. &lt;code&gt;ZSTD&lt;/code&gt; alone, without specialized codecs, provides good results.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Delta&lt;/code&gt; is essential for monotonic values, such as ordered but random timestamps. &lt;code&gt;DoubleDelta&lt;/code&gt; is perfect for sequential IDs.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;T64&lt;/code&gt; provides a slight improvement for values with limited range&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;Float64&lt;/code&gt; is not good for decimal values. Instead, we can use a larger type like &lt;code&gt;Decimal(38, 19)&lt;/code&gt; with &lt;code&gt;GCD&lt;/code&gt;, ensuring that prices and sizes are accurately represented without concern for type constraints.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These strategies helped us to significantly reduce storage size for MarketLens from 5.91 TiB to 897.97 GiB.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>cryptocurrency</category>
      <category>database</category>
    </item>
  </channel>
</rss>
