At AlphaBots, we run an algorithmic trading platform that processes live market data across Indian equity and derivatives markets. We capture 1-second snapshots and full tick data across Nifty, BankNifty, and equity instruments. It adds up fast: gigabytes of new data every trading day, compounding month after month.
We store this data for backtesting, strategy validation, and compliance. After a few months of live operation, the storage bill started hurting. Loading large Parquet files for backtesting runs was slow — we were spending more time moving data around than actually running strategies.
We tried Parquet's built-in ZSTD. It helped, but not enough. So we built our own compression engine. Here's what we learned.
The Insight: Tick Data Has Exploitable Structure
Financial tick data is not random. It has properties general-purpose compressors ignore:
Prices move in tiny increments. A Nifty futures price might go 22,450.25 → 22,450.50 → 22,450.25. The raw float64 values look different. But the differences — +0.25, -0.25 — are tiny and repetitive. Store differences instead of raw values and the data collapses dramatically.
Columns are homogeneous. All prices are floats in a similar range. All timestamps are sequential. Columnar storage exploits this — you compress each column independently, so the compressor sees 8 million prices together, not interleaved with volumes and symbols.
Data is written once and read rarely. Tick archives are almost never updated after writing. Compression is a one-time cost, so we can afford to spend more CPU on it; decompression happens only a handful of times per dataset, so ratio matters more than speed.
These three properties together suggested a pipeline general-purpose tools weren't exploiting.
The Pipeline: Four Steps Before ZSTD Sees Anything
We built TSC as a Rust-native engine. Here's the pipeline:
Step 1 — Columnar layout. Split the dataset into individual columns. Process each independently. The compressor sees homogeneous data — all prices together, all timestamps together.
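To make that concrete, here's a minimal pandas/NumPy sketch of the columnar split. The column names and data are made up for illustration; the real engine does this on Arrow buffers in Rust, not in Python:

```python
import numpy as np
import pandas as pd

# Hypothetical tick frame; the point is that each column is homogeneous.
ticks = pd.DataFrame({
    "ts":    np.arange(1_700_000_000, 1_700_000_005, dtype=np.int64),
    "price": np.array([22450.25, 22450.50, 22450.25, 22450.25, 22450.50]),
    "qty":   np.array([50, 75, 50, 25, 50], dtype=np.int32),
})

# Columnar layout: each column becomes one contiguous buffer of a single type,
# so the later stages see all prices together, all timestamps together, and so on.
columns = {name: ticks[name].to_numpy() for name in ticks.columns}
for name, values in columns.items():
    print(name, values.dtype, values.nbytes, "bytes")
```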
Step 2 — Delta encoding. Store the difference between consecutive values instead of raw values. For a price column: 22450.25 (baseline), +0.25, -0.25. For timestamps with 1-second resolution: differences are often literally 1. They compress to almost nothing.
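A NumPy sketch of the transform is below. The 0.05 tick size and the scale-to-integer step are assumptions made so the deltas are exact integers; the engine itself does this in Rust:

```python
import numpy as np

# Prices quoted in ticks of 0.05: scale to integers first so deltas are exact.
prices = np.array([22450.25, 22450.50, 22450.25, 22450.30], dtype=np.float64)
ticks = np.round(prices / 0.05).astype(np.int64)       # 449005, 449010, 449005, 449006

# Delta encoding: keep the first value as a baseline, store differences after that.
baseline = ticks[0]
deltas = np.diff(ticks)                                 # [ 5, -5,  1]

# Timestamps at 1-second resolution delta down to a run of 1s.
ts = np.array([1_700_000_000, 1_700_000_001, 1_700_000_002], dtype=np.int64)
print(np.diff(ts))                                      # [1 1]

# Decoding is the inverse: cumulative sum from the baseline.
restored = np.concatenate(([baseline], baseline + np.cumsum(deltas)))
assert np.array_equal(restored, ticks)
```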
Step 3 — Bit-packing. After delta encoding, each value fits in far fewer bits. Small deltas that fit in 8 bits get stored in 8 bits, not 64.
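A common way to do this is zigzag encoding followed by fixed-width packing; whether TSC uses exactly this scheme is an assumption on our part, but the sketch shows why post-delta values need so few bits:

```python
import numpy as np

deltas = np.array([5, -5, 1, 0, 2, -1], dtype=np.int64)   # post-delta values

# Zigzag-encode so small negatives also become small unsigned integers:
# 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
zigzag = (deltas << 1) ^ (deltas >> 63)
zigzag = zigzag.astype(np.uint64)                          # [10, 9, 2, 0, 4, 1]

# Width needed for the largest value in this block: 4 bits here instead of 64.
bits = max(1, int(zigzag.max()).bit_length())
print(bits)                                                # 4

# A simple (unoptimized) packer: concatenate the low `bits` bits of each value.
packed = 0
for i, v in enumerate(zigzag.tolist()):
    packed |= v << (i * bits)
packed_bytes = packed.to_bytes((len(zigzag) * bits + 7) // 8, "little")
print(len(packed_bytes), "bytes instead of", deltas.nbytes)
```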
Step 4 — ZSTD as the final pass. Only now does ZSTD see the data — working on already-small packed integers, not raw floats. This is the key insight: ZSTD on pre-processed data significantly outperforms ZSTD on raw data. The pre-processing is what beats Parquet's built-in compression.
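To see why the ordering matters, here's a small self-contained experiment. The synthetic random-walk prices, the `zstandard` Python package, and the hand-rolled delta step are all ours, not part of TSC; on data like this, ZSTD over the delta-encoded integers comes out far smaller than ZSTD over the raw float64 buffer:

```python
import numpy as np
import zstandard as zstd   # pip install zstandard

# Synthetic price walk: one million ticks moving in 0.05 increments.
rng = np.random.default_rng(0)
steps = rng.integers(-2, 3, size=1_000_000)
prices = 22450.0 + 0.05 * np.cumsum(steps)

cctx = zstd.ZstdCompressor(level=19)

# ZSTD on the raw float64 buffer.
raw = prices.tobytes()

# ZSTD on the same data after scaling to integer ticks and delta encoding.
ticks = np.round(prices / 0.05).astype(np.int64)
deltas = np.diff(ticks, prepend=ticks[0]).astype(np.int8)  # every delta fits in one byte here

print("raw floats :", len(raw), "->", len(cctx.compress(raw)))
print("delta ints :", len(deltas.tobytes()), "->", len(cctx.compress(deltas.tobytes())))
```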
The pipeline processes data in fixed-size chunks, so RAM usage stays constant regardless of input size.
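We can't show the engine's internals, but the chunked pattern looks roughly like the sketch below. It uses pyarrow's batch reader and the `zstandard` package; the function, the length-prefixed framing, and the integer-column assumption (e.g. epoch timestamps) are ours, not TSC's:

```python
import numpy as np
import pyarrow.parquet as pq
import zstandard as zstd

def compress_int_column(parquet_path: str, column: str, out_path: str,
                        chunk_rows: int = 1_000_000) -> None:
    """Stream one integer column through delta + ZSTD in fixed-size chunks,
    so peak RAM is bounded by a single chunk rather than the whole file."""
    cctx = zstd.ZstdCompressor(level=19)
    prev = np.int64(0)   # last value of the previous chunk, carried across the boundary
    with open(out_path, "wb") as out:
        reader = pq.ParquetFile(parquet_path)
        for batch in reader.iter_batches(batch_size=chunk_rows, columns=[column]):
            values = batch.column(0).to_numpy(zero_copy_only=False).astype(np.int64)
            deltas = np.diff(values, prepend=prev)   # first delta of the file is the baseline
            prev = values[-1]
            block = cctx.compress(deltas.tobytes())
            out.write(len(block).to_bytes(4, "little"))   # simple length-prefixed framing
            out.write(block)
```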
Results
Every test was 100% lossless: every row and column verified after a full compress/decompress round-trip.
| Dataset | Rows | Input size | TSC size | vs Parquet + ZSTD |
|---|---|---|---|---|
| Nifty historical | ~15M | 63.5 GB | 5.5 GB | 91.6% smaller |
| EQY US ALL BBO | 8.8M | 118.92 MB | 30.09 MB | 74.7% smaller |
| Options Greeks | 1M | baseline | — | 66.6% smaller |
Compared against gzip on the 63.5 GB dataset: TSC produced 5.5 GB vs gzip's 7.5 GB — 27% smaller than gzip, with under 7 GB RAM throughout.
For AlphaBots, this translated into direct storage cost reduction. Months of tick data now fits in a fraction of its previous space. Backtesting loads faster.
Honest Trade-offs
TSC is not a Parquet replacement. Stick with Parquet for:
- Random access queries — TSC decompresses full chunks, not individual rows. Point queries are slower than Parquet.
- Fast writes — TSC's pipeline takes more time to compress than Parquet. Deliberate trade-off for better archival ratio.
- Mixed-type / sparse data — Delta encoding doesn't help strings or sparse columns. Gains are minimal on wide tables with lots of non-numeric data.

The sweet spot: dense numeric time-series, written once and read in batch. Financial tick data. IoT sensor telemetry. Metrics archives.
Using It
Built in Rust with Python bindings via PyO3. Zero-copy Arrow/Polars/Pandas integration. Pre-built wheels for Linux and Windows (Python 3.11 and 3.12).
```python
import pandas as pd
import tsc

# Compress
df = pd.read_parquet("tick_data.parquet")
payload = tsc.compress(df, mode="balanced", sort_key="auto")

# Decompress
restored = tsc.decompress(payload)
```
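Since the format is meant to be lossless, it's cheap to verify the round-trip on your own data before trusting the archive. Assuming `decompress` hands back a DataFrame like the one that went in, as in the snippet above:

```python
import pandas as pd

# Bit-exact round-trip check; raises with a detailed diff if anything changed.
# (If sort_key reordered rows, sort both frames the same way before comparing.)
pd.testing.assert_frame_equal(df, restored)
```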
For Parquet/CSV/DuckDB file workflows:
```python
from alphabots_tsc_wrapper import TSCompressor, TSDecompressor

TSCompressor(profile="balanced").compress_file("data.parquet", "data.tsc")
df = TSDecompressor().decompress_polars("data.tsc")
```
Try it on your own data — no install needed:
Upload a Parquet or CSV file (up to 200 MB) and see the compression ratio on your actual data in about two minutes.
Pre-built wheels + docs:
👉 GitHub — adminalphabots/alphabots-tsc-engine
What's Next
We built TSC for our own use at AlphaBots. The benchmarks are strong enough that we think it has broader applicability — particularly for platforms storing large volumes of financial or IoT time-series.
We're exploring commercial licensing and IP transfer. If you're working on a TSDB, market data platform, or storage infrastructure where compression ratio matters, reach out: parth.k@alphabots.in
TSC is free for evaluation and non-commercial use. Commercial licensing: parth.k@alphabots.in