The Problem: Cloud Cost Isn’t Just Storage
Compression is easy — until you need it searchable.
Traditional codecs like gzip and Zstd reduce storage size,
but they do nothing for the I/O and CPU cost of querying: the data still has to be read and decompressed before it can be searched.
Every query still triggers:
→ decompress → parse → filter → aggregate
If your data is JSON or NDJSON, that pipeline dominates your bill.
That’s what we call the hidden cloud tax — the cost of moving and re-reading your own data.
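To make that pipeline concrete, here is a minimal sketch using only the Python standard library. The record shape, field names, and the `query_count` helper are made up for illustration; the point is only that a selective query still decompresses and parses everything.

```python
import gzip
import json

def make_blob(records):
    """Serialize records as NDJSON and gzip them into one opaque blob."""
    ndjson = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return gzip.compress(ndjson)

def query_count(blob, field, value):
    """Count matching records; also report how much work the query did."""
    raw = gzip.decompress(blob)          # decompress: the whole blob, every time
    matches = parsed = 0
    for line in raw.splitlines():
        rec = json.loads(line)           # parse: every record
        parsed += 1
        if rec.get(field) == value:      # filter
            matches += 1                 # aggregate (a simple count)
    return matches, len(raw), parsed

# Hypothetical dataset: 1 record in 100 is a "click".
records = [{"user_id": i, "event": "click" if i % 100 == 0 else "view"}
           for i in range(10_000)]
blob = make_blob(records)
matches, decompressed_bytes, parsed = query_count(blob, "event", "click")
print(f"matches={matches} parsed={parsed} decompressed_bytes={decompressed_bytes}")
```

Only 1% of the records match, yet all of them are decompressed and parsed. That gap is the cost the rest of this post is about.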
The Breakthrough: Schema-Aware Compression
SEE (Semantic Entropy Encoding) is a new type of codec that keeps JSON searchable while compressed.
It doesn’t just shrink bytes — it understands the structure.
Core idea:
Structure × Δ (delta) × Zstd + Bloom filters + PageDir mini-index
That means:
You can skip 99% of irrelevant data
Lookup latency ≈ 0.18 ms (p50)
Combined size ≈ 19.4% of raw
100% reproducible from the demo
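The skipping part of the idea can be sketched with a per-page Bloom filter. This is not SEE's actual implementation: the `BloomFilter` class, the filter size, the hash choice, and the page layout below are all assumptions chosen to keep the example small.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(f"{seed}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# Hypothetical layout: 100 pages of 100 user IDs each, one filter per page.
pages = [list(range(p * 100, (p + 1) * 100)) for p in range(100)]
filters = []
for page in pages:
    bf = BloomFilter()
    for user_id in page:
        bf.add(user_id)
    filters.append(bf)

# A lookup consults the small filters first and only reads pages that might match.
wanted = 4242
candidates = [i for i, bf in enumerate(filters) if bf.might_contain(wanted)]
skip_rate = 1 - len(candidates) / len(pages)
print(f"pages to read: {candidates}, skipped: {skip_rate:.0%}")
```

The page that really holds the key is always among the candidates. A few extra pages may show up as false positives, which is the price of keeping the filters small.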
Architecture in One Picture
👉 SpeakerDeck Slides
SEE vs Zstd:
| Metric          | SEE   | Zstd  |
|-----------------|-------|-------|
| Combined ratio  | 0.194 | 0.137 |
| Lookup p50 (ms) | 0.18  | n/a   |
| Skip rate       | 0.99  | 0     |
SEE trades about 6 percentage points of compression ratio (0.194 vs 0.137 of raw size) for 90% fewer I/O ops.
At cloud scale, that’s not optimization — that’s an economic correction.
Quick Demo (10 minutes)
No build needed. Works on Windows, macOS, or Linux.
pip install see_proto
python samples/quick_demo.py
Outputs:
ratio_see[str] = 0.169
ratio_see[combined] = 0.194
skip_present = 0.99
skip_absent = 0.992
lookup_p50 = 0.18 ms
You’ll get the same metrics as the public benchmark.
Economic Impact
At $0.05/GB egress and 100 EB/month traffic:
Savings = $7.2 B/year
Payback = < 4 days
ROI ≈ 11,000%
Whoever controls SEE controls cloud economics.
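For readers who want to check the arithmetic, the sketch below computes the egress baseline from the two stated inputs and back-solves the savings fraction that the $7.2 B figure implies. The fraction itself is not stated above, so treat it as a derived number rather than a claim from the benchmark. Decimal units (1 EB = 10^9 GB) are assumed.

```python
GB_PER_EB = 10**9                     # decimal units: 1 EB = 10^18 B, 1 GB = 10^9 B
egress_usd_per_gb = 0.05              # stated above
traffic_eb_per_month = 100            # stated above
claimed_savings_usd_per_year = 7.2e9  # stated above

# Baseline: what the stated traffic costs per year at the stated egress price.
baseline_usd_per_year = traffic_eb_per_month * GB_PER_EB * egress_usd_per_gb * 12

# Back-solved: the share of that spend the claimed savings would represent.
implied_fraction = claimed_savings_usd_per_year / baseline_usd_per_year

print(f"baseline = ${baseline_usd_per_year / 1e9:.1f} B/year")
print(f"implied savings fraction = {implied_fraction:.0%}")
```

Under these assumptions the baseline is $60 B/year, so $7.2 B/year corresponds to saving 12% of egress spend.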
How It’s Built
Core implementation in Rust with a Zstd dictionary backend.
Python bindings (via maturin) make the demo fully reproducible.
The schema-aware layer applies:
Delta + ZigZag integer encoding
Shared dictionaries for string reuse
PageDir and mini-index for random access
Bloom filters for skip prediction
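As an illustration of the first item, here is a minimal sketch of delta + ZigZag encoding followed by a LEB128-style varint. The function names and the sample timestamps are made up, and the real Rust implementation may differ in detail.

```python
def zigzag_encode(n):
    """Map signed to unsigned so small magnitudes stay small: 0,-1,1,-2 -> 0,1,2,3."""
    return 2 * n if n >= 0 else -2 * n - 1

def zigzag_decode(z):
    return (z >> 1) if z % 2 == 0 else -((z + 1) >> 1)

def varint(n):
    """LEB128-style variable-length encoding of an unsigned int."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def delta_zigzag(values):
    """Store each value as the ZigZag-encoded difference from its predecessor."""
    prev, out = 0, []
    for v in values:
        out.append(zigzag_encode(v - prev))
        prev = v
    return out

def undo_delta_zigzag(encoded):
    prev, out = 0, []
    for z in encoded:
        prev += zigzag_decode(z)
        out.append(prev)
    return out

# Hypothetical column of near-monotonic timestamps (one arrives out of order).
timestamps = [1700000000, 1700000003, 1700000001, 1700000010]
encoded = delta_zigzag(timestamps)
plain_bytes = sum(len(varint(t)) for t in timestamps)
packed_bytes = sum(len(varint(z)) for z in encoded)
print(encoded, plain_bytes, packed_bytes)
```

In this example the four timestamps take 20 bytes as plain varints and 8 bytes after delta + ZigZag, before any general-purpose compressor runs. ZigZag matters because the out-of-order value produces a negative delta, which would otherwise encode as a very large unsigned number.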
Each .see file includes a compact metadata header so partial decoding is possible.
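A page directory of this kind can be sketched in a few lines. The layout below (offset, length, first and last ID per page) is hypothetical and is not the actual .see header format. It also uses `zlib` from the standard library as a stand-in for Zstd and assumes records are sorted by `id`.

```python
import json
import zlib

def write_pages(records, page_size=100):
    """Compress records page by page; return (payload, directory)."""
    payload = bytearray()
    directory = []  # one entry per page: (offset, length, first_id, last_id)
    for start in range(0, len(records), page_size):
        page = records[start:start + page_size]
        blob = zlib.compress(json.dumps(page).encode("utf-8"))
        directory.append((len(payload), len(blob), page[0]["id"], page[-1]["id"]))
        payload += blob
    return bytes(payload), directory

def lookup(payload, directory, wanted_id):
    """Decompress only the page whose ID range covers wanted_id."""
    for offset, length, first_id, last_id in directory:
        if first_id <= wanted_id <= last_id:
            page = json.loads(zlib.decompress(payload[offset:offset + length]))
            record = next((r for r in page if r["id"] == wanted_id), None)
            return record, length
    return None, 0

# Hypothetical dataset: 1,000 records split into 10 independently compressed pages.
records = [{"id": i, "msg": f"event {i}"} for i in range(1000)]
payload, directory = write_pages(records)
record, bytes_read = lookup(payload, directory, 742)
print(record, f"read {bytes_read} of {len(payload)} compressed bytes")
```

Because each page is compressed independently, a lookup touches one page's bytes instead of the whole file. The trade-off is a somewhat worse compression ratio than compressing everything as a single stream.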
Try It Yourself
👉 GitHub: https://github.com/kodomonocch1/see_proto
👉 Slides (SpeakerDeck): https://speakerdeck.com/tetsu05/see-the-hidden-cloud-tax-breaker-schema-aware-compression-beyond-zstd
👉 Deep dive article (Medium): https://medium.com/@tetsutetsu11/the-hidden-cloud-tax-and-the-schema-aware-revolution-46b5038c57b8
If you’ve used Parquet, Zstd, or Arrow, SEE fits right between them,
but it is tuned for JSON-first workloads.
Closing Thoughts
SEE isn’t just a faster codec.
It’s a new layer of data efficiency for the cloud economy —
one that turns compression from a technical optimization into a financial advantage.
From Bytes to Balance Sheets.
PS: Discussion
If you’ve tested SEE on your own dataset (logs, telemetry, NDJSON),
share your results — we’re tracking performance across real workloads.