DEV Community

kodomonocch1

Making JSON Compression Searchable — SEE (Schema-Aware Encoding)

The Problem: Cloud Cost Isn’t Just Storage

Compression is easy — until you need it searchable.

Traditional codecs like gzip and Zstd reduce storage size,
but they do nothing for query-time I/O and CPU cost.

Every query still triggers:

→ decompress → parse → filter → aggregate

If your data is JSON or NDJSON, that pipeline dominates your bill.
That’s what we call the hidden cloud tax — the cost of moving and re-reading your own data.
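That full-scan pipeline can be sketched in a few lines of Python (a hypothetical gzip + NDJSON example; the data and field names are illustrative):

```python
import gzip
import io
import json

# Stand-in for an object-store download: a gzip-compressed NDJSON blob.
records = [{"user": i % 3, "bytes": i * 10} for i in range(1000)]
blob = gzip.compress("\n".join(json.dumps(r) for r in records).encode())

# Every query pays the full pipeline, even when only a few rows match:
total = 0
with gzip.open(io.BytesIO(blob), "rt") as f:  # 1. decompress
    for line in f:
        rec = json.loads(line)                # 2. parse
        if rec["user"] == 1:                  # 3. filter
            total += rec["bytes"]             # 4. aggregate

print(total)  # 1661670
```

Note that every byte of the blob is decompressed and parsed to answer a query that touches a third of the rows.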

The Breakthrough: Schema-Aware Compression

SEE (Schema-Aware Encoding) is a new type of codec that keeps JSON searchable while compressed.

It doesn’t just shrink bytes — it understands the structure.

Core idea:
Structure × Δ (delta) × Zstd + Bloom filters + PageDir mini-index

That means:

You can skip 99% of irrelevant data

Lookup latency ≈ 0.18 ms (p50)

Combined size ≈ 19.5% of raw

100% reproducible from the demo
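
The skip behavior comes from per-page Bloom filters: a query probes each page's filter and only decompresses pages that might contain the key. A minimal sketch of the idea (illustrative only, not SEE's actual data layout):

```python
import hashlib

class PageBloom:
    """Tiny Bloom filter attached to one compressed page (toy parameters)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.bitmap = 0

    def _positions(self, key):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.bitmap |= 1 << p

    def might_contain(self, key):
        # No false negatives; rare false positives.
        return all((self.bitmap >> p) & 1 for p in self._positions(key))

# Index two pages of keys; a lookup only decompresses pages that might match.
pages = [["alice", "bob"], ["carol", "dave"]]
blooms = []
for keys in pages:
    b = PageBloom()
    for k in keys:
        b.add(k)
    blooms.append(b)

to_scan = [i for i, b in enumerate(blooms) if b.might_contain("carol")]
print(to_scan)  # page 0 is skipped with high probability
```

With realistic filter sizing, almost every non-matching page is skipped before any decompression happens, which is where the 0.99 skip rate comes from.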

Architecture in One Picture

👉 SpeakerDeck Slides

SEE vs Zstd:

Metric            SEE     Zstd
----------------  ------  -----
Combined ratio    0.194   0.137
Lookup p50 (ms)   0.18    n/a
Skip rate         0.99    0

SEE trades 5–10% of size for 90% fewer I/O ops.
At cloud scale, that’s not optimization — that’s an economic correction.

Quick Demo (10 minutes)

No build needed. Works on Windows, macOS, or Linux.

pip install see_proto
python samples/quick_demo.py

Outputs:

ratio_see[str] = 0.169
ratio_see[combined] = 0.194
skip_present = 0.99
skip_absent = 0.992
lookup_p50 = 0.18 ms

You’ll get the same metrics as the public benchmark.

Economic Impact

At $0.05/GB egress and 100 EB/month traffic:

Savings = $7.2 B/year

Payback = < 4 days

ROI ≈ 11,000%
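
As a sanity check on the baseline those savings are measured against, the article's own assumptions ($0.05/GB, 100 EB/month) work out as follows:

```python
# Back-of-envelope check of the baseline, using the article's own assumptions.
GB_PER_EB = 1e9
price_per_gb = 0.05         # $/GB egress
traffic_eb_per_month = 100  # EB/month

yearly_egress = traffic_eb_per_month * GB_PER_EB * price_per_gb * 12
print(f"${yearly_egress / 1e9:.0f}B/year baseline egress spend")  # $60B/year

claimed_savings = 7.2e9     # $/year, as stated above
print(f"{claimed_savings / yearly_egress:.0%} of baseline")       # 12% of baseline
```

So the claimed $7.2B/year corresponds to shaving about 12% off a $60B/year baseline egress bill.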

Whoever controls SEE, controls cloud economics.

How It’s Built

Core implementation in Rust with a Zstd dictionary backend.
Python bindings (via maturin) make the demo fully reproducible.

The schema-aware layer applies:

Delta + ZigZag integer encoding

Shared dictionaries for string reuse

PageDir and mini-index for random access

Bloom filters for skip prediction
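
Delta + ZigZag is a standard trick for numeric columns: deltas between neighboring values are small, and ZigZag maps negative deltas to small unsigned integers that downstream varint or Zstd stages compress well. A sketch of that first layer (illustrative, not SEE's internal Rust code):

```python
def zigzag_encode(n: int) -> int:
    """Map signed to unsigned: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ..."""
    return (n << 1) ^ (n >> 63)

def zigzag_decode(z: int) -> int:
    return (z >> 1) ^ -(z & 1)

def delta_zigzag(values):
    """Delta-encode a numeric column, then ZigZag the signed deltas."""
    out, prev = [], 0
    for v in values:
        out.append(zigzag_encode(v - prev))
        prev = v
    return out

# Timestamps are nearly sorted, so everything after the first delta is tiny.
timestamps = [1700000000, 1700000005, 1700000004, 1700000010]
encoded = delta_zigzag(timestamps)
print(encoded)  # [3400000000, 10, 1, 12]
```

The small, repetitive values after the first entry are exactly what a shared Zstd dictionary eats for breakfast.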

Each .see file includes a compact metadata header so partial decoding is possible.
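
To see why a header plus page directory enables partial decoding, consider a toy container format (a hypothetical layout using zlib as a stand-in for Zstd; the real .see format differs):

```python
import json
import struct
import zlib

# Toy layout: [u32 page_count][u64 offset per page][compressed pages...]
def write_container(pages):
    bodies = [zlib.compress("\n".join(json.dumps(r) for r in page).encode())
              for page in pages]
    header = struct.pack("<I", len(bodies))
    pos = len(header) + 8 * len(bodies)  # payload starts after the directory
    offsets = []
    for b in bodies:
        offsets.append(pos)
        pos += len(b)
    return header + b"".join(struct.pack("<Q", o) for o in offsets) + b"".join(bodies)

def read_page(buf, idx):
    """Random access: consult the directory, decompress only one page."""
    (count,) = struct.unpack_from("<I", buf, 0)
    start = struct.unpack_from("<Q", buf, 4 + 8 * idx)[0]
    end = struct.unpack_from("<Q", buf, 4 + 8 * (idx + 1))[0] if idx + 1 < count else len(buf)
    return [json.loads(line) for line in zlib.decompress(buf[start:end]).decode().splitlines()]

buf = write_container([[{"id": 1}], [{"id": 2}], [{"id": 3}]])
print(read_page(buf, 1))  # only page 1 is decompressed -> [{'id': 2}]
```

Combined with the Bloom filters above deciding *which* pages to read, this is what turns a full scan into a handful of small range reads.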

Try It Yourself

👉 GitHub: (https://github.com/kodomonocch1/see_proto)

👉 Slides (SpeakerDeck): (https://speakerdeck.com/tetsu05/see-the-hidden-cloud-tax-breaker-schema-aware-compression-beyond-zstd)

👉 Deep dive article (Medium): (https://medium.com/@tetsutetsu11/the-hidden-cloud-tax-and-the-schema-aware-revolution-46b5038c57b8)
If you’ve used Parquet, Zstd, or Arrow, SEE fits right between them,
tuned for JSON-first workloads.

Closing Thoughts

SEE isn’t just a faster codec.
It’s a new layer of data efficiency for the cloud economy —
one that turns compression from a technical optimization into a financial advantage.

From Bytes to Balance Sheets.

PS: Discussion

If you’ve tested SEE on your own dataset (logs, telemetry, NDJSON),
share your results — we’re tracking performance across real workloads.
