Searchable JSON compression with page-level random access (and smaller than Zstd on our dataset)
Most JSON compression stories end at “make it smaller.”
But in real systems, the bigger cost is often decompress + parse + scan — repeatedly.
I built SEE (Semantic Entropy Encoding): a searchable compression format for JSON/NDJSON that keeps data queryable while compressed, with page-level random access.
On our dataset, SEE is smaller than Zstd and supports fast lookups (details + proof below).
Why this matters: the hidden “decompress+parse tax”
If you store NDJSON compressed with zstd, most queries still pay to:
- read large chunks
- decompress everything
- parse JSON
- scan for the field/value you need
Even if the data is small, the CPU + I/O pattern is brutal at scale.
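To make the tax concrete, here is a minimal sketch of that naive query path: decompress everything, parse every line, scan for one field. (This is illustrative code, not SEE; zlib stands in for zstd so the example stays stdlib-only.)

```python
import json
import zlib

def scan_ndjson_compressed(blob: bytes, field: str, value) -> bool:
    """The naive path: full decompress + full parse + full scan,
    even when the answer lives in a single record."""
    text = zlib.decompress(blob).decode("utf-8")   # decompress everything
    for line in text.splitlines():                 # parse every record
        if json.loads(line).get(field) == value:   # scan for one field
            return True
    return False

# A tiny compressed NDJSON blob to query.
records = [{"id": i, "user": f"u{i}"} for i in range(1000)]
blob = zlib.compress("\n".join(json.dumps(r) for r in records).encode())

print(scan_ndjson_compressed(blob, "user", "u999"))  # touches all 1000 records
```

Every lookup repeats all three steps, which is exactly the CPU + I/O pattern SEE is built to avoid.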
SEE targets workloads where you repeatedly need:
- exists / pos / eq-style queries
- random access
- low latency without full decompression
What SEE is (in 60 seconds)
SEE is a page-based, schema-aware format:
- page-level layout for random access
- Bloom filters + skip indexes to avoid touching irrelevant pages (high skip rate)
- schema-aware encoding (structure + deltas + dictionary where useful)
- designed to reduce both:
  - data tax (storage/egress)
  - CPU tax (decompress/parse)
Trade-off: SEE optimizes for low I/O and low latency, not always absolute minimum size (though it can win on size too, depending on the dataset).
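The page + Bloom idea above can be sketched in a few lines. This is a hypothetical, simplified layout (page size, filter parameters, and the `user` key are all made up for illustration), not SEE's actual format: each page carries a small Bloom filter, and a lookup only decodes pages whose filter might match.

```python
import hashlib

PAGE_SIZE = 256      # records per page (illustrative)
BLOOM_BITS = 2048    # filter size per page (illustrative)
NUM_HASHES = 4

def _bit_positions(key: str):
    """Derive NUM_HASHES bit positions for a key via salted blake2b."""
    for i in range(NUM_HASHES):
        digest = hashlib.blake2b((key + str(i)).encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big") % BLOOM_BITS

def build_pages(records):
    """Split records into pages, each with a Bloom filter over 'user'."""
    pages = []
    for start in range(0, len(records), PAGE_SIZE):
        chunk = records[start:start + PAGE_SIZE]
        bloom = 0
        for r in chunk:
            for b in _bit_positions(str(r["user"])):
                bloom |= 1 << b
        pages.append((bloom, chunk))
    return pages

def exists(pages, user):
    """Check each page's filter; only scan pages that might match."""
    touched = 0
    for bloom, chunk in pages:
        if all((bloom >> b) & 1 for b in _bit_positions(user)):
            touched += 1  # filter says "maybe" -- decode/scan this page
            if any(r["user"] == user for r in chunk):
                return True, touched
    return False, touched

records = [{"id": i, "user": f"u{i}"} for i in range(4096)]
pages = build_pages(records)
found, touched = exists(pages, "u4000")
print(found, touched, len(pages))  # most of the 16 pages are skipped
```

A real format would persist the filters next to compressed page payloads so that skipped pages are never read or decompressed at all; that is where the skip ratio translates directly into saved I/O.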
KPI snapshot (public demo)
These are the numbers we publish from the demo pack:
- Combined size ratio: ≈ 19.5% of raw
- Lookup latency (present): p50 ≈ 0.18 ms / p95 ≈ 0.28 ms / p99 ≈ 0.34 ms
- Skip ratio: present ≈ 0.99 / absent ≈ 0.992
- Bloom density: ≈ 0.30
“Combined” is the total footprint for the SEE artifact on the dataset we benchmarked.
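For readers unfamiliar with the latency KPIs: p50/p95/p99 are rank-based percentiles over per-lookup timings. A minimal nearest-rank sketch (not SEE's measurement code; the sample latencies below are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile: sort, then index by rank."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# Hypothetical per-lookup latencies in milliseconds.
lat_ms = [0.15, 0.17, 0.18, 0.18, 0.19, 0.22, 0.28, 0.30, 0.34, 0.41]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(lat_ms, p):.2f} ms")
```

Tail percentiles (p95/p99) matter more than the mean here, because a lookup that misses every Bloom filter behaves very differently from one that has to decode a page.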
Proof-first distribution (so you can verify without meetings)
I intentionally ship reproducible packs:
1) Demo ZIP (10 minutes)
- prebuilt wheel + sample `.see` artifacts
- demo scripts that print KPIs (ratio/skip/bloom/p50–p99)
- OnePager PDF
2) DD Pack (audit / repro artifacts)
- run summaries + `run_metrics.json`
- verification checklist (`pack_verify.txt`)
- designed for technical diligence
Recent robustness milestone: strict decode-mismatch checks across multiple datasets returned zero mismatches (decode_mismatch_count=0, decode_extended_mismatch_count=0, audit PASS).
Quick start (demo)
```bash
pip install see_proto
python samples/quick_demo.py
```
This prints:
- compression ratio
- skip/bloom
- lookup p50/p95/p99
Links
- GitHub repo: https://github.com/kodomonocch1/see_proto
- Release (v0.1.1): https://github.com/kodomonocch1/see_proto/releases/tag/v0.1.1
If you want formal evaluation under NDA (DD pack / deeper materials):
https://docs.google.com/forms/d/e/1FAIpQLScV2Ti592K3Za2r_WLUd0E6xSvCEVnlEOxYd6OGgbpJm0ADlg/viewform?usp=header
Note: company email is preferred, but DMs are welcome too (no confidential data needed at first contact).
What I’m looking for
SEE is not a SaaS product.
I’m exploring strategic acquisition or an exclusive license with teams that have a clear integration path.
To keep evaluation high-signal, I run only a handful of NDA evals per month.
If you’re on a data platform / infra / storage team and you can point to where this fits, I’d love to hear from you.