Searchable JSON compression with page-level random access (and smaller than Zstd on our dataset)
Most JSON compression stories end at “make it smaller.”
But in real systems, the bigger cost is often decompress + parse + scan — repeatedly.
I built SEE (Semantic Entropy Encoding): a searchable compression format for JSON/NDJSON that keeps data queryable while compressed, with page-level random access.
On our dataset, SEE is smaller than Zstd and supports fast lookups (details + proof below).
Why this matters: the hidden “decompress+parse tax”
If you store NDJSON compressed with zstd, most queries still pay to:
- read large chunks
- decompress everything
- parse JSON
- scan for the field/value you need
Even if the data is small, the CPU + I/O pattern is brutal at scale.
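To make the tax concrete, here is a minimal sketch of that naive query path: decompress everything, parse every line, scan for one field. (This is illustrative code, not SEE; zlib stands in for zstd so the example stays stdlib-only.)

```python
import json
import zlib

def scan_ndjson_compressed(blob: bytes, field: str, value) -> bool:
    """The naive path: full decompress + full parse + full scan,
    even when the answer lives in a single record."""
    text = zlib.decompress(blob).decode("utf-8")   # decompress everything
    for line in text.splitlines():                 # parse every record
        if json.loads(line).get(field) == value:   # scan for one field
            return True
    return False

# A tiny compressed NDJSON blob to query.
records = [{"id": i, "user": f"u{i}"} for i in range(1000)]
blob = zlib.compress("\n".join(json.dumps(r) for r in records).encode())

print(scan_ndjson_compressed(blob, "user", "u999"))  # touches all 1000 records
```

Every lookup repeats all three steps, which is exactly the CPU + I/O pattern SEE is built to avoid.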
SEE targets workloads where you repeatedly need:
- exists / pos / eq-style queries
- random access
- low latency without full decompression
What SEE is (in 60 seconds)
SEE is a page-based, schema-aware format:
- page-level layout for random access
- Bloom filters + skip indexes to avoid touching irrelevant pages (high skip rate)
- schema-aware encoding (structure + deltas + dictionary where useful)
- designed to reduce both:
  - data tax (storage/egress)
  - CPU tax (decompress/parse)
Trade-off: SEE optimizes for low I/O and low latency, not always absolute minimum size (though it can win on size too, depending on the dataset).
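The page + Bloom idea above can be sketched in a few lines. This is a hypothetical, simplified layout (page size, filter parameters, and the `user` key are all made up for illustration), not SEE's actual format: each page carries a small Bloom filter, and a lookup only decodes pages whose filter might match.

```python
import hashlib

PAGE_SIZE = 256      # records per page (illustrative)
BLOOM_BITS = 2048    # filter size per page (illustrative)
NUM_HASHES = 4

def _bit_positions(key: str):
    """Derive NUM_HASHES bit positions for a key via salted blake2b."""
    for i in range(NUM_HASHES):
        digest = hashlib.blake2b((key + str(i)).encode(), digest_size=8).digest()
        yield int.from_bytes(digest, "big") % BLOOM_BITS

def build_pages(records):
    """Split records into pages, each with a Bloom filter over 'user'."""
    pages = []
    for start in range(0, len(records), PAGE_SIZE):
        chunk = records[start:start + PAGE_SIZE]
        bloom = 0
        for r in chunk:
            for b in _bit_positions(str(r["user"])):
                bloom |= 1 << b
        pages.append((bloom, chunk))
    return pages

def exists(pages, user):
    """Check each page's filter; only scan pages that might match."""
    touched = 0
    for bloom, chunk in pages:
        if all((bloom >> b) & 1 for b in _bit_positions(user)):
            touched += 1  # filter says "maybe" -- decode/scan this page
            if any(r["user"] == user for r in chunk):
                return True, touched
    return False, touched

records = [{"id": i, "user": f"u{i}"} for i in range(4096)]
pages = build_pages(records)
found, touched = exists(pages, "u4000")
print(found, touched, len(pages))  # most of the 16 pages are skipped
```

A real format would persist the filters next to compressed page payloads so that skipped pages are never read or decompressed at all; that is where the skip ratio translates directly into saved I/O.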
KPI snapshot (public demo)
These are the numbers we publish from the demo pack:
- Combined size ratio: ≈ 19.5% of raw
- Lookup latency (present): p50 ≈ 0.18 ms / p95 ≈ 0.28 ms / p99 ≈ 0.34 ms
- Skip ratio: present ≈ 0.99 / absent ≈ 0.992
- Bloom density: ≈ 0.30
“Combined” is the total footprint for the SEE artifact on the dataset we benchmarked.
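For readers unfamiliar with the latency KPIs: p50/p95/p99 are rank-based percentiles over per-lookup timings. A minimal nearest-rank sketch (not SEE's measurement code; the sample latencies below are made up):

```python
def percentile(samples, p):
    """Nearest-rank percentile: sort, then index by rank."""
    s = sorted(samples)
    k = max(0, min(len(s) - 1, round(p / 100 * (len(s) - 1))))
    return s[k]

# Hypothetical per-lookup latencies in milliseconds.
lat_ms = [0.15, 0.17, 0.18, 0.18, 0.19, 0.22, 0.28, 0.30, 0.34, 0.41]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(lat_ms, p):.2f} ms")
```

Tail percentiles (p95/p99) matter more than the mean here, because a lookup that misses every Bloom filter behaves very differently from one that has to decode a page.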
Proof-first distribution (so you can verify without meetings)
I intentionally ship reproducible packs:
1) Demo ZIP (10 minutes)
- prebuilt wheel + sample `.see` artifacts
- demo scripts that print KPIs (ratio/skip/bloom/p50–p99)
- OnePager PDF
2) DD Pack (audit / repro artifacts)
- run summaries + `run_metrics.json`
- verification checklist (`pack_verify.txt`)
- designed for technical diligence
Recent robustness milestone: strict decode-mismatch checks across multiple datasets returned zero mismatches (decode_mismatch_count=0, decode_extended_mismatch_count=0, audit PASS).
Quick start (demo)
```bash
pip install see_proto
python samples/quick_demo.py
```
This prints:
- compression ratio
- skip/bloom
- lookup p50/p95/p99
Links
- GitHub repo: https://github.com/kodomonocch1/see_proto
- Release (v0.1.1): https://github.com/kodomonocch1/see_proto/releases/tag/v0.1.1
If you want formal evaluation under NDA (DD pack / deeper materials):
https://docs.google.com/forms/d/e/1FAIpQLScV2Ti592K3Za2r_WLUd0E6xSvCEVnlEOxYd6OGgbpJm0ADlg/viewform?usp=header
Note: company email is preferred, but DMs are welcome too (no confidential data needed at first contact).
What I’m looking for
SEE is not a SaaS product.
I’m exploring strategic acquisition or an exclusive license with teams that have a clear integration path.
To keep evaluation high-signal, I run only a handful of NDA evals per month.
If you’re on a data platform / infra / storage team and you can point to where this fits, I’d love to hear from you.