How I built a 39x compression pipeline with AES-256-GCM in Python (and why the dictionary is everything)

#python #machinelearning #security #dataengineering

I store LLM training data. Every tool I found either compressed it or encrypted it; none did both. So I built QUANTUM-PULSE.

The pipeline

payload → MsgPack → Zstd-L22 + corpus dict → AES-256-GCM → SHA3-256 Merkle

Step 1: MsgPack over JSON

Before compression, serializing with MsgPack instead of JSON shrinks the payload by ~22%:

import msgpack
raw = msgpack.packb(payload, use_bin_type=True)
# 22% smaller than json.dumps().encode() — better input = better downstream ratio
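
The ~22% figure depends on the data, so it is worth measuring on your own records. A minimal sketch, assuming a made-up record (`sample_record` and its fields are hypothetical, not part of QUANTUM-PULSE):

```python
import json

import msgpack

# Hypothetical training record; field names are illustrative only.
sample_record = {
    "id": 12345,
    "prompt": "Translate to French: good morning",
    "completion": "bonjour",
    "tokens": [101, 2023, 2003, 1037, 3231, 102],
    "score": 0.87,
    "is_valid": True,
}

json_bytes = json.dumps(sample_record).encode("utf-8")
packed = msgpack.packb(sample_record, use_bin_type=True)

# Fraction of bytes saved by MsgPack relative to JSON for this one record.
saving = 1 - len(packed) / len(json_bytes)
print(f"json={len(json_bytes)} B  msgpack={len(packed)} B  saving={saving:.1%}")

# Round trip: raw=False decodes MsgPack str values back to Python str.
restored = msgpack.unpackb(packed, raw=False)
```

The saving comes mostly from dropping quotes, separators, and decimal-encoded numbers, so records dominated by long free-text strings will see less benefit.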

Step 2: The dictionary insight

Without a dictionary, Zstd starts every compression call with no knowledge of the data, so it relearns the same keys and structure for each record. For training records sharing the same schema, this is wasted work.

Train once:

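import zstandard as zstd

# corpus_samples: list of bytes objects, one per sample record
dict_data = zstd.train_dictionary(131072, corpus_samples[:200])  # 128 KiB dictionary
cctx = zstd.ZstdCompressor(level=22, dict_data=dict_data)
compressed = cctx.compress(raw)
# Decompression needs the same dictionary: zstd.ZstdDecompressor(dict_data=dict_data)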

Result: 28.46× with dict vs 14.64× vanilla — +94.4% improvement, 29% faster.
The dictionary retrains automatically every 24h via APScheduler as new data arrives.
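
To see why a shared dictionary helps on small records with a common schema, here is a minimal sketch that uses the standard library's zlib preset-dictionary support as a stand-in for Zstd (a swapped-in technique, chosen only so the example runs without extra packages). The record layout and the `make_record` helper are hypothetical, and the sizes it prints will not match the Zstd results above.

```python
import json
import zlib


def make_record(i: int) -> bytes:
    """Hypothetical training record; every record shares the same keys."""
    return json.dumps({
        "prompt": f"Question number {i}",
        "completion": f"Answer number {i}",
        "source": "synthetic",
        "license": "cc-by-4.0",
        "split": "train",
    }).encode("utf-8")


# For zlib the "dictionary" is just representative bytes,
# here a handful of sample records concatenated together.
dictionary = b"".join(make_record(i) for i in range(20))

record = make_record(999)  # a record the dictionary has not seen

# Baseline: compress the record on its own.
plain = zlib.compress(record, 9)

# Same record, but the compressor starts with the dictionary preloaded.
comp = zlib.compressobj(level=9, zdict=dictionary)
with_dict = comp.compress(record) + comp.flush()

# Decompression needs the exact same dictionary.
decomp = zlib.decompressobj(zdict=dictionary)
restored = decomp.decompress(with_dict) + decomp.flush()

print(f"raw={len(record)} B  no dict={len(plain)} B  with dict={len(with_dict)} B")
```

The same constraint applies to Zstd: a blob compressed with one dictionary can only be read back with that dictionary, so each retrained dictionary has to be kept for as long as blobs sealed with it exist.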

Step 3: AES-256-GCM with per-blob HKDF keys

One passphrase → master key → unique key per blob:

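import os

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# master_key: key material derived from the passphrase
# blob_salt, pulse_id: per-blob values defined elsewhere
# Unique key per blob: one compromised blob key reveals nothing about the others
blob_key = HKDF(
    algorithm=hashes.SHA256(), length=32,
    salt=blob_salt, info=pulse_id.encode()
).derive(master_key)

nonce = os.urandom(12)  # 96-bit nonce, fresh per seal; needed again to decrypt
ciphertext = AESGCM(blob_key).encrypt(nonce, compressed, None)  # 16-byte GCM tag is appended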
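
The passphrase → master key step is not shown above. A minimal sketch using the standard library's PBKDF2-HMAC-SHA256 (the KDF named in the limitations section); the `derive_master_key` helper, the iteration count, and the salt handling are assumptions for illustration, not QUANTUM-PULSE's actual parameters:

```python
import hashlib
import os

# Assumed work factor for illustration; tune it to your own latency budget.
PBKDF2_ITERATIONS = 600_000


def derive_master_key(passphrase: str, salt: bytes) -> bytes:
    """Stretch a passphrase into a 32-byte master key (hypothetical helper)."""
    return hashlib.pbkdf2_hmac(
        "sha256",
        passphrase.encode("utf-8"),
        salt,
        PBKDF2_ITERATIONS,
        dklen=32,
    )


# The salt is not secret, but it must be stored so the key can be re-derived.
kdf_salt = os.urandom(16)
master_key = derive_master_key("correct horse battery staple", kdf_salt)
```

The slow, passphrase-stretching step runs once; HKDF is then a cheap way to fan that single master key out into one key per blob.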

Step 4: SHA3-256 Merkle tree

Every unseal verifies a Merkle proof before returning any data. Silent corruption — bit rot, tampered storage, partial writes — is caught cryptographically, not by hoping checksums match.
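
The tree layout is not described here, so the following is only a sketch of how a SHA3-256 Merkle root and proof check can work. The chunking, the leaf/node prefixes, the odd-node rule (pair the last node with itself), and all helper names are assumptions, not necessarily what QUANTUM-PULSE does.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first, root last."""
    level = [h(b"\x00" + c) for c in chunks]  # 0x00 prefix marks a leaf
    levels = [level]
    while len(level) > 1:
        if len(level) % 2:  # odd count: pair the last node with itself
            level = level + [level[-1]]
        # 0x01 prefix marks an interior node
        level = [h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Sibling hashes from leaf to root, each flagged if the sibling sits on the left."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):  # no right neighbour: node was paired with itself
            sibling = index
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the path from one chunk up to the root and compare."""
    node = h(b"\x00" + chunk)
    for sibling, sibling_is_left in proof:
        node = h(b"\x01" + sibling + node) if sibling_is_left else h(b"\x01" + node + sibling)
    return node == root


chunks = [b"chunk-0", b"chunk-1", b"chunk-2"]
levels = build_levels(chunks)
root = levels[-1][0]
proof = make_proof(levels, 2)

ok = verify(chunks[2], proof, root)
tampered_ok = verify(b"chunk-2 with a flipped bit", proof, root)
```

Here `ok` is `True` and `tampered_ok` is `False`: changing any byte of a chunk changes its leaf hash, which changes every hash on the path to the root.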

Benchmark results

Algorithm	Ratio	Time	Encryption	Integrity
snappy	12×	1.3ms	✗	✗
gzip-9	62×	9.9ms	✗	✗
zstd-L3	76×	1.6ms	✗	✗
QUANTUM-PULSE	95×	590ms	✓	✓
zstd-L22	99×	1745ms	✗	✗
brotli-11	112×	1441ms	✗	✗

QUANTUM-PULSE is the only option in this table with both encryption and integrity, and it's about 3× faster than vanilla zstd-L22 (590 ms vs 1745 ms).
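
For anyone building a comparable table on their own data, here is a minimal measurement sketch. It assumes "Ratio" means original size divided by compressed size and "Time" means wall-clock time for one compress call. It uses only standard-library codecs; the `measure` helper and the synthetic payload are hypothetical, and this is not the project's `scripts/benchmark_demo.py`.

```python
import gzip
import json
import lzma
import time
import zlib


def measure(compress, data: bytes) -> tuple[float, float]:
    """Return (compression ratio, elapsed milliseconds) for one compress call."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return len(data) / len(compressed), elapsed_ms


# Synthetic, highly repetitive payload; real training data will behave differently.
payload = json.dumps(
    [{"id": i, "prompt": "hello", "completion": "world"} for i in range(5000)]
).encode("utf-8")

codecs = {
    "zlib-6": lambda d: zlib.compress(d, 6),
    "gzip-9": lambda d: gzip.compress(d, compresslevel=9),
    "lzma": lzma.compress,
}

results = {name: measure(fn, payload) for name, fn in codecs.items()}
for name, (ratio, ms) in results.items():
    print(f"{name:8s} ratio={ratio:6.1f}x  time={ms:7.2f} ms")
```

Single-call timings are noisy, so repeating each measurement and reporting the median gives steadier numbers.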

Honest limitations

No formal third-party crypto audit yet (private reporting in SECURITY.md)
Passphrase key derivation uses PBKDF2-SHA256 rather than Argon2; Argon2 is planned for v1.1
MongoDB-first — S3/GCS backends on the roadmap

Try it

git clone https://github.com/Naveenub/quantum-pulse
cp .env.example .env   # set QUANTUM_PASSPHRASE
docker-compose up -d

qp seal dataset.json --tag version=v1
qp unseal <pulse-id>

python scripts/benchmark_demo.py   # reproduce the numbers