Naveen Badiger

How I built a 39× compression pipeline with AES-256-GCM in Python (and why the dictionary is everything)

QUANTUM-PULSE — docker up, seal, unseal, benchmark

I store LLM training data. Every tool I found either compresses it or encrypts it — nothing did both. So I built QUANTUM-PULSE.

The pipeline

payload → MsgPack → Zstd-L22 + corpus dict → AES-256-GCM → SHA3-256 Merkle

Step 1: MsgPack over JSON

Before compression, MsgPack shrinks the payload by ~22%:

import msgpack
raw = msgpack.packb(payload, use_bin_type=True)
# 22% smaller than json.dumps().encode() — better input = better downstream ratio
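
The size saving depends on the shape of the payload, so it is worth measuring on your own records. A minimal sketch, assuming a hypothetical training record (the field names below are made up for illustration and are not from the project):

```python
import json

import msgpack

# Hypothetical training record; field names are illustrative only.
payload = {
    "id": 12345,
    "prompt": "Translate to French: good morning",
    "completion": "bonjour",
    "tokens": [101, 2023, 2003, 1037, 7099, 102],
    "score": 0.87,
    "flagged": False,
}

as_json = json.dumps(payload).encode("utf-8")
as_msgpack = msgpack.packb(payload, use_bin_type=True)

# Fraction of bytes saved by MsgPack relative to JSON for this one record.
saving = 1 - len(as_msgpack) / len(as_json)
print(f"JSON: {len(as_json)} B, MsgPack: {len(as_msgpack)} B, saving: {saving:.1%}")

# MsgPack is lossless for this payload: unpacking returns an equal dict.
restored = msgpack.unpackb(as_msgpack, raw=False)
```

Records dominated by long free text will see a smaller saving than records dominated by numbers, booleans, and short keys, because MsgPack stores strings at roughly the same size as JSON does.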

Step 2: The dictionary insight

Standard Zstd builds a probability model from scratch every time. For training records sharing the same schema, this is wasted work.

Train once:

import zstandard as zstd

dict_data = zstd.train_dictionary(131072, corpus_samples[:200])
cctx = zstd.ZstdCompressor(level=22, dict_data=dict_data)
compressed = cctx.compress(raw)

Result: 28.46× with dict vs 14.64× vanilla — +94.4% improvement, 29% faster.
The dictionary retrains automatically every 24h via APScheduler as new data arrives.

Step 3: AES-256-GCM with per-blob HKDF keys

One passphrase → master key (PBKDF2-SHA256) → unique HKDF-derived key per blob:

from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
import os

# Unique key per blob: one compromised blob key reveals nothing about the others
blob_key = HKDF(
    algorithm=hashes.SHA256(), length=32,
    salt=blob_salt, info=pulse_id.encode()
).derive(master_key)

nonce = os.urandom(12)  # fresh 96-bit nonce per seal; store it with the ciphertext
ciphertext = AESGCM(blob_key).encrypt(nonce, compressed, None)  # 16-byte GCM tag is appended
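
The snippet above covers sealing only. Unsealing needs the same salt and nonce, so they have to travel with the ciphertext. Here is a minimal round-trip sketch, assuming a blob layout of `salt || nonce || ciphertext+tag` and a PBKDF2 iteration count of 600,000. The layout, the iteration count, and the function names are assumptions for illustration and may differ from what QUANTUM-PULSE actually does.

```python
import hashlib
import os

from cryptography.exceptions import InvalidTag
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.ciphers.aead import AESGCM
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

# Assumed parameters, not taken from the project.
PBKDF2_ITERATIONS = 600_000
SALT_LEN = 16
NONCE_LEN = 12


def derive_master_key(passphrase: str, master_salt: bytes) -> bytes:
    """Stretch the passphrase into a 32-byte master key with PBKDF2-SHA256."""
    return hashlib.pbkdf2_hmac(
        "sha256", passphrase.encode("utf-8"), master_salt, PBKDF2_ITERATIONS, dklen=32
    )


def derive_blob_key(master_key: bytes, blob_salt: bytes, pulse_id: str) -> bytes:
    """Derive a per-blob AES-256 key, bound to the blob's ID via HKDF info."""
    return HKDF(
        algorithm=hashes.SHA256(), length=32,
        salt=blob_salt, info=pulse_id.encode("utf-8"),
    ).derive(master_key)


def seal(master_key: bytes, pulse_id: str, compressed: bytes) -> bytes:
    blob_salt = os.urandom(SALT_LEN)
    nonce = os.urandom(NONCE_LEN)
    key = derive_blob_key(master_key, blob_salt, pulse_id)
    return blob_salt + nonce + AESGCM(key).encrypt(nonce, compressed, None)


def unseal(master_key: bytes, pulse_id: str, blob: bytes) -> bytes:
    blob_salt = blob[:SALT_LEN]
    nonce = blob[SALT_LEN:SALT_LEN + NONCE_LEN]
    ciphertext = blob[SALT_LEN + NONCE_LEN:]
    key = derive_blob_key(master_key, blob_salt, pulse_id)
    # Raises InvalidTag if the key, nonce, or ciphertext is wrong or modified.
    return AESGCM(key).decrypt(nonce, ciphertext, None)


# The master salt must be stored too, or the master key cannot be re-derived.
master_key = derive_master_key("correct horse battery staple", os.urandom(SALT_LEN))
blob = seal(master_key, "pulse-0001", b"compressed bytes")
plaintext = unseal(master_key, "pulse-0001", blob)

# Flip one bit in the last byte (part of the GCM tag) and try again.
tampered = blob[:-1] + bytes([blob[-1] ^ 0x01])
try:
    unseal(master_key, "pulse-0001", tampered)
    tamper_detected = False
except InvalidTag:
    tamper_detected = True
```

Because `pulse_id` feeds into HKDF, unsealing a blob under the wrong ID derives a different key and fails authentication in the same way as a tampered blob.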

Step 4: SHA3-256 Merkle tree

Every unseal verifies a Merkle proof before returning any data. Silent corruption — bit rot, tampered storage, partial writes — is caught cryptographically, not by hoping checksums match.
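
To make the idea concrete, here is a small SHA3-256 Merkle tree with proof generation and verification, using only the standard library. It is a sketch under assumptions: the 4096-byte chunk size, the leaf/node prefix bytes, and duplicating the last node on odd levels are illustrative choices, and the project's actual tree layout may differ.

```python
import hashlib
import hmac


def _h(data: bytes) -> bytes:
    return hashlib.sha3_256(data).digest()


def build_levels(chunks: list[bytes]) -> list[list[bytes]]:
    """Return every tree level, leaves first; the last level holds the root."""
    if not chunks:
        raise ValueError("need at least one chunk")
    # Distinct prefixes keep leaf hashes and internal-node hashes separate.
    level = [_h(b"\x00" + c) for c in chunks]
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2:
            level = level + [level[-1]]  # duplicate last node on odd levels
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def make_proof(levels: list[list[bytes]], index: int) -> list[tuple[bytes, bool]]:
    """Collect (sibling_hash, sibling_is_left) pairs from leaf up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify(chunk: bytes, proof: list[tuple[bytes, bool]], root: bytes) -> bool:
    node = _h(b"\x00" + chunk)
    for sibling, sibling_is_left in proof:
        pair = sibling + node if sibling_is_left else node + sibling
        node = _h(b"\x01" + pair)
    return hmac.compare_digest(node, root)


data = bytes(range(256)) * 40  # 10240 bytes of stand-in ciphertext
chunks = [data[i:i + 4096] for i in range(0, len(data), 4096)]
levels = build_levels(chunks)
root = levels[-1][0]

proof = make_proof(levels, 1)
intact_ok = verify(chunks[1], proof, root)

# Simulate bit rot: flip one bit in the chunk and verify against the same proof.
corrupted = bytes([chunks[1][0] ^ 0x01]) + chunks[1][1:]
corrupted_ok = verify(corrupted, proof, root)
```

A proof holds one sibling hash per tree level, so verifying a single chunk costs a logarithmic number of hashes rather than a rehash of the whole blob.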

Benchmark results

Algorithm       Ratio   Time      Enc   Integrity
snappy          12×     1.3 ms    -     -
gzip-9          62×     9.9 ms    -     -
zstd-L3         76×     1.6 ms    -     -
QUANTUM-PULSE   95×     590 ms    yes   yes
zstd-L22        99×     1745 ms   -     -
brotli-11       112×    1441 ms   -     -

QUANTUM-PULSE is the only option in the table with both encryption and integrity, and at 590 ms it is about 3× faster than vanilla zstd-L22 (1745 ms) for a slightly lower ratio (95× vs 99×).
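
For readers who want to sanity-check numbers like these on their own data, the two columns can be measured as ratio = raw bytes / compressed bytes and time = best wall-clock run. The harness below is a sketch, not the project's `scripts/benchmark_demo.py`. It uses the standard library's zlib (deflate, the closest stdlib analogue to the gzip row) on synthetic records, so its output will not match the table.

```python
import json
import time
import zlib
from typing import Callable


def measure(name: str, compress: Callable[[bytes], bytes], raw: bytes, repeats: int = 5) -> dict:
    """Report compression ratio and the best wall-clock time over several runs."""
    best = float("inf")
    out = b""
    for _ in range(repeats):
        start = time.perf_counter()
        out = compress(raw)
        best = min(best, time.perf_counter() - start)
    return {"name": name, "ratio": len(raw) / len(out), "ms": best * 1000}


# Synthetic, highly repetitive records (illustrative only).
records = [
    {"id": i, "source": "web", "text": f"sample sentence number {i % 50}"}
    for i in range(5000)
]
raw = json.dumps(records).encode("utf-8")

results = [
    measure("zlib-1", lambda b: zlib.compress(b, 1), raw),
    measure("zlib-9", lambda b: zlib.compress(b, 9), raw),
]
for r in results:
    print(f"{r['name']:8s} {r['ratio']:6.1f}x {r['ms']:8.2f} ms")
```

Taking the best of several runs reduces noise from the OS scheduler. Any other compressor can be plugged in by passing a different `compress` callable.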

Honest limitations

  • No formal third-party crypto audit yet (private reporting in SECURITY.md)
  • PBKDF2-SHA256 over Argon2 — Argon2 planned for v1.1
  • MongoDB-first — S3/GCS backends on the roadmap

Try it

git clone https://github.com/Naveenub/quantum-pulse
cd quantum-pulse
cp .env.example .env   # set QUANTUM_PASSPHRASE
docker-compose up -d

qp seal dataset.json --tag version=v1
qp unseal <pulse-id>

python scripts/benchmark_demo.py   # reproduce the numbers

MIT license. 277 tests.
https://github.com/Naveenub/quantum-pulse
