Kumar Kislay

Posted on Jun 29 • Originally published at forg.to

I Built a JSON Compressor Using Change Point Detection and It Outperforms Every Alternative

#ai #algorithms #llm #showdev

Every time your AI coding agent calls a tool, the response is usually a massive JSON array. Think about it. You ask an agent to search your codebase, it returns 500 results. You ask it to list files, it dumps the entire directory tree. You ask it to query a database, it sends back hundreds of rows. And all of that goes into the context window, token by token, dollar by dollar. The LLM reads every single item, even though it only needs maybe 20 of them to answer your question.

I kept watching my token bills climb and realized something. The vast majority of items in these arrays are noise. They follow predictable patterns, they repeat structural signatures, and they contain information the model will never use. So I built SmartCrusher, the statistical compression engine inside Copium (github.com/iKislay/copium), and it consistently removes 70-90% of array items while keeping every piece of information the LLM actually needs.

Here is how it works under the hood.

The Core Insight: Not All Array Items Are Equal

The naive approach to compressing a JSON array is truncation. Take the first N items, throw away the rest. This is what most tools do. But this is terrible because the most relevant items are rarely at the top.

SmartCrusher takes a different approach: statistical relevance scoring. It treats each item in the array as a data point and uses multiple signals to determine which items carry unique information.

The Three Scoring Signals

1. Variance-Based Change Point Detection

I reduce each array item to a number (its serialized string length combined with a structural fingerprint) and compute the variance over a rolling window across the array. A spike in variance indicates a transition between "boring repeated items" and "interesting unique items." Items at these change points get a boost in the relevance score.

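import numpy as np

# structural_fingerprint, rolling_variance, and find_peaks are helpers whose implementations are not shown here.
def detect_change_points(items: list[dict]) -> list[int]:
    fingerprints = [structural_fingerprint(item) for item in items]
    # Window is about 5% of the array, with a floor of 3 items.
    window_size = max(3, len(items) // 20)
    variances = rolling_variance(fingerprints, window_size)
    # Flag variance peaks above twice the standard deviation of the variances.
    return find_peaks(variances, threshold=2.0 * np.std(variances))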
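
A minimal, self-contained sketch of the same idea, using only the standard library in place of NumPy. The three helper bodies below are hypothetical stand-ins written for illustration, not Copium's actual implementations: the fingerprint formula, the plateau handling in find_peaks, and the choice to return window start positions are all assumptions.

```python
import json
import statistics


def structural_fingerprint(item: dict) -> float:
    """Hypothetical fingerprint: serialized length plus a key-name component."""
    serialized = json.dumps(item, sort_keys=True)
    # Extra or renamed keys shift this value, so a schema change shows up as a jump.
    key_component = sum(len(key) for key in item) * 10
    return float(len(serialized) + key_component)


def rolling_variance(values: list[float], window: int) -> list[float]:
    """Population variance of each window, one value per window start position."""
    return [
        statistics.pvariance(values[i : i + window])
        for i in range(len(values) - window + 1)
    ]


def find_peaks(values: list[float], threshold: float) -> list[int]:
    """Indices of local maxima above threshold (right edge of any plateau)."""
    peaks = []
    for i, value in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        if value > threshold and value >= left and value > right:
            peaks.append(i)
    return peaks


def detect_change_points(items: list[dict]) -> list[int]:
    """Start positions of windows whose fingerprint variance peaks."""
    fingerprints = [structural_fingerprint(item) for item in items]
    window = max(3, len(items) // 20)
    variances = rolling_variance(fingerprints, window)
    if len(variances) < 2:
        return []  # array too short to compare windows
    threshold = 2.0 * statistics.pstdev(variances)
    return find_peaks(variances, threshold)
```

With 60 near-identical rows and one oversized error row in the middle, this flags a window that contains the error row. A fully uniform array produces zero variance everywhere and therefore no change points.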

2. Kneedle Algorithm for Optimal K

How many items should we keep? This is the hardest question. Too few and the LLM misses context. Too many and you are not saving tokens.

I use the Kneedle algorithm on the sorted relevance scores. Plot the scores from highest to lowest. The "knee" of that curve is where you hit diminishing returns. Items above the knee are worth keeping. Items below contribute marginal information at disproportionate token cost.

from kneed import KneeLocator

scores_sorted = sorted(relevance_scores, reverse=True)
kneedle = KneeLocator(
    range(len(scores_sorted)),
    scores_sorted,
    curve="convex",
    direction="decreasing",
)
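# kneedle.knee is None when no knee is found; `or` also treats a knee at index 0 as missing.
# Either way, fall back to keeping the top 20% of items, and never fewer than one.
optimal_k = kneedle.knee or max(1, len(scores_sorted) // 5)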
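
To make the intuition concrete without the kneed dependency, here is a sketch of a simpler knee finder: normalize the sorted scores to the unit square and pick the point that sits furthest below the straight line joining the first and last points. This is a simplification of the idea behind Kneedle, not the kneed implementation, and the names find_knee, choose_k, and min_gap are hypothetical.

```python
from typing import Optional


def find_knee(scores_desc: list[float], min_gap: float = 0.05) -> Optional[int]:
    """Index of the point furthest below the chord, or None for flat or linear curves."""
    n = len(scores_desc)
    if n < 3:
        return None
    hi, lo = scores_desc[0], scores_desc[-1]
    if hi == lo:
        return None  # uniform scores have no knee
    best_idx, best_gap = None, min_gap
    for i, score in enumerate(scores_desc):
        x = i / (n - 1)
        y = (score - lo) / (hi - lo)
        # After normalization the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x.
        gap = (1.0 - x) - y
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


def choose_k(scores: list[float]) -> int:
    """Number of items ranked strictly above the knee, with a top-20% fallback."""
    ordered = sorted(scores, reverse=True)
    knee = find_knee(ordered)
    if knee is None:
        return max(1, len(ordered) // 5)
    return max(1, knee)
```

For scores that drop sharply after the third item, choose_k returns 3. When every score is equal, or the scores fall in a straight line, no knee exists and the fallback decides how much to keep, so the fallback policy matters most for uniform arrays.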

3. BM25 + Embedding Relevance

If there is a query context (the user's question, the tool call parameters), SmartCrusher scores each item against that query using BM25, a keyword-ranking function built on term frequency and inverse document frequency. For cases where semantic similarity matters more than keyword overlap, an optional embedding similarity pass runs on top.
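
For readers who have not used BM25, a compact sketch of scoring JSON items against a query follows. It uses the common Okapi BM25 formula with a non-negative IDF variant and the conventional defaults k1=1.5 and b=0.75. The tokenizer and the choice to score the serialized JSON text are assumptions made for illustration, not a description of SmartCrusher's internals.

```python
import json
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(
    items: list[dict], query: str, k1: float = 1.5, b: float = 0.75
) -> list[float]:
    """BM25 score of each item's serialized text against the query."""
    docs = [tokenize(json.dumps(item)) for item in items]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(doc) for doc in docs) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

Items that share no terms with the query score exactly zero, which is why a keyword pass alone can miss semantically related items.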

The final score for each item is a weighted combination:

final_score = (0.4 * change_point_boost) + (0.35 * bm25_score) + (0.25 * embedding_sim)
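
The three signals live on different scales (BM25 is unbounded, cosine similarity is not), so a weighted sum only behaves well if the inputs are comparable. The sketch below assumes min-max normalization per signal and, when the optional embedding pass is skipped, rescales the remaining two weights so they still sum to one. Both choices are assumptions; the post specifies only the weights.

```python
from typing import Optional, Sequence


def min_max(values: Sequence[float]) -> list[float]:
    """Scale values to [0, 1]; a constant signal carries no ranking information."""
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(
    change_point_boost: Sequence[float],
    bm25_score: Sequence[float],
    embedding_sim: Optional[Sequence[float]] = None,
    weights: tuple[float, float, float] = (0.4, 0.35, 0.25),
) -> list[float]:
    """Weighted sum of normalized signals, one final score per item."""
    w_cp, w_bm, w_emb = weights
    cp = min_max(change_point_boost)
    bm = min_max(bm25_score)
    if embedding_sim is None:
        # Assumption: redistribute the embedding weight across the other two signals.
        total = w_cp + w_bm
        return [(w_cp * c + w_bm * b) / total for c, b in zip(cp, bm)]
    emb = min_max(embedding_sim)
    return [w_cp * c + w_bm * b + w_emb * e for c, b, e in zip(cp, bm, emb)]
```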

The Safety Net: CCR Integration

Here is the part I am most proud of. SmartCrusher never permanently deletes anything. Every array it compresses is stored in full in the CCR (Compress-Cache-Retrieve) store, identified by a SHA-256 hash. The LLM receives the top-K items plus a retrieval marker:

[20 items shown of 1000 total. Retrieve more: copium_retrieve("abc123", "query")]

If the model needs something that was removed, it calls copium_retrieve and gets it. In practice this happens rarely: the retrieval rate ranged from 2.1% to 3.4% across the input sizes reported below. The statistical selection is that good.
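
The store-then-mark flow can be sketched in a few lines. InMemoryCCRStore and compress_with_marker are hypothetical names for illustration, the substring match in retrieve is a placeholder for whatever ranking the real store applies, and using the full hex digest as the key is an assumption.

```python
import hashlib
import json


class InMemoryCCRStore:
    """Hypothetical stand-in for the CCR store: full arrays keyed by content hash."""

    def __init__(self) -> None:
        self._arrays: dict[str, list[dict]] = {}

    def put(self, items: list[dict]) -> str:
        """Store the full array and return its SHA-256 hex digest."""
        payload = json.dumps(items, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        self._arrays[key] = items
        return key

    def retrieve(self, key: str, query: str) -> list[dict]:
        """Return stored items whose serialized text contains the query."""
        needle = query.lower()
        return [
            item
            for item in self._arrays.get(key, [])
            if needle in json.dumps(item).lower()
        ]


def compress_with_marker(
    items: list[dict], keep_indices: list[int], store: InMemoryCCRStore
) -> tuple[list[dict], str]:
    """Keep the selected items and describe how to fetch the rest."""
    key = store.put(items)
    kept = [items[i] for i in sorted(keep_indices)]
    marker = (
        f"[{len(kept)} items shown of {len(items)} total. "
        f'Retrieve more: copium_retrieve("{key}", "query")]'
    )
    return kept, marker
```

Because the key is derived from the content, storing the same array twice yields the same key, so repeated tool calls do not grow the store.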

Performance Numbers

Input Size	Items Kept	Token Savings	Retrieval Rate
100 items	12-18	82-88%	2.1%
500 items	15-25	95-97%	2.8%
1000 items	18-30	97-99%	3.4%

The overhead? SmartCrusher processes 1000 items in under 15ms. The LLM never notices.

Why Not Just Use LLM Summarization?

I tried this first. Ask GPT-4 to summarize the array. Problems:

You are paying tokens to save tokens (negative ROI for small arrays).
Summarization adds latency (2-5 seconds vs 15ms).
The summary is lossy and irreversible. If the LLM needs the original, it is gone.
The LLM hallucinates details that were not in the original data.

SmartCrusher generates zero new text. Every item in its output existed in the original array, byte for byte. This is crucial for reliability.
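
That invariant is cheap to check in a test suite. The sketch below compares serialized items and respects duplicates, so a compressed output cannot contain more copies of an item than the original did. The function name is hypothetical, and comparing json.dumps output is an assumed proxy for a byte-level comparison.

```python
import json
from collections import Counter


def is_verbatim_subset(original: list[dict], compressed: list[dict]) -> bool:
    """True if every compressed item serializes identically to an unused original item."""
    available = Counter(json.dumps(item) for item in original)
    for item in compressed:
        key = json.dumps(item)
        if available[key] == 0:
            return False  # item was altered, invented, or over-duplicated
        available[key] -= 1
    return True
```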

The Rust Acceleration

The Python prototype worked but was too slow for real-time proxy use at scale. I rewrote the hot path (fingerprinting, variance computation, Kneedle) in Rust and exposed it via PyO3. The Python interface stays clean:

from copium.transforms.smart_crusher import SmartCrusher

crusher = SmartCrusher(max_items=25, relevance_method="hybrid")
compressed = crusher.compress(original_array, query_context="find auth bugs")

But under the hood, the Rust code handles the O(n log n) operations.

Lessons Learned

Statistical methods beat heuristics. My first version used hand-tuned rules ("keep items with error in the text"). SmartCrusher's statistical approach works on any content without domain-specific rules.
Reversibility is non-negotiable. The moment you permanently delete context, you introduce failure modes. CCR makes compression a zero-risk operation.
The Kneedle algorithm is underrated. Finding optimal K without a training set is hard. Kneedle solves it elegantly for sorted score distributions.
BM25 is good enough for 90% of cases. I spent weeks on embedding-based similarity before realizing BM25 handles most tool outputs perfectly. Embeddings help only for semantic queries.

SmartCrusher is the core compression engine in Copium. It handles every JSON array that flows through the proxy, silently saving 70-90% of tokens while the LLM works exactly as if the full data were there.

Source: github.com/iKislay/copium

Top comments (1)

Max Quimby • Jun 30

The CCR retrieve-marker is the part that makes this safe to actually ship — lossy compression on tool outputs without a recovery path is how agents fail silently, and handing the model an escape hatch to pull copium_retrieve(...) turns a hard drop into a soft one. Right call. The number I'd scrutinize is the <3% retrieval rate: it measures how often the model asks for more, not how often it needed something it never got and never realized was missing. The dangerous case is the model confidently answering from the 18 shown items while dropped item #340 quietly contradicts them — it can't fire a retrieve call for a gap it can't see, so retrieval rate structurally undercounts silent misses. Have you measured the false-negative directly — same agent task on full vs compressed arrays, then diff the final answers? That's the only signal that catches the silent class. Separately, curious how Kneedle behaves on a genuinely uniform array: a paginated list where every row matters equally has no knee, and I'd worry the curve-fit over-prunes exactly when the user wanted a complete count.