When your AI coding agent calls a tool, the response is often a massive JSON array. Think about it. You ask an agent to search your codebase, and it returns 500 results. You ask it to list files, and it dumps the entire directory tree. You ask it to query a database, and it sends back hundreds of rows. All of that goes into the context window, token by token, dollar by dollar. The LLM reads every single item, even though it may need only 20 or so of them to answer your question.
I kept watching my token bills climb and realized something. The vast majority of items in these arrays are noise. They follow predictable patterns, they repeat structural signatures, and they contain information the model will never use. So I built SmartCrusher, the statistical compression engine inside Copium (github.com/iKislay/copium), and it consistently removes 70-90% of array items while keeping every piece of information the LLM actually needs.
Here is how it works under the hood.
The Core Insight: Not All Array Items Are Equal
The naive approach to compressing a JSON array is truncation: take the first N items and throw away the rest. Many tools do exactly this. It works poorly because relevance has nothing to do with position, so the items the model needs are often not in the first N.
SmartCrusher takes a different approach: statistical relevance scoring. It treats each item in the array as a data point and uses multiple signals to determine which items carry unique information.
The Three Scoring Signals
1. Variance-Based Change Point Detection
I reduce each item to a numeric signal (its serialized string length plus a structural fingerprint) and compute a rolling variance over that sequence. A spike in the variance marks a transition between "boring repeated items" and "interesting unique items." Items at these change points get a boost in the relevance score.
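import numpy as np

def detect_change_points(items: list[dict]) -> list[int]:
    # One numeric fingerprint per item (serialized length + structure)
    fingerprints = [structural_fingerprint(item) for item in items]
    # Window scales with array size, but never drops below 3 items
    window_size = max(3, len(items) // 20)
    variances = rolling_variance(fingerprints, window_size)
    # A change point is a variance peak above twice the std of the variances
    return find_peaks(variances, threshold=2.0 * np.std(variances))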
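The helpers called in that snippet (structural_fingerprint, rolling_variance, find_peaks) are not shown in this post. As a rough, stdlib-only sketch of how they could fit together: item_signal below is a hypothetical stand-in for structural_fingerprint (serialized length plus key count), and the trailing-window variance and local-maximum peak test are my assumptions for illustration, not necessarily what Copium ships.

```python
import json
from statistics import pstdev, pvariance


def item_signal(item: dict) -> float:
    # Hypothetical stand-in for structural_fingerprint:
    # serialized length plus the number of top-level keys.
    return float(len(json.dumps(item, sort_keys=True)) + len(item))


def rolling_variance(values: list[float], window: int) -> list[float]:
    # Population variance over each trailing window (shorter at the start).
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(pvariance(chunk))
    return out


def find_peaks(values: list[float], threshold: float) -> list[int]:
    # Indices whose value exceeds the threshold and is a local maximum.
    peaks = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else float("-inf")
        right = values[i + 1] if i + 1 < len(values) else float("-inf")
        if v > threshold and v >= left and v >= right:
            peaks.append(i)
    return peaks


def detect_change_points(items: list[dict]) -> list[int]:
    if not items:
        return []
    signals = [item_signal(item) for item in items]
    window = max(3, len(items) // 20)
    variances = rolling_variance(signals, window)
    return find_peaks(variances, threshold=2.0 * pstdev(variances))
```

With 30 identical items, one oversized item, and 30 more identical items, the variance is zero everywhere except in the windows that contain the outlier, so only those indices come back.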
2. Kneedle Algorithm for Optimal K
How many items should we keep? This is the hardest question. Too few and the LLM misses context. Too many and you are not saving tokens.
I use the Kneedle algorithm on the sorted relevance scores. Plot the scores from highest to lowest. The "knee" of that curve is where you hit diminishing returns. Items above the knee are worth keeping. Items below contribute marginal information at disproportionate token cost.
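from kneed import KneeLocator

# Highest score first, so the curve falls steeply and then flattens
scores_sorted = sorted(relevance_scores, reverse=True)
kneedle = KneeLocator(
    range(len(scores_sorted)),
    scores_sorted,
    curve="convex",
    direction="decreasing",
)
# kneedle.knee is None when no knee is found; fall back to the top 20%
optimal_k = kneedle.knee or len(scores_sorted) // 5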
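To make the "knee" idea concrete without the kneed dependency, here is a simplified sketch of the core geometric step: normalize the sorted scores to the unit square and pick the point that sits farthest below the straight line joining the first and last score. The real Kneedle algorithm also smooths the curve and applies a sensitivity threshold, both of which this omits; knee_index is a hypothetical name.

```python
from typing import Optional


def knee_index(scores_sorted: list[float]) -> Optional[int]:
    """Index of the knee in a descending score list, or None if there is none."""
    n = len(scores_sorted)
    if n < 3:
        return None
    hi, lo = scores_sorted[0], scores_sorted[-1]
    if hi == lo:
        return None  # flat curve, no knee
    best_i, best_gap = None, 0.0
    for i, score in enumerate(scores_sorted):
        x = i / (n - 1)                  # position, normalized to [0, 1]
        y = (score - lo) / (hi - lo)     # score, normalized to [0, 1]
        gap = (1.0 - x) - y              # distance below the chord y = 1 - x
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i
```

For scores like 10, 9, 8, 1, 0.5, ... the largest gap lands on the first item after the drop, which is where keeping more items stops paying off.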
3. BM25 + Embedding Relevance
If there is a query context (the user's question, the tool call parameters), SmartCrusher scores each item against that query using BM25 term frequency. For cases where semantic similarity matters more than keyword overlap, an optional embedding similarity pass runs on top.
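For readers who have not seen BM25 spelled out, here is a small self-contained sketch of the standard Okapi BM25 formula applied to serialized items. The tokenizer, the k1 and b defaults, and the non-negative idf variant are common choices that I am assuming here; they are not taken from the Copium source.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; "auth.py" becomes ["auth", "py"].
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    if n == 0:
        return []
    avg_len = (sum(len(t) for t in tokenized) / n) or 1.0
    # Document frequency: how many docs contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Each array item would be passed in as its JSON string, and the query would be the user's question or the tool call parameters.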
The final score for each item is a weighted combination:
final_score = (0.4 * change_point_boost) + (0.35 * bm25_score) + (0.25 * embedding_sim)
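One practical detail when combining these signals is that they live on different scales (BM25 is unbounded, cosine similarity is not). The sketch below applies the weights from the formula after min-max normalizing each signal, and renormalizes the weights when the optional embedding pass is skipped. Both of those steps are my assumptions for illustration; the post does not say how Copium handles them.

```python
from typing import Optional


def min_max(values: list[float]) -> list[float]:
    # Scale to [0, 1]; a constant signal carries no information, so map it to 0.
    if not values:
        return []
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combine_scores(change_point_boost: list[float],
                   bm25: list[float],
                   embedding_sim: Optional[list[float]] = None) -> list[float]:
    weights = {"cp": 0.4, "bm25": 0.35, "emb": 0.25}
    signals = {"cp": min_max(change_point_boost), "bm25": min_max(bm25)}
    if embedding_sim is not None:
        signals["emb"] = min_max(embedding_sim)
    # Renormalize so the active weights sum to 1 (assumption, see lead-in).
    total = sum(weights[name] for name in signals)
    return [
        sum(weights[name] * signals[name][i] for name in signals) / total
        for i in range(len(bm25))
    ]
```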
The Safety Net: CCR Integration
Here is the part I am most proud of. SmartCrusher never permanently deletes anything. Whenever it compresses an array, the full original gets stored in the CCR (Compress-Cache-Retrieve) store with a SHA-256 hash. The LLM receives the top-K items plus a retrieval marker:
[20 items shown of 1000 total. Retrieve more: copium_retrieve("abc123", "query")]
If the model needs something that was removed, it calls copium_retrieve and gets it. In practice, this retrieval happens in only about 2-3.4% of cases, so the statistical selection rarely drops something the model needs.
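The store-then-mark flow can be sketched in a few lines. CCRStore and with_marker below are hypothetical names, the store is in-memory, the SHA-256 digest is used as the lookup key, and retrieval is a plain substring match; all of these are assumptions for illustration. Only the marker wording is taken from the example above.

```python
import hashlib
import json


class CCRStore:
    """Hypothetical in-memory stand-in for the CCR store."""

    def __init__(self) -> None:
        self._blobs: dict[str, list[dict]] = {}

    def put(self, items: list[dict]) -> str:
        # Canonical serialization so identical arrays get identical keys.
        payload = json.dumps(items, sort_keys=True, separators=(",", ":"))
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._blobs[key] = items
        return key

    def retrieve(self, key: str, query: str) -> list[dict]:
        # Naive substring match; a real store would reuse the relevance scorer.
        needle = query.lower()
        return [item for item in self._blobs.get(key, [])
                if needle in json.dumps(item).lower()]


def with_marker(kept: list[dict], original: list[dict],
                store: CCRStore) -> tuple[str, str]:
    # Store the full original first, then build the marker the LLM will see.
    key = store.put(original)
    marker = (f"[{len(kept)} items shown of {len(original)} total. "
              f'Retrieve more: copium_retrieve("{key}", "query")]')
    return key, marker
```

Because the key is derived from the content, storing the same array twice does not duplicate it, and the marker alone is enough for the model to get back anything that was dropped.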
Performance Numbers
| Input Size | Items Kept | Token Savings | Retrieval Rate |
|---|---|---|---|
| 100 items | 12-18 | 82-88% | 2.1% |
| 500 items | 15-25 | 95-97% | 2.8% |
| 1000 items | 18-30 | 97-99% | 3.4% |
The overhead? SmartCrusher processes 1000 items in under 15ms. The LLM never notices.
Why Not Just Use LLM Summarization?
I tried this first. Ask GPT-4 to summarize the array. Problems:
- You are paying tokens to save tokens (negative ROI for small arrays)
- Summarization adds latency (2-5 seconds vs 15ms)
- The summary is lossy and irreversible. If the LLM needs the original, it is gone
- The LLM hallucinates details that were not in the original data
SmartCrusher generates zero new text. Every item in its output existed in the original array, byte for byte. This is crucial for reliability.
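This property is easy to check mechanically, which is handy in tests. The helper below is a hypothetical sketch (it is not part of Copium's API as far as this post shows): it confirms that every item in a compressed array appears in the original, counting duplicates, by comparing canonical JSON serializations.

```python
import json
from collections import Counter


def is_verbatim_subset(original: list[dict], compressed: list[dict]) -> bool:
    """True if every compressed item occurs in the original (multiset subset)."""
    def canon(item: dict) -> str:
        return json.dumps(item, sort_keys=True)

    available = Counter(canon(item) for item in original)
    for item in compressed:
        key = canon(item)
        if available[key] == 0:
            return False  # item was altered or invented
        available[key] -= 1
    return True
```

An LLM-generated summary would fail this check by construction, since it emits text that never existed in the input.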
The Rust Acceleration
The Python prototype worked but was too slow for real-time proxy use at scale. I rewrote the hot path (fingerprinting, variance computation, kneedle) in Rust and exposed it via PyO3. The Python interface stays clean:
from copium.transforms.smart_crusher import SmartCrusher
crusher = SmartCrusher(max_items=25, relevance_method="hybrid")
compressed = crusher.compress(original_array, query_context="find auth bugs")
But under the hood, the Rust code handles the O(n log n) operations.
Lessons Learned
Statistical methods beat heuristics. My first version used hand-tuned rules ("keep items with error in the text"). SmartCrusher's statistical approach works on any content without domain-specific rules.
Reversibility is non-negotiable. The moment you permanently delete context, you introduce failure modes. CCR makes compression a zero-risk operation.
The Kneedle algorithm is underrated. Finding optimal K without a training set is hard. Kneedle solves it elegantly for sorted score distributions.
BM25 is good enough for 90% of cases. I spent weeks on embedding-based similarity before realizing BM25 handles most tool outputs perfectly. Embeddings help only for semantic queries.
SmartCrusher is the core compression engine in Copium. It handles every JSON array that flows through the proxy, silently saving 70-90% of tokens while the LLM works exactly as if the full data were there.
Source: github.com/iKislay/copium