A solo dev's journey questioning a 40-year-old assumption in deep learning
The Question That Started It All
I was frustrated.
VS Code was getting heavier on my laptop. Cursor wanted $20/month. The best AI agents were owned by 5 mega-corporations. As a developer in Indonesia, I sometimes felt we were perpetual consumers, never creators.
So I asked Claude: "Can AI be smaller?"
That conversation led somewhere unexpected. We started questioning an assumption deep learning has carried since backpropagation in 1986:
Do weights have to be stored as matrices of numbers?
Think about it. A human brain doesn't store information as numbers. A seed doesn't contain all the leaves of a tree inside it — a seed contains instructions to grow leaves.
What if AI weights could be grown from a small seed when needed, instead of stored as massive matrices? A 30B model could fit on a smartphone. No cloud needed. No subscription. No billion-dollar hardware.
I named the project WIJI — "seed" in Javanese. The Javanese script: ꦮꦶꦗꦶ
Slogan: "Memaksimalkan yang minimal" — maximize the minimal.
The Setup
I'm a solo developer. No PhD. No GPU cluster. Just a laptop, curiosity, and AI as a research collaborator.
The plan:
- Take TinyLlama 1.1B (small enough to run on CPU)
- Try to reconstruct its weights using a tiny generator network
- Replace original weights with generated ones
- See if the model still works
If a single layer works, scale up. If not, learn why and pivot.
Experiment 1: Single Matrix
I started with one weight matrix: o_proj of layer 0. It has 4.2 million parameters.
I built a coordinate-based MLP generator: input is (row, col) coordinates, output is the weight value at that position. The generator has only 164K parameters — 25x smaller than the target.
Training was simple: sample random coordinates, predict their values, minimize MSE.
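A minimal PyTorch sketch of the idea (the architecture and hyperparameters here are illustrative, not my exact 164K-parameter configuration):

```python
import torch
import torch.nn as nn

class CoordGenerator(nn.Module):
    """Coordinate-based generator: (row, col) -> predicted weight value."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):
        # coords: (batch, 2), normalized to [0, 1]
        return self.net(coords).squeeze(-1)

def train_on_matrix(target, steps=5000, batch=4096, lr=1e-3):
    """Fit a generator to one weight matrix by regressing sampled entries.
    `target` is a 2-D float tensor, e.g. o_proj.weight.detach()."""
    rows, cols = target.shape
    gen = CoordGenerator()
    opt = torch.optim.Adam(gen.parameters(), lr=lr)
    for _ in range(steps):
        r = torch.randint(0, rows, (batch,))
        c = torch.randint(0, cols, (batch,))
        coords = torch.stack([r / (rows - 1), c / (cols - 1)], dim=-1)
        loss = nn.functional.mse_loss(gen(coords), target[r, c])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return gen
```

Reconstruction is then just evaluating the trained generator over the full (row, col) grid and reshaping the output back into a matrix.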
After 5000 steps, MSE settled at 0.000067. I reconstructed the full matrix and replaced the original in the model.
Test prompt: "What is the capital of Indonesia?"
Original output: "Indonesia's capital is Jakarta."
Reconstructed output: "The capital of Indonesia is Jakarta."
It worked. Different words, same meaning. The model still functioned with weights compressed 25x.
I was elated. Phase 0 looked promising.
Experiment 2: Going Big
If 1 layer works, why not all 22 layers?
I added a layer embedding to the generator so it could handle multiple layers. Same 164K params, but now it had to represent 22 different weight distributions totaling 92M parameters.
That's 545x compression.
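The conditioning trick is simple: concatenate a learned per-layer embedding onto the coordinates. A hypothetical sketch (again, sizes are illustrative):

```python
import torch
import torch.nn as nn

class MultiLayerGenerator(nn.Module):
    """One MLP conditioned on a learned per-layer embedding."""
    def __init__(self, n_layers=22, emb_dim=16, hidden=256):
        super().__init__()
        self.layer_emb = nn.Embedding(n_layers, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(2 + emb_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords, layer_idx):
        # coords: (batch, 2); layer_idx: (batch,) integer layer IDs
        x = torch.cat([coords, self.layer_emb(layer_idx)], dim=-1)
        return self.net(x).squeeze(-1)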
I trained for 3000 steps. MSE settled at 0.000234 — only 3x higher than experiment 1. Should be fine, right?
Output: "Ingatescripturecordialoisimoisequalifiesearchivedeastern Discogs, and"
Complete gibberish.
This was my first lesson: MSE Loss is not a reliable predictor for LLM output quality. The loss only got 3x worse, but the output collapsed entirely.
Experiment A: Diagnostic
Before scaling more, I needed to understand: was the failure because of multi-component (Q/K/V/O matrices in one layer) or multi-layer (across 22 layers)?
So I tested: single generator handling all 4 attention components in just layer 0. 4 matrices, but still 1 layer.
MSE settled at 0.000400 — 6x higher than experiment 1.
Output: "The capital of Indonesia is Jakarta."
Still functional. So the issue wasn't multi-component. The issue was multi-layer.
Now I had a hypothesis: error compounds across layers. Each layer's small error becomes the next layer's wrong input, which produces bigger errors, until the model collapses.
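Here's a toy illustration of that intuition (random layers, not transformer layers, so purely qualitative): inject a small weight perturbation at every layer and watch the relative activation error snowball.

```python
import torch

torch.manual_seed(0)
dim, n_layers, noise_scale = 512, 22, 0.003

x_clean = x_noisy = torch.randn(dim)
for i in range(n_layers):
    W = torch.randn(dim, dim) / dim**0.5  # roughly norm-preserving random layer
    noise = noise_scale * torch.randn(dim, dim)
    x_clean = torch.tanh(W @ x_clean)
    x_noisy = torch.tanh((W + noise) @ x_noisy)
    rel_err = (x_noisy - x_clean).norm() / x_clean.norm()
    print(f"after layer {i:2d}: relative error = {rel_err:.3f}")
```

A fraction-of-a-percent perturbation per layer compounds into activations that have little to do with the clean ones by the final layers.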
Experiment B2: Microservices for Layers
If one generator can't handle 22 layers, what if I built 22 separate generators? One per layer, each specializing.
22 generators × 164K params = 3.6M total. Compression: 25x. Same as experiment 1.
I trained each generator for 1000 steps. The training logs revealed something important:
| Layer | MSE Loss |
|---|---|
| 0 | 0.000063 |
| 5 | 0.000203 |
| 10 | 0.000225 |
| 15 | 0.000244 |
| 21 | 0.000373 |
Error increases monotonically from early layers to late layers. Same generator capacity, same training budget, but layer 21 was 6x harder to fit than layer 0.
This makes sense: early layers in transformers capture simple syntactic patterns, while late layers capture complex semantic ones. A small generator can fit the former but struggles with the latter.
Output: "WHEREASPark. ."
Failure. But informative failure.
Experiment B3: Adaptive Capacity
If late layers need more capacity, give them more capacity:
- Layers 0-7: 128 hidden dim (small)
- Layers 8-15: 256 hidden dim (medium)
- Layers 16-21: 512 hidden dim (large)
Total params: 5.9M. Compression: 15x (a weaker ratio, but the extra capacity should help).
I also tripled training steps to 3000 per generator.
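In code this is just a capacity schedule; a hypothetical sketch, reusing the CoordGenerator from the Experiment 1 sketch:

```python
def hidden_dim_for_layer(layer_idx: int) -> int:
    """B3's tiered capacity schedule: bigger generators for later layers."""
    if layer_idx <= 7:
        return 128
    if layer_idx <= 15:
        return 256
    return 512

# One generator per layer, sized by depth.
generators = [CoordGenerator(hidden=hidden_dim_for_layer(i)) for i in range(22)]
```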
Result for layer 21: MSE = 0.000366 (vs B2's 0.000373).
Almost identical. 10x more capacity, 3x more training, virtually no improvement.
Output: (degenerate loop)
This was the most important finding of the entire project:
There's a fundamental limit: the MSE plateaus at ~0.0003-0.0004 regardless of generator capacity and training time.
This phenomenon has a name in research: spectral bias. Neural networks with ReLU/GELU activations have an inductive bias toward smooth functions. Transformer weights look like noise — high-frequency random distributions.
Throwing capacity at the problem doesn't help because the architecture itself is wrong for this task.
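The standard remedy from the NeRF literature is a Fourier feature mapping: encode each coordinate through sines and cosines at multiple frequencies before the MLP sees it, so the network can represent high-frequency structure. A minimal sketch of what I plan to test in Phase 1 (untested as of this writing):

```python
import torch

def fourier_features(coords, n_freqs=8):
    """NeRF-style positional encoding for coordinate inputs."""
    # coords: (batch, 2) in [0, 1] -> (batch, 2 * 2 * n_freqs)
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi   # geometrically spaced
    angles = coords.unsqueeze(-1) * freqs             # (batch, 2, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
```

The generator's first linear layer would then take `4 * n_freqs` inputs instead of 2.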
Experiment B4: The Cliff Edge
I had 22 trained generators. Before giving up, I wanted to know: where exactly does the model fail?
I ran a progressive swap test. Replace layers 0 to N-1 with generated weights. Test inference. Increment N. See what happens.
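Sketched in PyTorch, assuming a Hugging Face TinyLlama checkpoint (`model.model.layers[i].self_attn.o_proj` is the Llama module path), with `reconstruct` and `generate` as placeholder helpers standing in for grid evaluation of a trained generator and greedy decoding:

```python
import copy
import torch

def progressive_swap(model, reconstruct, generate, prompt, max_layers=22):
    """Replace o_proj in layers 0..N-1 with generated weights, test, repeat."""
    for n in range(1, max_layers + 1):
        m = copy.deepcopy(model)
        with torch.no_grad():
            for i in range(n):
                m.model.layers[i].self_attn.o_proj.weight.copy_(reconstruct(i))
        print(f"N={n:2d}: {generate(m, prompt)!r}")
```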
| N (layers replaced) | Output | Status |
|---|---|---|
| 1 | "The capital of Indonesia is Jakarta." | ✅ Perfect |
| 3 | "The capital of Ia is 10." | ⚠️ Partial collapse |
| 5 | "" (empty) | ❌ Collapse |
| 8 | "" (empty) | ❌ Collapse |
| 12 | "" (empty) | ❌ Collapse |
| 16 | "" (empty) | ❌ Collapse |
| 22 | "Ingunsuretournalty. WHERE2..." | ❌ Gibberish |
Cliff edge between N=1 and N=3. Sharp, not gradual. Phase transition.
But the most counterintuitive finding: N=22 produces output, while N=5-16 produce empty strings.
When you replace some layers but keep others, the generated layers produce activations that are out of distribution for the original layers downstream. The mismatch causes probability collapse: the model produces nothing.
When you replace ALL layers, the corruption is internally consistent. The model still produces gibberish, but it produces something.
The lesson: internal consistency matters more than absolute correctness.
What I Actually Learned
After 5 experiments and many hours of failure, here's what I have:
Validated empirically:
- ✅ Weight matrices have significant redundancy (25-56x compression worked for a single layer)
- ✅ MSE Loss is a misleading metric for LLM compression quality
- ✅ A cliff edge exists between N=1 and N=3 replaced layers
- ✅ Capacity scaling doesn't solve spectral bias
- ✅ Internal consistency > absolute correctness in deep networks
Open questions for next phase:
- Can Fourier features (the NeRF-style positional encoding sketched earlier) overcome spectral bias?
- Are FFN layers easier to reconstruct than attention layers?
- Can an output-aware loss (KL divergence) replace MSE? (see the sketch after this list)
- Does cliff edge shift with bigger models?
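On the output-aware loss question, the idea is distillation-style: compare what the model says rather than how its weights look. A hedged sketch, where the logits would come from forward passes of the original and reconstructed models on the same inputs:

```python
import torch
import torch.nn.functional as F

def output_aware_loss(orig_logits, recon_logits, temperature=1.0):
    """KL(original || reconstructed) over the vocabulary, averaged over
    positions: penalizes reconstructions that change the model's output
    distribution, not just its weight values."""
    p = F.log_softmax(orig_logits / temperature, dim=-1)
    q = F.log_softmax(recon_logits / temperature, dim=-1)
    return F.kl_div(q, p, log_target=True, reduction="batchmean")
```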
Honest probability assessment:
- 40% that the next phase finds something useful
- 25% that we get a working prototype
- 5-10% that this leads to a genuine breakthrough
But 10% × "fundamentally change AI deployment" = high expected value for a solo dev with AI as collaborator.
Why I'm Sharing This
I could have buried these failures and only shared the success. That's the temptation.
Instead, I'm publishing everything: code, failures, insights, and methodology. Why?
Negative results are valuable. Someone else attempting this will save weeks knowing where the cliff edge is.
Open source is legacy. Even if I stop maintaining this, the experiments stay accessible forever.
Solo dev + AI is a new research methodology. I want to demonstrate what's possible. Other developers in Indonesia, in developing countries, in their bedrooms — they can ask hard questions and explore them. They don't need to wait for FAANG employment to contribute to AI research.
Centralization is the enemy. AI is concentrating into the hands of 5 corporations. If we accept that, our future is dystopian. WIJI is a contrarian bet — that intelligence can be made minimal, affordable, and owned by everyone.
What's Next
Phase 1 plans:
- Fourier features experiment: may address spectral bias, based on NeRF research
- FFN layer test: a different weight distribution, possibly easier to compress
- Streaming inference system: a pragmatic design that accepts the N=1 limit but still works
- Rust port: for proper performance benchmarking
I'll publish results as I go. Failures and successes both.
If you're interested in this kind of research, the repo is fully open:
🔗 github.com/sangkan-dev/wiji-experimental
Critique welcome. Collaboration welcome. Even philosophical disagreement welcome.
"Mari kita lebih menggila di dunia yang udah gila ini."
Let's get crazier in this already-crazy world.
About the author: HasanH47, a DevOps Engineer based in Yogyakarta, Indonesia. Building products at the intersection of local context and frontier technology. Project under the Sangkan organization.
If this article gave you something to think about, consider following for updates on Phase 1 results. And if you're working on anything related, please reach out.