<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: HasanH47</title>
    <description>The latest articles on DEV Community by HasanH47 (@hasanh47).</description>
    <link>https://dev.to/hasanh47</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3911471%2F346b5a59-a959-4e35-9184-b31b32af8b40.jpeg</url>
      <title>DEV Community: HasanH47</title>
      <link>https://dev.to/hasanh47</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hasanh47"/>
    <language>en</language>
    <item>
      <title>I Tried to Compress an LLM by 545x. Here's What Happened</title>
      <dc:creator>HasanH47</dc:creator>
      <pubDate>Mon, 04 May 2026 07:09:32 +0000</pubDate>
      <link>https://dev.to/hasanh47/i-tried-to-compress-an-llm-by-545x-heres-what-happened-42kb</link>
      <guid>https://dev.to/hasanh47/i-tried-to-compress-an-llm-by-545x-heres-what-happened-42kb</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;A solo dev's journey questioning a 40-year-old assumption in deep learning&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Question That Started It All
&lt;/h2&gt;

&lt;p&gt;I was frustrated.&lt;/p&gt;

&lt;p&gt;VS Code was getting heavier on my laptop. Cursor wanted $20/month. The best AI agents were owned by 5 mega-corporations. As a developer in Indonesia, I sometimes felt we were perpetual consumers, never creators.&lt;/p&gt;

&lt;p&gt;So I asked Claude: "Can AI be smaller?"&lt;/p&gt;

&lt;p&gt;That conversation led somewhere unexpected. We started questioning the most fundamental assumption in deep learning since 1986:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Do weights have to be stored as matrices of numbers?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think about it. A human brain doesn't store information as numbers. A seed doesn't contain all the leaves of a tree inside it — a seed contains &lt;em&gt;instructions&lt;/em&gt; to grow leaves.&lt;/p&gt;

&lt;p&gt;What if AI weights could be &lt;strong&gt;grown&lt;/strong&gt; from a small seed when needed, instead of stored as massive matrices? A 30B model could fit on a smartphone. No cloud needed. No subscription. No billion-dollar hardware.&lt;/p&gt;

&lt;p&gt;I named the project &lt;strong&gt;WIJI&lt;/strong&gt; — "seed" in Javanese. The Javanese script: ꦮꦶꦗꦶ&lt;/p&gt;

&lt;p&gt;Slogan: &lt;em&gt;"Memaksimalkan yang minimal"&lt;/em&gt; — maximize the minimal.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I'm a solo developer. No PhD. No GPU cluster. Just a laptop, curiosity, and AI as a research collaborator.&lt;/p&gt;

&lt;p&gt;The plan:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take TinyLlama 1.1B (small enough to run on CPU)&lt;/li&gt;
&lt;li&gt;Try to reconstruct its weights using a tiny generator network&lt;/li&gt;
&lt;li&gt;Replace original weights with generated ones&lt;/li&gt;
&lt;li&gt;See if the model still works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If a single layer works, scale up. If not, learn why and pivot.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment 1: Single Matrix
&lt;/h2&gt;

&lt;p&gt;I started with one weight matrix: &lt;code&gt;o_proj&lt;/code&gt; of layer 0. It has 4.2 million parameters.&lt;/p&gt;

&lt;p&gt;I built a coordinate-based MLP generator: input is &lt;code&gt;(row, col)&lt;/code&gt; coordinates, output is the weight value at that position. The generator has only 164K parameters — &lt;strong&gt;25x smaller&lt;/strong&gt; than the target.&lt;/p&gt;

&lt;p&gt;Training was simple: sample random coordinates, predict their values, minimize MSE.&lt;/p&gt;
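&lt;p&gt;For readers who want the shape of the idea, here's a minimal PyTorch sketch of a coordinate-based generator and one training step. The class name, layer sizes, and the random stand-in target are illustrative, not the exact WIJI code:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Coordinate-based generator: (row, col) -> predicted weight value."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, coords):  # coords: (batch, 2), normalized to [0, 1]
        return self.net(coords).squeeze(-1)

def train_step(gen, target, optimizer, batch=4096):
    rows, cols = target.shape
    r = torch.randint(0, rows, (batch,))
    c = torch.randint(0, cols, (batch,))
    # Normalize coordinates so the MLP sees inputs in [0, 1]
    coords = torch.stack([r / (rows - 1), c / (cols - 1)], dim=-1).float()
    pred = gen(coords)
    loss = nn.functional.mse_loss(pred, target[r, c])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

gen = CoordMLP()
target = torch.randn(2048, 2048) * 0.02  # random stand-in for o_proj
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
loss = train_step(gen, target, opt)
```

&lt;p&gt;Swap the random &lt;code&gt;target&lt;/code&gt; for the real &lt;code&gt;o_proj&lt;/code&gt; tensor, loop &lt;code&gt;train_step&lt;/code&gt; a few thousand times, and that's essentially the whole Phase 0 training loop.&lt;/p&gt;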

&lt;p&gt;After 5000 steps, MSE settled at 0.000067. I reconstructed the full matrix and replaced the original in the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test prompt&lt;/strong&gt;: "What is the capital of Indonesia?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Original output&lt;/strong&gt;: "Indonesia's capital is Jakarta."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reconstructed output&lt;/strong&gt;: "The capital of Indonesia is Jakarta."&lt;/p&gt;

&lt;p&gt;It worked. Different words, same meaning. The model still functioned with weights compressed 25x.&lt;/p&gt;

&lt;p&gt;I was elated. &lt;strong&gt;Phase 0 looked promising.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment 2: Going Big
&lt;/h2&gt;

&lt;p&gt;If 1 layer works, why not all 22 layers?&lt;/p&gt;

&lt;p&gt;I added a layer embedding to the generator so it could handle multiple layers. Still the same 164K params, but it now needed to represent &lt;strong&gt;22 different weight distributions&lt;/strong&gt; — 92M parameters total.&lt;/p&gt;

&lt;p&gt;That's 545x compression.&lt;/p&gt;
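&lt;p&gt;The change is mechanically small: condition the same MLP on a learned per-layer embedding. An illustrative sketch (embedding size and hidden width are placeholders):&lt;/p&gt;

```python
import torch
import torch.nn as nn

class LayerCondMLP(nn.Module):
    """One generator for all layers: (layer_id, row, col) -> weight value."""
    def __init__(self, n_layers=22, emb_dim=16, hidden=256):
        super().__init__()
        self.layer_emb = nn.Embedding(n_layers, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(2 + emb_dim, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, layer_ids, coords):
        # Concatenate the learned layer embedding onto the (row, col) input
        x = torch.cat([self.layer_emb(layer_ids), coords], dim=-1)
        return self.net(x).squeeze(-1)

gen = LayerCondMLP()
ids = torch.randint(0, 22, (8,))
coords = torch.rand(8, 2)
out = gen(ids, coords)
```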

&lt;p&gt;I trained for 3000 steps. MSE settled at 0.000234 — only 3x higher than experiment 1. Should be fine, right?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: &lt;code&gt;"Ingatescripturecordialoisimoisequalifiesearchivedeastern Discogs, and"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Complete gibberish.&lt;/p&gt;

&lt;p&gt;This was my first lesson: &lt;strong&gt;MSE Loss is not a reliable predictor for LLM output quality.&lt;/strong&gt; The loss only got 3x worse, but the output collapsed entirely.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment A: Diagnostic
&lt;/h2&gt;

&lt;p&gt;Before scaling more, I needed to understand: was the failure because of multi-component (Q/K/V/O matrices in one layer) or multi-layer (across 22 layers)?&lt;/p&gt;

&lt;p&gt;So I tested: single generator handling all 4 attention components in just &lt;strong&gt;layer 0&lt;/strong&gt;. 4 matrices, but still 1 layer.&lt;/p&gt;

&lt;p&gt;MSE settled at 0.000400 — &lt;strong&gt;6x higher&lt;/strong&gt; than experiment 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: "The capital of Indonesia is Jakarta."&lt;/p&gt;

&lt;p&gt;Still functional. So the issue wasn't multi-component. The issue was multi-layer.&lt;/p&gt;

&lt;p&gt;Now I had a hypothesis: &lt;strong&gt;error compounds across layers&lt;/strong&gt;. Each layer's small error becomes the next layer's wrong input, which produces bigger errors, until the model collapses.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment B2: Microservices for Layers
&lt;/h2&gt;

&lt;p&gt;If one generator can't handle 22 layers, what if I built 22 separate generators? One per layer, each specializing.&lt;/p&gt;

&lt;p&gt;22 generators × 164K params = 3.6M total. Compression: 25x. Same as experiment 1.&lt;/p&gt;

&lt;p&gt;I trained each generator for 1000 steps. The training logs revealed something important:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;MSE Loss&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;0.000063&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;0.000203&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;0.000225&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;0.000244&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;0.000373&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Error increases monotonically from early layers to late layers.&lt;/strong&gt; Same generator capacity, same training budget, but layer 21 was 6x harder to fit than layer 0.&lt;/p&gt;

&lt;p&gt;This makes sense: late layers in transformers capture complex semantic patterns. Early layers capture simple syntax. A small generator can fit the latter but struggles with the former.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: "WHEREASPark. ."&lt;/p&gt;

&lt;p&gt;Failure. But informative failure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment B3: Adaptive Capacity
&lt;/h2&gt;

&lt;p&gt;If late layers need more capacity, give them more capacity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layers 0-7: 128 hidden dim (small)&lt;/li&gt;
&lt;li&gt;Layers 8-15: 256 hidden dim (medium)
&lt;/li&gt;
&lt;li&gt;Layers 16-21: 512 hidden dim (large)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Total params: 5.9M. Compression: 15x (a lower ratio, but the extra capacity should help).&lt;/p&gt;

&lt;p&gt;I also tripled training steps to 3000 per generator.&lt;/p&gt;

&lt;p&gt;Result for layer 21: MSE = 0.000366 (vs B2's 0.000373).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Almost identical.&lt;/strong&gt; 10x more capacity, 3x more training, virtually no improvement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output&lt;/strong&gt;: the same empty token repeated over and over (a degenerate loop)&lt;/p&gt;

&lt;p&gt;This was the most important finding of the entire project:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;There's a fundamental limit. MSE plateau at ~0.0003-0.0004 is independent of capacity and training time.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This phenomenon has a name in research: &lt;strong&gt;spectral bias&lt;/strong&gt;. Neural networks with ReLU/GELU activations have an inductive bias toward smooth functions. Transformer weights look like noise — high-frequency random distributions.&lt;/p&gt;

&lt;p&gt;Throwing capacity at the problem doesn't help because the architecture itself is wrong for this task.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment B4: The Cliff Edge
&lt;/h2&gt;

&lt;p&gt;I had 22 trained generators. Before giving up, I wanted to know: where exactly does the model fail?&lt;/p&gt;

&lt;p&gt;I ran a progressive swap test. Replace layers 0 to N-1 with generated weights. Test inference. Increment N. See what happens.&lt;/p&gt;
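&lt;p&gt;In code, the swap loop looks roughly like this. I use a toy stack of linear layers as a stand-in; the real version patches &lt;code&gt;model.model.layers[i].self_attn&lt;/code&gt; weights in the &lt;code&gt;transformers&lt;/code&gt; TinyLlama model, and &lt;code&gt;reconstruct&lt;/code&gt; here is a placeholder for querying a trained generator:&lt;/p&gt;

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for the 22 transformer layers (the real code swaps
# model.model.layers[i].self_attn.* in a transformers LlamaForCausalLM)
original = nn.ModuleList([nn.Linear(16, 16) for _ in range(22)])

def reconstruct(layer_id):
    """Placeholder for querying generator `layer_id` at every (row, col)."""
    return torch.randn(16, 16) * 0.02

def progressive_swap(model, n):
    """Replace layers 0..n-1 with generated weights, keep the rest original."""
    patched = copy.deepcopy(model)
    for i in range(n):
        with torch.no_grad():
            patched[i].weight.copy_(reconstruct(i))
    return patched

for n in (1, 3, 5, 22):
    patched = progressive_swap(original, n)
    # ...run inference on the patched model and inspect the output...
```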

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;N (layers replaced)&lt;/th&gt;
&lt;th&gt;Output&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;"The capital of Indonesia is Jakarta."&lt;/td&gt;
&lt;td&gt;✅ Perfect&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;"The capital of Ia is 10."&lt;/td&gt;
&lt;td&gt;⚠️ Partial collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;"" (empty)&lt;/td&gt;
&lt;td&gt;❌ Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;"" (empty)&lt;/td&gt;
&lt;td&gt;❌ Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;"" (empty)&lt;/td&gt;
&lt;td&gt;❌ Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;"" (empty)&lt;/td&gt;
&lt;td&gt;❌ Collapse&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;"Ingunsuretournalty. WHERE2..."&lt;/td&gt;
&lt;td&gt;❌ Gibberish&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Cliff edge between N=1 and N=3.&lt;/strong&gt; Sharp, not gradual. Phase transition.&lt;/p&gt;

&lt;p&gt;But the most counterintuitive finding: &lt;strong&gt;N=22 produces output, while N=5-16 produce empty strings.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When you replace some layers but keep others, the corrupted layers produce activations that are out of distribution for the remaining original layers. The mismatch causes probability collapse — the model produces nothing.&lt;/p&gt;

&lt;p&gt;When you replace ALL layers, the corruption is internally consistent. The model still produces gibberish, but it produces something.&lt;/p&gt;

&lt;p&gt;The lesson: &lt;strong&gt;internal consistency matters more than absolute correctness.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Actually Learned
&lt;/h2&gt;

&lt;p&gt;After 5 experiments and many hours of failure, here's what I have:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Validated empirically&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;✅ Weight matrices have significant redundancy (25-56x compression for a single layer)&lt;/li&gt;
&lt;li&gt;✅ MSE Loss is a misleading metric for LLM compression quality&lt;/li&gt;
&lt;li&gt;✅ A cliff edge phenomenon exists between N=1 and N=3 replaced layers&lt;/li&gt;
&lt;li&gt;✅ Capacity scaling doesn't solve spectral bias&lt;/li&gt;
&lt;li&gt;✅ Internal consistency &amp;gt; absolute correctness in deep networks&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Open questions for next phase&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Can Fourier features (NeRF-style positional encoding) overcome spectral bias?&lt;/li&gt;
&lt;li&gt;Are FFN layers easier to reconstruct than attention layers?&lt;/li&gt;
&lt;li&gt;Can output-aware loss (KL divergence) replace MSE?&lt;/li&gt;
&lt;li&gt;Does cliff edge shift with bigger models?&lt;/li&gt;
&lt;/ol&gt;
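&lt;p&gt;On question 1, the NeRF-style fix is mechanical: pass each coordinate through sine and cosine bands at geometrically spaced frequencies before the MLP sees it, which lets a smooth network represent high-frequency targets. An illustrative sketch (the number of frequency bands is a placeholder):&lt;/p&gt;

```python
import math
import torch
import torch.nn as nn

def fourier_features(coords, n_freqs=8):
    """NeRF-style positional encoding: coords in [0, 1] -> sin/cos bands."""
    bands = 2.0 ** torch.arange(n_freqs) * math.pi  # frequencies 2^k * pi
    angles = coords.unsqueeze(-1) * bands           # (batch, 2, n_freqs)
    feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return feats.flatten(start_dim=-2)              # (batch, 2 * 2 * n_freqs)

coords = torch.rand(4, 2)
feats = fourier_features(coords)
# The generator's first Linear then takes 32 inputs instead of 2
mlp_in = nn.Linear(2 * 2 * 8, 256)
hidden = mlp_in(feats)
```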

&lt;p&gt;&lt;strong&gt;Honest probability assessment&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;40% that the next phase finds something useful&lt;/li&gt;
&lt;li&gt;25% that we get a working prototype&lt;/li&gt;
&lt;li&gt;5-10% that this leads to a genuine breakthrough&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But 10% × "fundamentally change AI deployment" = high expected value for a solo dev with AI as collaborator.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I'm Sharing This
&lt;/h2&gt;

&lt;p&gt;I could have buried these failures and only shared the success. That's the temptation.&lt;/p&gt;

&lt;p&gt;Instead, I'm publishing everything: code, failures, insights, and methodology. Why?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Negative results are valuable.&lt;/strong&gt; Someone else attempting this will save weeks knowing where the cliff edge is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Open source is legacy.&lt;/strong&gt; Even if I stop maintaining this, the experiments stay accessible forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solo dev + AI is a new research methodology.&lt;/strong&gt; I want to demonstrate what's possible. Other developers in Indonesia, in developing countries, in their bedrooms — they can ask hard questions and explore them. They don't need to wait for FAANG employment to contribute to AI research.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Centralization is the enemy.&lt;/strong&gt; AI is concentrating into the hands of 5 corporations. If we accept that, our future is dystopian. WIJI is a contrarian bet — that intelligence can be made minimal, affordable, and owned by everyone.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Phase 1 plans:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fourier Features experiment&lt;/strong&gt; — NeRF research suggests this can overcome spectral bias&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FFN layer test&lt;/strong&gt; — different weight distribution, possibly easier to compress&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming inference system&lt;/strong&gt; — a pragmatic design that accepts the N=1 limit but still works&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust port&lt;/strong&gt; — for proper performance benchmarking&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'll publish results as I go. Failures and successes both.&lt;/p&gt;

&lt;p&gt;If you're interested in this kind of research, the repo is fully open:&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;github.com/sangkan-dev/wiji-experimental&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Critique welcome. Collaboration welcome. Even philosophical disagreement welcome.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Mari kita lebih menggila di dunia yang udah gila ini."&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Let's get crazier in this already-crazy world.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;strong&gt;About the author&lt;/strong&gt;: HasanH47, a DevOps Engineer based in Yogyakarta, Indonesia. Building products at the intersection of local context and frontier technology. The project lives under the &lt;a href="https://github.com/sangkan-dev" rel="noopener noreferrer"&gt;Sangkan&lt;/a&gt; organization.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If this article gave you something to think about, consider following for updates on Phase 1 results. And if you're working on anything related, please reach out.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
