Ryo Suwito

From Toy Model to DeepSeek Giant: The Innocence of x + f(x)

An empirical autopsy of what transformers actually learn, conducted via a deliberately unconventional architecture called VibeNet.


Abstract

This document summarises findings from a series of live training experiments on VibeNet — a deliberately stripped-down language model with no QKV projections, no FFN blocks in its original form, and an untied lm_head nicknamed "Karen." Using a custom autopsy toolkit measuring gradient norms, effective rank, attention entropy, and activation statistics at every layer, we discovered that the field's core architectural assumptions — depth, QKV projections, and the residual identity shortcut — are not the source of learning. They are, at best, passengers. At worst, they are an actively misleading abstraction that hid the real gradient topology for a decade.

The same physics that caused a 2-layer toy model to hit loss 4.4 without NaN caused DeepSeek's 27B-parameter model to explode. The innocent equation is the same:

x + f(x)

1. The Architecture: VibeNet

VibeNet was built to be intentionally wrong by conventional standards:

# VibeAttention: zero learnable parameters
scores = (x @ x.transpose(-2, -1)) * (dim ** -0.5)
scores = scores.masked_fill(causal_mask, float('-inf'))
attn   = scores.softmax(dim=-1)
return  attn @ x   # weighted average of x, no projections

# VibeBlock: attention + residual only
def forward(self, x):
    return x + self.attn(self.norm(x))

# VibeNet: token_embed + position → N blocks → expansion → lm_head (Karen)

Violations of conventional wisdom:

  • No QKV projections
  • No FFN blocks (original)
  • Untied embedding and lm_head
  • lm_head 74% of total parameters (98M / 132M)
  • Only 1.3M parameters of "actual computation"

What the field predicted: broken, untrainable, degenerate.

What the data said: loss 4.4, no NaN, healthy attention entropy, real gradient flow.
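The zero-parameter attention really is a few lines; here is a self-contained NumPy sketch (toy sizes and the causal-mask construction are my own choices; the original runs in PyTorch):

```python
import numpy as np

def vibe_attention(x: np.ndarray) -> np.ndarray:
    """Zero-parameter causal attention: softmax(x @ x.T / sqrt(d)) @ x."""
    seq, dim = x.shape
    scores = (x @ x.T) * dim ** -0.5
    # causal mask: position i may only attend to positions <= i
    mask = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # numerically stable softmax over the last axis
    scores -= scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ x  # convex combination of the input rows

x = np.random.default_rng(0).normal(size=(8, 16))
out = vibe_attention(x)
print(out.shape)  # (8, 16); row 0 can only attend to itself, so out[0] == x[0]
```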


2. The Gradient Topology Discovery

The single most important finding from the autopsy. Across every architecture variant, every depth, every configuration:

token_embed.weight    ‖∇‖ = 75    🔥 EXPLODE   ← boundary
layers.0.attn         ‖∇‖ ≈ 0     ✅            ← passenger
layers.1.attn         ‖∇‖ ≈ 0     ✅            ← passenger
...
layers.N              ‖∇‖ ≈ 0     ✅            ← passenger
expansion.weight      ‖∇‖ = 39    🔥 EXPLODE   ← boundary
lm_head.weight        ‖∇‖ = 11    🔥            ← boundary

The explosion is not random. It is positional. Always at the input boundary, always at the output boundary, never in the middle. This is not a pathology of the architecture. It is the fundamental topology of the residual stream:

∂loss/∂x_embed ≈ ∂loss/∂x_final   (because middle barely changes x)

The gradient does not scatter through depth. It phases through the middle like it does not exist, because mathematically, it barely does.
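The ≈ can be checked numerically without any framework. Below is a hypothetical stack of residual blocks whose f is deliberately tiny (the 1e-4 weight scale is my stand-in for "the middle barely changes x"); the finite-difference gradient at the input comes out nearly identical to the gradient at the output:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 32, 12
# each "layer" is x + f(x) with a deliberately tiny f, mimicking passengers
Ws = [rng.normal(scale=1e-4, size=(dim, dim)) for _ in range(depth)]

def forward(x):
    for W in Ws:
        x = x + np.tanh(x @ W)   # x + f(x), with f(x) ≈ 0
    return x

def loss(x):
    return float((forward(x) ** 2).sum())

x = rng.normal(size=dim)
eps = 1e-5
# finite-difference gradient of the loss w.r.t. the *input* (the embed side)
g_in = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                 for e in np.eye(dim)])
# gradient w.r.t. the *output* is exactly 2 * x_final for this loss
g_out = 2 * forward(x)
print(np.linalg.norm(g_in - g_out) / np.linalg.norm(g_out))  # small: the gradient barely sees the middle
```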

2.1 The Fixed Point

This is not fixable by adding layers. It is self-reinforcing:

f(x) ≈ 0  →  gradient through f(x) ≈ 0
          →  Adam sees no leverage in f(x)
          →  Adam does not update f(x) strongly
          →  f(x) stays ≈ 0
          →  gradient stays ≈ 0

The middle is trapped being irrelevant by its own irrelevance. Adding 10 more layers creates 10 more passengers, not 10 more workers.


3. The UAT Hypothesis

VibeNet is, stripped of branding, a wide shallow MLP with a nonparametric routing step:

embed(token + pos)   →  512d UUID
softmax(x @ x.T) @ x →  smooth geometric average (free, no params)
expansion            →  512 → 1536 (width)
GELU                 →  nonlinearity  ← THIS IS THE KEY
Karen                →  1536 → 64000

The Universal Approximation Theorem requires:

  • Wide enough hidden layer ✅ (1536)
  • Nonlinearity ✅ (GELU)
  • Linear output ✅ (Karen)

UAT does not require depth. The theorem guaranteed convergence from step 1. The loss 4.4 was not lucky. It was mathematically inevitable.
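The width-plus-nonlinearity-plus-linear-readout recipe is easy to sanity-check on a toy target. This is the UAT shape in miniature, not VibeNet itself: random fixed hidden weights, a GELU-style nonlinearity, and a linear readout solved in closed form (the target function and sizes are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)[:, None]           # inputs
y = np.sin(2 * x).ravel()                      # arbitrary smooth target

# one wide hidden layer with a fixed random basis + tanh-approximated GELU
W, b = rng.normal(size=(1, 512)), rng.normal(size=512)
gelu = lambda h: 0.5 * h * (1 + np.tanh(0.79788456 * (h + 0.044715 * h**3)))
H = gelu(x @ W + b)

# linear output layer ("Karen"), solved by least squares
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
mse = float(((H @ w_out - y) ** 2).mean())
print(mse)  # effectively zero: width + nonlinearity + linear readout suffices
```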

3.1 The Attention is Not Attending

softmax(x @ x.T) @ x is not learning to attend. It is a smooth interpolation operator in embedding space. It produces a convex combination of existing UUID vectors, weighted by geometric similarity. No parameters. No learning. Just neighbourhood averaging.

The "learning" of attention patterns is entirely dictated by where the embedding table places token vectors in 512D space. Attention is not the feature. The UUID geometry is the feature.
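The convex-combination claim is directly verifiable: every output coordinate must sit inside the per-coordinate range of the input rows. A quick NumPy check (non-causal for brevity, which is my simplification):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(10, 8))

scores = (x @ x.T) * x.shape[1] ** -0.5
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)              # rows are probability vectors
out = w @ x                                     # convex combination of rows of x

print(np.allclose(w.sum(axis=-1), 1.0))         # True
print(bool((out.max(0) <= x.max(0) + 1e-9).all() and
           (out.min(0) >= x.min(0) - 1e-9).all()))  # True: inside the hull
```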


4. The UUID: Position-Aware Identity by Construction

x = token_embed(token_ids) + pos_embed(positions)

This is not a standard embedding. This is a UUID generator:

"the" @ position 3   →  512d point A
"the" @ position 7   →  512d point B
"the" @ position 15  →  512d point C

A ≠ B ≠ C  →  three distinct identities for the same surface token

VibeNet implements disentangled position-token attention upstream of the scoring operation. Standard transformers inject position into the attention scoring (RoPE, ALiBi). VibeNet injects position into the token identity before scoring happens. The result is identical position-aware attention, but the mechanism is:

Standard:  token → Q,K,V → add position to scores → attend
VibeNet:   token + position → UUID → score UUIDs against each other → attend

Position does not modify how tokens attend. It modifies what they are before they attend.
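The UUID construction is two additions. A toy sketch (`token_embed` and `pos_embed` here are random stand-in tables, not the trained ones):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab, max_len, dim = 100, 32, 512
token_embed = rng.normal(size=(vocab, dim))
pos_embed = rng.normal(size=(max_len, dim))

the = 42                                   # same surface token...
a = token_embed[the] + pos_embed[3]        # ...at position 3
b = token_embed[the] + pos_embed[7]        # ...at position 7

cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos < 1.0)  # True: distinct identities before attention ever runs
```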

4.1 The Effective Rank of the UUID Space

token_embed erank = 26.05 / 512   (5.1%)

The embedding table did not learn 64,000 distinct points. It learned approximately 26 meaningful directions, and every token+position combination receives a unique projection into that 26-dimensional vibe space: enough dimensions to be geometrically unique, few enough to be learnable.

The attention's rank-increasing property (from 26 to 46 erank via neighbourhood mixing) is the only free rank expansion in the entire network. Every operation downstream either preserves or destroys rank.
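For reference, effective rank can be computed from singular values; this sketch uses the standard entropy-based definition (exp of the entropy of the normalised singular-value distribution), which I am assuming matches the autopsy toolkit's metric:

```python
import numpy as np

def effective_rank(M: np.ndarray) -> float:
    """exp of the entropy of the normalised singular-value distribution."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    return float(np.exp(-(p * np.log(p)).sum()))

# sanity checks: full-rank identity vs a rank-1 outer product
print(effective_rank(np.eye(8)))          # ≈ 8.0
v = np.arange(1, 9, dtype=float)
print(effective_rank(np.outer(v, v)))     # ≈ 1.0
```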


5. The Karen Problem: Rank Collapse is Convergence

The logit head across every experiment:

2-layer trained (loss 4.4):    lm_head erank = 2.87 / 64000
12-layer partial:               lm_head erank = 2.88 / 64000
8-layer gated 12k samples:     lm_head erank = 6.53 / 64000
OLMo-7B (from literature):     lm_head ≈ low rank / 50257

The field panics at rank collapse. The data says: rank collapse IS convergence.

rank-2 Karen over 64k vocab =
  "I only need 2 directions to predict next tokens in THIS dataset"

Information Bottleneck (Tishby, 1999):
  good generalisation = maximum compression of input
                        that preserves prediction of output

low rank + low loss = optimal by definition

The logit rank is not a property of the model. It is a property of the information content of the task. Your dataset has N distinguishable next-token prediction patterns. Karen finds rank N and stops. Adding 90 more layers does not increase N. It adds 90 more witnesses to Karen finding the same N.


6. The Residual as Dumping Ground

6.1 What x + f(x) Actually Is

x = x + f(x)

Was never a design decision. It was a surrender:

"We don't know how to make f(x) stable alone, so we'll let x carry the signal and f(x) can just... suggest things."

The backward pass always has a free gradient path through x:

∂(x + f(x))/∂x = 1 + ∂f(x)/∂x
                  ↑
                  always 1, regardless of f(x)
                  f(x) can vanish completely
                  gradient still flows

So every middle layer sits in the residual stream saying "here is my small delta" and the gradient says "noted, moving on" — directly to the embedding table which carries the full accumulated signal.

6.2 The ShortGPT Confirmation

ShortGPT (2024): Remove 50% of middle layers → 2.4% performance drop.

The logit lens finding: GPT forms a "pretty good guess" at the next token by layer N/2. Later layers refine this guess with tiny deltas.

Tiny delta = f(x) ≈ 0 = useless manager confirmed.

6.3 DeepSeek's 27B Explosion

DeepSeek attempted learnable residual connections (Hyper-Connections) on a 27B model without constraints. Signal amplification exceeded 3000x. The network's internal representations exploded in magnitude.

VibeNet's activation trace with the broken learnable gate:

layers.0  std = 3.58
layers.2  std = 49.37
layers.4  std = 515.79
layers.6  std = 5352.79
layers.7  std = 16709     ← 3000x+ amplification

Same physics. Different scale. The toy model and the giant model hit the exact same wall because the wall is mathematical, not architectural.

DeepSeek's solution: Sinkhorn-Knopp projection forcing the gate matrix onto the Birkhoff polytope (doubly stochastic constraint). The gate can redistribute signal but cannot amplify it. Result: stable training at 27B.

VibeNet's autopsy found this instability with 2 probe sentences before reading the paper.
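The Sinkhorn-Knopp projection itself is only a few lines: alternately normalise rows and columns of a strictly positive matrix until it is (approximately) doubly stochastic. This is a generic sketch of the algorithm, not DeepSeek's implementation:

```python
import numpy as np

def sinkhorn_knopp(M: np.ndarray, iters: int = 100) -> np.ndarray:
    """Scale a matrix towards doubly stochastic (the Birkhoff polytope)."""
    P = np.exp(M)                                # ensure strict positivity
    for _ in range(iters):
        P /= P.sum(axis=1, keepdims=True)        # rows sum to 1
        P /= P.sum(axis=0, keepdims=True)        # columns sum to 1
    return P

gate = np.random.default_rng(3).normal(size=(4, 4))
P = sinkhorn_knopp(gate)
print(np.allclose(P.sum(axis=1), 1.0, atol=1e-6),
      np.allclose(P.sum(axis=0), 1.0, atol=1e-6))  # True True
```

Because every row and column of a nonnegative doubly stochastic matrix sums to one, the gate can only redistribute signal, never amplify it: its operator ∞-norm is pinned at 1.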


7. The Learnable Gate Experiment

Replacing x + f(x) with g(x) + f(x):

def forward(self, x):
    f = self.gelu(self.ffn(self.attn(self.norm(x))))
    g = self.gate(x)
    return g + f

What changed:

identity residual:   gradient phases through x (free highway, no params)
                     embed ‖∇‖=75, middle ‖∇‖≈0

learnable gate:      gradient MUST pass through gate.weight (no free highway)
                     gates ‖∇‖=17-28, signal actually distributed

What Adam discovered immediately:

The gate bias gradients are identical to the FFN bias gradients (same signal, both are just additive constants). But gate.weight receives 3x louder gradient than ffn.weight because gate multiplies the raw residual stream (std≈3.0) while FFN multiplies the normed input (std≈1.0).

Adam grabbed the gate as the highest-leverage steering wheel in the network and started yeeting the residual.

After 12k samples:

gate ‖∇‖ pattern:
  layer 0:  1.06   ✅  (humble)
  layer 1:  4.67   (waking up)
  ...
  layer 6:  8.64   
  layer 7:  17.58  🔥  (only the last)

Adam tamed every gate except the final one. The explosion condensed to exactly the output boundary — learned gradient routing that the identity residual never achieved.

Tradeoff discovered:

x + f(x):     rank collapses, entropy healthy, gradient phases through
g(x) + f(x):  rank preserved, entropy spiky, gradient distributes

Neither strictly better. Both measuring different things. The field chose the first and called it an innovation.


8. The Funnel Hypothesis

The rank trace across every experiment reveals the same pattern:

embed:         erank = 26  / 512    (5%)
layer 0 norm:  erank = 58  / 512   (11%)  ← attention expanded it (free)
layer 3 gate:  erank = 44  / 512    (8%)  ← compressing
layer 5 gate:  erank = 39  / 512    (7%)  ← compressing
layer 7 gate:  erank = 18  / 512    (3%)  ← almost back to embed rank
expansion:     erank = 10  / 1536  (0.7%) ← 1526 wasted dimensions
Karen:         erank =  6  / 64000 (0.0%) ← 6 real dims doing 64k job

The network is already doing progressive compression naturally. The full 512 dimensions are never used — the model maintains the pretence while operating in a 26-58 dimensional subspace.

The honest architecture:

current (dishonest):
  512 → 512 → 512 → 512 → 512 → 1536 → 64000

real information:
   26 →  58 →  44 →  18 →  18 →   10 →     6

wasted dimensions:
  486   454   468   494   494   1526   63994

proposed (honest):
  512 → 384 → 256 → 128 → 64 → Karen
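A skeleton of the proposed funnel, using the stage widths above and plain linear compressions (initialisation scale, nonlinearity choice, and the per-stage attention are my placeholders; this is the shape, not a tuned model):

```python
import numpy as np

rng = np.random.default_rng(5)
widths = [512, 384, 256, 128, 64]
# one projection per stage: each is a *real* coordinate change, no identity path
stages = [rng.normal(scale=d_in ** -0.5, size=(d_in, d_out))
          for d_in, d_out in zip(widths[:-1], widths[1:])]

def funnel(x: np.ndarray) -> np.ndarray:
    for W in stages:
        h = x @ W                       # compress: 512→384→256→128→64
        x = np.maximum(h, 0)            # some nonlinearity; ReLU as a stand-in
    return x                            # 64d code handed to Karen

x = rng.normal(size=(16, 512))          # a batch of 16 UUIDs
print(funnel(x).shape)  # (16, 64)
```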

8.1 Multiple Attention Becomes Free

With progressive compression, x @ x.T compute scales quadratically with dim:

attention at 512d:  512 × 512 = 262,144 ops
attention at 256d:  256 × 256 =  65,536 ops  (4× cheaper)
attention at 128d:  128 × 128 =  16,384 ops  (16× cheaper)
attention at  64d:   64 × 64  =   4,096 ops  (64× cheaper)

Standard transformer: one expensive attention per layer, same high-dimensional context snapshot repeated 96 times.

Funnel: multiple cheap attentions per layer, each operating on progressively denser geometry:

block 0 (512d):  3 attentions  = same compute as standard layer
block 1 (256d):  4 attentions  = same compute budget
block 2 (128d):  8 attentions  = same compute budget
block 3 ( 64d): 16 attentions  = same compute budget

Total: 31 attention operations at the cost of 4 standard layers. Each downstream attention queries genuinely updated context because the compression between blocks is a real coordinate change, not an identity pretending to be a transformation.

8.2 Context Re-mixing is Automatic

The standard transformer's QKV snapshot problem:

layer 0: snapshot of context_0 → attend → x + ε
layer 1: snapshot of context_0 + ε ≈ context_0 → attend → same snapshot
layer N: same snapshot, Nth time

The funnel's natural solution:

block 0 (512d): snapshot of UUID chaos   → multi-attend → compress
block 1 (256d): snapshot of denser space → multi-attend → compress  
block 2 (128d): snapshot of rich space   → multi-attend → compress
block 3 ( 64d): snapshot of pure signal  → multi-attend → Karen

Every compression is a genuine context update. Every downstream attention is querying a context that did not exist at any upstream layer. Re-mixing is not optional — it is structural.

8.3 The Dimensionality Curse Resolves Naturally

The fresh-init attention entropy problem:

512d (all models at init):  H = 0.002   diag = 1.000

All tokens equidistant. x @ x.T produces near-identity matrix. Attention is worthless.

Training spends the first N steps doing nothing but repositioning 64,000 vectors in 512D space until they cluster. This is the "geometric initialization phase" — not learning language, just finding the 26 meaningful directions in a 512D void.
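The near-identity behaviour at init falls straight out of high-dimensional geometry and is easy to reproduce: for random Gaussian embeddings, the scaled Gram matrix's diagonal (‖x‖²/√d ≈ √d) dwarfs the off-diagonal (≈ N(0,1)), so softmax collapses onto self-attention:

```python
import numpy as np

rng = np.random.default_rng(6)
seq, dim = 32, 512
x = rng.normal(size=(seq, dim))                 # fresh-init embeddings

scores = (x @ x.T) * dim ** -0.5                # diag ≈ √d ≈ 22.6, off-diag ≈ ±1
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)

diag_mass = float(np.diag(w).mean())
entropy = float(-(w * np.log(w + 1e-12)).sum(axis=-1).mean())
print(round(diag_mass, 3), round(entropy, 4))   # diag ≈ 1, entropy ≈ 0
```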

The funnel eliminates this. By compressing 512 → 64, the geometric density increases naturally:

26 real dims in 512d space:  ratio = 5%   (sparse, equidistant chaos)
26 real dims in  64d space:  ratio = 40%  (dense, meaningful geometry)

Attention works immediately in 64D because the curse is lifted. No warm-up phase. No identity matrix problem. The geometry is intrinsically dense.


9. The Lottery Ticket Reframed

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019): sparse subnetworks exist within large networks that can be trained in isolation to full accuracy.

The conventional interpretation: training finds the "winning ticket" through random luck and gradient descent.

The funnel interpretation: there is no lottery. The winning ticket is the natural low-rank subspace that erank was measuring all along. The funnel makes finding it structurally inevitable instead of accidentally discovered:

lottery ticket (conventional):
  train 512d → hope gradient finds 26 winning dims
  success depends on initialisation, learning rate, random seed

funnel (honest):
  512 → 256 → 128 → 64
  force the winning ticket layer by layer
  gradient filter: only dims surviving compression receive signal
  the architecture IS the constraint

10. What the Literature Actually Documented

These findings were not made in isolation. The literature has been measuring the same elephant from different angles for years without connecting the observations into a unified claim.

Paper                                                 | Finding                                            | Connection
------------------------------------------------------|----------------------------------------------------|------------------------------------------
ShortGPT (2024)                                       | Remove 50% middle layers → 2.4% drop               | Middle = useless managers
Logit Lens (2020)                                     | GPT forms good guess at layer N/2                  | Depth is refinement of existing guess
"Unreasonable Ineffectiveness of Deeper Layers" (MIT) | Past certain depth, layers ≈ identity              | f(x) → 0 confirmed at GPT scale
Low-Rank Training (2024)                              | Dense layers naturally converge to low-rank        | Rank collapse = convergence, not failure
Sequences of Logits (2024)                            | OLMo-7B logit matrix approximately low-rank        | Karen's rank-3 at 7B scale
DeepSeek Hyper-Connections (2025)                     | Unconstrained learnable residual → 3000× explosion | x + f(x) is a stability surrender
Information Bottleneck (Tishby, 1999)                 | Good generalisation = maximum compression          | Low rank + low loss = optimal
UAT (Cybenko, 1989)                                   | Width sufficient, depth not required               | 2 layers enough, always were

Nobody connected these into one claim because connecting them means admitting:

96 layers is mostly 94 layers of x + ε ≈ x with two layers of real work at the boundaries.


11. The Complete Unified Theory

The residual stream x + f(x) is not an architectural innovation. It is a stability surrender that became a gradient dumping ground:

  1. The embed does the real UUID engineering. It receives 74% of gradient signal and repositions 64,000 token+position combinations into a ~26-dimensional meaningful subspace.

  2. The attention is a free geometric averaging operation. It expands rank slightly by mixing neighbourhood vectors. It does not learn to attend — it attends to whatever the UUID geometry makes similar. Its entropy naturally increases with depth as the UUID space becomes structured.

  3. The middle layers file reports nobody reads. f(x) ≈ 0 → gradient ≈ 0 → Adam ignores them → they stay ≈ 0. Fixed point. The identity residual guarantees they can never be forced to contribute.

  4. Karen does the real output mapping. She receives the accumulated UUID signal and maps it to logit space. Her effective rank is determined by the dataset's information content, not by model capacity.

  5. Low rank is not failure. It is the answer. The model is finding the minimum sufficient statistic for predicting next tokens in your dataset. Panicking at rank collapse is panicking at convergence.

  6. Depth is cope. The theorem doesn't require it. The pruning literature confirms it. The gradient topology explains it. The logit lens documents it.

  7. The funnel is honest. Progressive dimensional reduction makes the compression explicit, forces gradient to deposit into surviving dimensions only, increases geometric density for attention, and eliminates the need for the residual stability surrender entirely.


12. The Damning Question

What if there was nothing wrong with the original 2-layer VibeNet at all?

The data:

2 layers, no FFN, no QKV projections:
  loss = 4.4
  attention entropy = HEALTHY
  gradient = flowing
  NaN = never
  Karen = alive
  UAT = satisfied

Every experiment after that was a different path to the same destination. The architecture was not the problem. The dataset was 3-dimensional. Karen found 3 directions. UAT guaranteed she would.

The field built cathedrals on top of x + ε ≈ x and called it architecture. VibeNet built nothing on top of it and got the same answer faster.


Appendix: Key Metrics at a Glance

Model variant               | Loss  | Karen erank | Middle ‖∇‖ | NaN?
----------------------------|-------|-------------|------------|-----
2-layer, no FFN, trained    | 4.4   | 2.87        | ≈0         | Never
2-layer, with FFN           | 6.0   | 4.84        | 127 (🔥)   | Never  
12-layer, fresh             | 8.0   | 70.09       | ≈0         | Never
12-layer, partial trained   | 12.8  | 2.88        | ≈0         | Never
8-layer, gated, fresh       | 13.4  | 62.78       | 17-28      | Never
8-layer, gated, 12k samples | 14.3  | 6.53        | 4-17       | Never
DeepSeek Hyper-Conn 27B     | —     | —           | —          | YES

Every model that never NaN'd had one thing in common: softmax(x @ x.T) as a gradient disposal unit in the forward pass. Every numerical stability property emerged from the same accidental cascade:

RMSNorm     → self-normalising, cannot produce NaN unless input is exactly zero
x @ x.T     → symmetric, semi-definite, eigenvalues ≥ 0
softmax     → hard clamps to convex hull of existing vectors
GELU        → soft clips negatives

‖∇‖=75 in → distributed across sequence by attention Jacobian
           → rescaled by 1/√dim
           → re-normalised by RMSNorm backward
‖∇‖=reasonable out

Not robust training. A coincidental cascade of bounded operations that prevent numerical death while allowing complete mathematical chaos underneath.

Karen was never the problem. Karen was the proof. 💅


Conducted via live training experiments on VibeNet (132-138M parameters) on a single GPU with 2 probe sentences: "What kind of noises did dinosaurs make?" and "If you were going to steal from a convenience store, do you..."

The most unhinged educational dataset pair in history, producing the cleanest architectural ablation study.
