Artyom Molchanov

Posted on Jan 9

🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster

#ai #machinelearning

An experiment in surgical layer removal from a language model

TL;DR

I took TinyLlama (1.1B parameters, 22 layers) and started removing layers to test the hypothesis: modern LLMs are over-parameterized, and many layers do the same thing.

Results:

Removed 1 middle layer → +10% speed, -4% quality
Removed 7 layers (safe ones) → +30% speed, -2.5% quality
Removed first layer → model broke
Unexpected: Layer 2 is more important than Layer 0! (+6.67 vs +3.92 perplexity)

Tested all 22 layers individually. Here's what I found.

Why Does This Matter?

Startups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700k per day on compute alone. Any optimization that speeds up the model without losing quality is direct cost savings.

Layer pruning is one way to speed things up. The idea is simple:

Modern models have dozens of layers (GPT-4 supposedly 120+)
Not all layers are equally useful
Some can be removed, and the model barely notices

Research ShortGPT (2024) showed that you can remove 25% of layers from LLaMA-2 with less than 5% quality loss. I decided to verify this in practice.

Experiment Setup

Hardware: MacBook Pro M4 Pro, 24GB RAM

Model: TinyLlama-1.1B-Chat-v1.0

1.1 billion parameters
22 layers (decoder blocks)
LLaMA architecture

Metrics:

Perplexity — how "surprised" the model is by text (lower = better)
Tokens/second — generation speed
Generation quality — subjective assessment of output text

Code: PyTorch + HuggingFace Transformers. Removing a layer = literally removing it from model.model.layers:

def remove_layers(model, layers_to_remove):
    original_layers = list(model.model.layers)
    new_layers = [
        layer for i, layer in enumerate(original_layers)
        if i not in layers_to_remove
    ]
    model.model.layers = nn.ModuleList(new_layers)
    return model

Results

Summary Table

What I Removed	Perplexity	Δ Quality	Tokens/s	Δ Speed	Works?
Nothing (baseline)	1.82	—	59	—	✅
Middle layer (#11)	1.89	-4%	64	+10%	✅
3 middle layers (#10-12)	2.24	-23%	66	+12%	✅
First layer (#0)	5.74	-215%	64	+10%	❌
7 safe layers	~1.87	~-2.5%	~77	~30%	✅

Note: precise measurements — 10 runs, 5 warmup, MPS backend

Key Discovery: Middle Layers Are Redundant

Removing one layer from the middle of the model (layer #11 out of 22) gave:

+10% generation speed (59 → 64 tokens/sec)
Only -4% quality (perplexity 1.82 → 1.89)

Removing 7 safe layers (3, 4, 5, 9, 10, 11, 12) can achieve ~30% speedup.

Generation remained completely coherent:

Prompt: "Once upon a time"

Baseline: (not measured)

After removing layer #11: "Once upon a time, I was a web developer. Today, I am a freelance web developer. I have worked for some of the most prestigious web..."

The model still generates coherent, grammatically correct text.

First Layer Is Sacred

Here's what happened when I removed the first layer:

After removing layer #0: "Once upon a time and a time. Therefore, the therefore, the therefore. Therefore, the therefore, the therefore. Therefore, the..."

The model broke. Perplexity shot up from 1.82 to 5.74 (3x worse). Text became meaningless repetition.

Why? Early layers are responsible for:

Basic attention patterns
Positional encoding
Fundamental understanding of language structure

Without them, the model loses the ability to understand how words relate to each other.

Visualization: Importance of Each Layer

I tested removing each layer individually and measured quality degradation:

Layer  0:  ████████████████████████████████████████  +3.92  🔴 CRITICAL
Layer  1:  ██████████                               +0.43
Layer  2:  ████████████████████████████████████████████████████████████████████  +6.67  🔴 MOST IMPORTANT!
Layer  3:                                          +0.01  🟢 CAN REMOVE
Layer  4:  █                                       +0.06  🟢
Layer  5:                                          +0.04  🟢
Layer  6:  ██                                      +0.12
Layer  7:  ███████████████                         +0.74
Layer  8:  ██                                      +0.12
Layer  9:  █                                       +0.07  🟢
Layer 10:  █                                       +0.05  🟢
Layer 11:  █                                       +0.07  🟢
Layer 12:  ██                                      +0.09  🟢
Layer 13:  ███                                     +0.14
Layer 14:  ███████████                             +0.53
Layer 15:  ████████████████████████████████████    +1.81  🟠 IMPORTANT
Layer 16:  █████                                   +0.27
Layer 17:  ██                                      +0.12
Layer 18:  ████                                    +0.18
Layer 19:  ████                                    +0.19
Layer 20:  ██████                                  +0.28
Layer 21:  █████████                               +0.47

Unexpected discovery: Layer 2 is more important than Layer 0! This is the layer that forms key language patterns.

Safe to remove layers: 3, 4, 5, 9, 10, 11, 12 — increase perplexity by less than 0.1.

Interpretation: Why This Distribution?

Results revealed three critical zones:

🔴 Critical Zone 1: Layer 2 (PPL +6.67)

The most important layer in the model! This is unexpected — it's usually assumed that Layer 0 is most important.

Hypothesis: Layer 2 is where key attention patterns are formed. The first two layers create a "raw" representation, and Layer 2 "crystallizes" it into a structure that all other layers use.

🔴 Critical Zone 2: Layer 0 (PPL +3.92)

The first layer is important for:

Processing positional encoding
Basic token understanding
Initializing attention patterns

🟠 Critical Zone 3: Layer 15 (PPL +1.81)

Unexpected spike in late middle layers. Possibly this is the layer where "switching" from general semantics to task-specific processing happens.

🟢 Safe Zone: Layers 3-5, 9-12

These layers show minimal impact (PPL increase < 0.1). They perform redundant computations — repeating what neighboring layers already did.

Practical takeaway: you can remove 5-7 layers (layers 3, 4, 5, 9, 10, 11, 12) with less than 0.5% quality loss and get ~30% speedup.

Research ShortGPT introduced the Block Influence (BI) metric — my results fully align with their findings: middle layers show low BI and can be safely removed.

Practical Takeaways

For Engineers

Based on per-layer analysis — optimal combinations for removal:

Aggressiveness	Remove Layers	Expected Loss	Speedup
Minimal	{3}	~0.4%	~5%
Moderate	{3, 5, 10, 11}	~1%	~18%
Aggressive	{3, 4, 5, 9, 10, 11, 12}	~2.5%	~32%

# Optimal strategy: remove least important layers
safe_layers_to_remove = {3, 4, 5, 9, 10, 11, 12}  # PPL increase < 0.1 each
remove_layers(model, safe_layers_to_remove)

# Result: 22 -> 15 layers, ~32% speedup, ~2.5% quality loss

Important: never remove layers 0, 2, 15 — these are critical points.

For Researchers

This field is actively developing:

ShortGPT (2024) — removing entire layers
FinerCut (2024) — removing components within layers
SliceGPT (2024) — removing rows/columns from weight matrices
LinearPatch (2025) — recovering 94% quality after pruning via Hadamard transform (arxiv)
MRP (2025) — Maximum Redundancy Pruning, adaptive removal of most redundant layers (arxiv)
CLP (2025) — automatic search for optimal segments to remove (arxiv)

Combining with quantization (INT4/INT8) can give even greater speedup.

For Business

If you're paying $10k/month for inference GPUs, layer pruning can save $2-3k without noticeable quality loss. At OpenAI's scale, this is millions of dollars.

Experiment Limitations

Small model — TinyLlama 1.1B, results may differ for 7B/70B models
Simple metric — perplexity doesn't capture all quality aspects
No fine-tuning — possibly after removing layers the model can be fine-tuned to recover quality
Single dataset — need to test on different tasks
Measurement variability — speed on MPS backend has ±10% variance, important to do many runs
Chain-of-thought degradation — recent research (arxiv 2510.22228) showed that even removing 1-2 layers can break multi-step reasoning ability, while simple tasks work fine

Code

All experiment code is available on GitLab: https://gitlab.com/molchanov.artem.1994/lobotomyllm

git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyLlm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick

Conclusion

Hypothesis confirmed: modern LLMs are over-parameterized, 30% of layers can be removed with <3% quality loss.

Key insights:

Layer 2 is the most important (unexpectedly more important than Layer 0)
Layers 3-5, 9-12 are redundant (can be removed almost for free)
Layer 15 is a hidden critical layer in the late part of the network

Practical result: removing 7 layers (22→15) gives ~32% speedup with ~2.5% quality loss.

DEV Community