An experiment in surgical layer removal from a language model
TL;DR
I took TinyLlama (1.1B parameters, 22 layers) and started removing layers to test the hypothesis: modern LLMs are over-parameterized, and many layers do the same thing.
Results:
- Removed 1 middle layer β +10% speed, -4% quality
- Removed 7 layers (safe ones) β +30% speed, -2.5% quality
- Removed first layer β model broke
- Unexpected: Layer 2 is more important than Layer 0! (+6.67 vs +3.92 perplexity)
Tested all 22 layers individually. Here's what I found.
Why Does This Matter?
Startups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700k per day on compute alone. Any optimization that speeds up the model without losing quality is direct cost savings.
Layer pruning is one way to speed things up. The idea is simple:
- Modern models have dozens of layers (GPT-4 supposedly 120+)
- Not all layers are equally useful
- Some can be removed, and the model barely notices
Research ShortGPT (2024) showed that you can remove 25% of layers from LLaMA-2 with less than 5% quality loss. I decided to verify this in practice.
Experiment Setup
Hardware: MacBook Pro M4 Pro, 24GB RAM
Model: TinyLlama-1.1B-Chat-v1.0
- 1.1 billion parameters
- 22 layers (decoder blocks)
- LLaMA architecture
Metrics:
- Perplexity β how "surprised" the model is by text (lower = better)
- Tokens/second β generation speed
- Generation quality β subjective assessment of output text
Code: PyTorch + HuggingFace Transformers. Removing a layer = literally removing it from model.model.layers:
def remove_layers(model, layers_to_remove):
original_layers = list(model.model.layers)
new_layers = [
layer for i, layer in enumerate(original_layers)
if i not in layers_to_remove
]
model.model.layers = nn.ModuleList(new_layers)
return model
Results
Summary Table
| What I Removed | Perplexity | Ξ Quality | Tokens/s | Ξ Speed | Works? |
|---|---|---|---|---|---|
| Nothing (baseline) | 1.82 | β | 59 | β | β |
| Middle layer (#11) | 1.89 | -4% | 64 | +10% | β |
| 3 middle layers (#10-12) | 2.24 | -23% | 66 | +12% | β |
| First layer (#0) | 5.74 | -215% | 64 | +10% | β |
| 7 safe layers | ~1.87 | ~-2.5% | ~77 | ~30% | β |
Note: precise measurements β 10 runs, 5 warmup, MPS backend
Key Discovery: Middle Layers Are Redundant
Removing one layer from the middle of the model (layer #11 out of 22) gave:
- +10% generation speed (59 β 64 tokens/sec)
- Only -4% quality (perplexity 1.82 β 1.89)
Removing 7 safe layers (3, 4, 5, 9, 10, 11, 12) can achieve ~30% speedup.
Generation remained completely coherent:
Prompt: "Once upon a time"
Baseline: (not measured)
After removing layer #11: "Once upon a time, I was a web developer. Today, I am a freelance web developer. I have worked for some of the most prestigious web..."
The model still generates coherent, grammatically correct text.
First Layer Is Sacred
Here's what happened when I removed the first layer:
After removing layer #0: "Once upon a time and a time. Therefore, the therefore, the therefore. Therefore, the therefore, the therefore. Therefore, the..."
The model broke. Perplexity shot up from 1.82 to 5.74 (3x worse). Text became meaningless repetition.
Why? Early layers are responsible for:
- Basic attention patterns
- Positional encoding
- Fundamental understanding of language structure
Without them, the model loses the ability to understand how words relate to each other.
Visualization: Importance of Each Layer
I tested removing each layer individually and measured quality degradation:
Layer 0: ββββββββββββββββββββββββββββββββββββββββ +3.92 π΄ CRITICAL
Layer 1: ββββββββββ +0.43
Layer 2: ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ +6.67 π΄ MOST IMPORTANT!
Layer 3: +0.01 π’ CAN REMOVE
Layer 4: β +0.06 π’
Layer 5: +0.04 π’
Layer 6: ββ +0.12
Layer 7: βββββββββββββββ +0.74
Layer 8: ββ +0.12
Layer 9: β +0.07 π’
Layer 10: β +0.05 π’
Layer 11: β +0.07 π’
Layer 12: ββ +0.09 π’
Layer 13: βββ +0.14
Layer 14: βββββββββββ +0.53
Layer 15: ββββββββββββββββββββββββββββββββββββ +1.81 π IMPORTANT
Layer 16: βββββ +0.27
Layer 17: ββ +0.12
Layer 18: ββββ +0.18
Layer 19: ββββ +0.19
Layer 20: ββββββ +0.28
Layer 21: βββββββββ +0.47
Unexpected discovery: Layer 2 is more important than Layer 0! This is the layer that forms key language patterns.
Safe to remove layers: 3, 4, 5, 9, 10, 11, 12 β increase perplexity by less than 0.1.
Interpretation: Why This Distribution?
Results revealed three critical zones:
π΄ Critical Zone 1: Layer 2 (PPL +6.67)
The most important layer in the model! This is unexpected β it's usually assumed that Layer 0 is most important.
Hypothesis: Layer 2 is where key attention patterns are formed. The first two layers create a "raw" representation, and Layer 2 "crystallizes" it into a structure that all other layers use.
π΄ Critical Zone 2: Layer 0 (PPL +3.92)
The first layer is important for:
- Processing positional encoding
- Basic token understanding
- Initializing attention patterns
π Critical Zone 3: Layer 15 (PPL +1.81)
Unexpected spike in late middle layers. Possibly this is the layer where "switching" from general semantics to task-specific processing happens.
π’ Safe Zone: Layers 3-5, 9-12
These layers show minimal impact (PPL increase < 0.1). They perform redundant computations β repeating what neighboring layers already did.
Practical takeaway: you can remove 5-7 layers (layers 3, 4, 5, 9, 10, 11, 12) with less than 0.5% quality loss and get ~30% speedup.
Research ShortGPT introduced the Block Influence (BI) metric β my results fully align with their findings: middle layers show low BI and can be safely removed.
Practical Takeaways
For Engineers
Based on per-layer analysis β optimal combinations for removal:
| Aggressiveness | Remove Layers | Expected Loss | Speedup |
|---|---|---|---|
| Minimal | {3} | ~0.4% | ~5% |
| Moderate | {3, 5, 10, 11} | ~1% | ~18% |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~2.5% | ~32% |
# Optimal strategy: remove least important layers
safe_layers_to_remove = {3, 4, 5, 9, 10, 11, 12} # PPL increase < 0.1 each
remove_layers(model, safe_layers_to_remove)
# Result: 22 -> 15 layers, ~32% speedup, ~2.5% quality loss
Important: never remove layers 0, 2, 15 β these are critical points.
For Researchers
This field is actively developing:
- ShortGPT (2024) β removing entire layers
- FinerCut (2024) β removing components within layers
- SliceGPT (2024) β removing rows/columns from weight matrices
- LinearPatch (2025) β recovering 94% quality after pruning via Hadamard transform (arxiv)
- MRP (2025) β Maximum Redundancy Pruning, adaptive removal of most redundant layers (arxiv)
- CLP (2025) β automatic search for optimal segments to remove (arxiv)
Combining with quantization (INT4/INT8) can give even greater speedup.
For Business
If you're paying $10k/month for inference GPUs, layer pruning can save $2-3k without noticeable quality loss. At OpenAI's scale, this is millions of dollars.
Experiment Limitations
- Small model β TinyLlama 1.1B, results may differ for 7B/70B models
- Simple metric β perplexity doesn't capture all quality aspects
- No fine-tuning β possibly after removing layers the model can be fine-tuned to recover quality
- Single dataset β need to test on different tasks
- Measurement variability β speed on MPS backend has Β±10% variance, important to do many runs
- Chain-of-thought degradation β recent research (arxiv 2510.22228) showed that even removing 1-2 layers can break multi-step reasoning ability, while simple tasks work fine
Code
All experiment code is available on GitLab: https://gitlab.com/molchanov.artem.1994/lobotomyllm
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyLlm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
Conclusion
Hypothesis confirmed: modern LLMs are over-parameterized, 30% of layers can be removed with <3% quality loss.
Key insights:
- Layer 2 is the most important (unexpectedly more important than Layer 0)
- Layers 3-5, 9-12 are redundant (can be removed almost for free)
- Layer 15 is a hidden critical layer in the late part of the network
Practical result: removing 7 layers (22β15) gives ~32% speedup with ~2.5% quality loss.
Next steps:
- Run on Llama-3 8B for more convincing results
- Try pruning + quantization combination
- Investigate what critical layers (Layer 2, Layer 15) actually "know"
If you liked this β subscribe, star the GitLab repo, share with colleagues.
Questions and suggestions β in the comments or DM.
Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning

Top comments (0)