Week 4 done.
Last week: Shallow algorithms (Linear Regression, Logistic Regression).
This week: Neural networks - actually building and training them.
Still not LLMs. Still not ChatGPT integrations. Still "boring" ML.
But here's why: I want to understand what's actually happening, not just call APIs.
The difference? Last week I learned what models predict.
This week I learned how they learn.
The Shift: From Equations to Architectures
This week was about understanding when complexity is worth it.
What I Actually Built
1. Handwritten Digit Recognition (MNIST)
The problem: Recognize handwritten digits (0-9) from 28×28 pixel images.
```python
import torch
import torch.nn as nn

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),            # 28x28 image -> 784-dim vector
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Dropout(0.2),         # randomly zero 20% of activations during training
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 10),       # one logit per digit class
        )

    def forward(self, x):
        return self.layers(x)

# Result: 97% accuracy
```
In distributed systems, we build pipelines: Data → Transform → Store → Serve
Neural networks are the same: Input → Hidden Layers → Output
The transformation happens through learning, not hardcoding rules.
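That learning loop can be sketched in a few lines. This is a minimal, self-contained version of the kind of training loop I ran; in a real run the batches would come from torchvision's MNIST loader, so the random tensors below are stand-ins, and the widths mirror the model above.

```python
import torch
import torch.nn as nn

# Same layer stack as DigitClassifier above, as a plain Sequential.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(784, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in for a DataLoader: five batches of 32 fake "images" with labels.
batches = [(torch.randn(32, 1, 28, 28), torch.randint(0, 10, (32,)))
           for _ in range(5)]

model.train()
for images, labels in batches:
    optimizer.zero_grad()            # clear gradients from the previous step
    logits = model(images)           # forward pass: Input -> Hidden Layers -> Output
    loss = criterion(logits, labels) # how wrong are we?
    loss.backward()                  # backpropagation computes gradients
    optimizer.step()                 # gradients update the weights: this is "learning"
```

Every pass is the same pipeline shape: forward, measure error, push corrections backward.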
2. Experimenting with Layer Architectures
Dropout Layers - Force the network to learn redundant representations. Like fault-tolerant systems: if one node fails, others handle the load.
Convolutional Layers - Respects spatial structure. Same filter slides across the image (parameter sharing). Like using the same load balancing algorithm across all services.
Dense layers: 97% accuracy → Conv layers: 99.2% accuracy
BatchNorm - Stabilizes training by normalizing inputs to each layer. Like circuit breakers in microservices.
Without BatchNorm: Stuck at 85% → With BatchNorm: 97%
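To make the layer ordering concrete, here's a sketch of one convolutional block with BatchNorm. The exact channel counts from my runs aren't in this post, so 16 filters is an illustrative choice; the point is the Conv → BatchNorm → ReLU → Pool ordering.

```python
import torch
import torch.nn as nn

conv_block = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1),  # same 3x3 filter slides across the image
    nn.BatchNorm2d(16),                          # normalize activations per channel
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample 28x28 -> 14x14
)

x = torch.randn(8, 1, 28, 28)   # batch of 8 grayscale MNIST-sized images
out = conv_block(x)
print(out.shape)                # torch.Size([8, 16, 14, 14])
```

One 3x3 filter is just 9 weights, reused at every position: that's the parameter sharing that made the conv model both smaller and more accurate than the dense one.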
3. Image Denoising with Autoencoders
The architecture: Noisy Image → [Encoder] → Bottleneck → [Decoder] → Clean Image
The bottleneck (32 dimensions) is key:
- Too large (128): Memorizes noise
- Too small (8): Loses detail
- Just right (32): Learns what matters ✓
This is lossy compression with learned parameters. Like designing a caching layer - except the model learns the patterns.
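Here's a sketch of that encoder-bottleneck-decoder shape with the 32-dimensional bottleneck. The intermediate width (128) is an assumption for illustration, not the exact architecture from my runs.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, bottleneck), nn.ReLU(),  # compress 784 dims -> 32
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, 784), nn.Sigmoid(),      # pixel values back in [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x)).view(-1, 1, 28, 28)

model = DenoisingAutoencoder()
clean = torch.rand(4, 1, 28, 28)
noisy = (clean + 0.3 * torch.randn_like(clean)).clamp(0, 1)  # add Gaussian noise
recon = model(noisy)
loss = nn.functional.mse_loss(recon, clean)  # reconstruct against the CLEAN target
```

The key trick: the input is noisy but the loss compares against the clean image, so the bottleneck has to keep the digit and discard the noise.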
Extra: inpainting (reconstructing obscured regions). Simple digits: 90% success; complex digits: 60% success.
What Frustrated Me Most
Hyperparameter hell. Every decision affects everything - learning rate, layers, dropout, bottleneck size. I've spent years tuning JVM heaps and thread pools. This feels similar but with 10x more knobs.
Solution: Start with known-good defaults. Change one thing at a time. Keep notes.
The Debugging Moment
Problem: Autoencoder producing blurry reconstructions despite loss decreasing.
The fix: Changed optimizer from SGD to Adam.
Adam adapts learning rates per parameter. SGD uses same rate for everything. Like auto-scaling different services based on their individual load patterns.
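The fix itself was a one-line swap; a sketch (the model and learning rates here are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for the autoencoder

# Before: SGD applies one global learning rate to every parameter
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# After: Adam maintains per-parameter adaptive rates
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```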
Mistakes I Made
1. Tested on training data (again!)
I should know better: always evaluate on held-out data.
2. Forgot model.eval()
Dropout was randomly disabling neurons during testing!
Training mode: 82% → Eval mode: 97%
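The difference is one call. A small sketch showing why it matters (the model here is a placeholder with the same dropout setup):

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(128, 10),
)
x = torch.randn(1, 784)

# model.train(): dropout is ACTIVE, so the same input gives different outputs.
# model.eval(): dropout is disabled, predictions are deterministic.
model.eval()
with torch.no_grad():            # also skip gradient tracking during evaluation
    out1 = model(x)
    out2 = model(x)

print(torch.equal(out1, out2))   # True
```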
3. Picked architecture randomly
5 layers? 256 neurons? Dropout 0.8? Model barely learned.
Fix: Started with proven architectures (LeNet), modified incrementally.
4. Didn't normalize input data
Raw pixels (0-255): unstable, loss exploding
Normalized (0-1): stable, converging ✓
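The fix is trivial once you know it's needed; a sketch (in practice torchvision's `transforms.ToTensor()` does this scaling for you when loading MNIST):

```python
import torch

raw = torch.randint(0, 256, (32, 1, 28, 28)).float()  # raw pixel values, 0-255
normalized = raw / 255.0                               # scale into [0, 1]

print(normalized.min().item(), normalized.max().item())
```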
The Pattern Recognition
What I've learned leading teams applies here:
Start simple, add complexity only when needed
Basic: 92% → +dropout: 95% → +BatchNorm: 97% → +Conv: 99%
Understand trade-offs
More layers = more capacity = slower training = higher overfitting risk
Experiment systematically
Bottleneck sizes: 8 (blurry) → 16 (better) → 32 (clean ✓) → 64 (memorizing) → 128 (overfitting)
Connection to Distributed Systems
Encoder-Decoder = Data Pipeline
The bottleneck is like network bandwidth - compress to fit through it.
Parameter Sharing = Code Reuse
One "edge detector" works everywhere. Efficient and effective.
Overfitting = Over-optimization
I've seen systems optimized for one traffic pattern that broke when patterns changed.
Solution: regularization / graceful degradation.
Time Spent This Week
About 8-10 hours this week.
What I'm Taking Forward
Neural networks aren't a stepping stone to "real" AI. They ARE real AI.
Most production ML uses these techniques. LLMs get the hype. But understanding backpropagation, gradients, and optimization matters.
I could be building LLM wrappers right now. But I wouldn't understand:
- How training actually works
- Why models fail in specific ways
- When to use what architecture
- How to debug learning problems
Starting with fundamentals means I can build real intuition.
The approach that works:
Start with fundamentals. Build something small (like MNIST). Don't start with "I'm going to build ChatGPT." Master the basics. Understand debugging. Build intuition. Then scale up.
Same advice I give for learning any new tech stack.
What's Still Hard
- Choosing architecture (Conv vs Dense? How many layers?)
- Hyperparameter tuning (still trial and error)
- Knowing when to stop (97% vs 99%?)
These feel like architectural decisions I make daily - but with less intuition.
Week 4 down. Built neural networks that actually learn.
Learning deep learning as a senior engineer? What surprised you most about the transition?