Raju C

Week 4: From Theory to Training - My First Neural Networks

Week 4 done.

Last week: Shallow algorithms (Linear Regression, Logistic Regression).
This week: Neural networks - actually building and training them.

Still not LLMs. Still not ChatGPT integrations. Still "boring" ML.

But here's why: I want to understand what's actually happening, not just call APIs.

The difference? Last week I learned what models predict.
This week I learned how they learn.

The Shift: From Equations to Architectures

This week was about understanding when complexity is worth it.

What I Actually Built

1. Handwritten Digit Recognition (MNIST)

The problem: Recognize handwritten digits (0-9) from 28×28 pixel images.

import torch
import torch.nn as nn

class DigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),            # 28x28 image -> 784-dim vector
            nn.Linear(784, 128),
            nn.ReLU(),
            nn.Dropout(0.2),         # randomly zero 20% of activations during training
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 10)        # one logit per digit class
        )

    def forward(self, x):
        return self.layers(x)

# Result: 97% accuracy

In distributed systems, we build pipelines: Data → Transform → Store → Serve

Neural networks are the same: Input → Hidden Layers → Output

The transformation happens through learning, not hardcoding rules.
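That "learning" is just a loop: forward pass, measure error, backpropagate, update weights. A minimal sketch with stand-in data (not my exact training setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in model and a single fake batch, just to show the mechanics
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10))
images = torch.randn(32, 1, 28, 28)   # fake 28x28 "images"
labels = torch.randint(0, 10, (32,))  # fake digit labels

loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for step in range(5):                 # in practice: loop over a DataLoader for several epochs
    logits = model(images)            # forward pass: input -> layers -> output
    loss = loss_fn(logits, labels)    # measure how wrong the predictions are
    optimizer.zero_grad()             # clear gradients from the previous step
    loss.backward()                   # backpropagation: compute gradients
    optimizer.step()                  # nudge weights to reduce the loss
    losses.append(loss.item())
```

Every framework dresses this up, but the pipeline underneath is the same four steps.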

2. Experimenting with Layer Architectures

Dropout Layers - Force redundant representations. Like fault-tolerant systems - if one node fails, others handle the load.

Convolutional Layers - Respect spatial structure. The same filter slides across the image (parameter sharing). Like using the same load balancing algorithm across all services.

Dense layers: 97% accuracy → Conv layers: 99.2% accuracy
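A minimal convolutional version, roughly the shape that got me from dense-layer accuracy to conv-layer accuracy (the channel counts here are illustrative, not my exact ones):

```python
import torch
import torch.nn as nn

class ConvDigitClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),  # same 3x3 filter slides over the whole image
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 28x28 -> 14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # 14x14 -> 7x7
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 10)                    # one logit per digit class
        )

    def forward(self, x):
        return self.layers(x)
```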

BatchNorm - Stabilizes training by normalizing inputs to each layer. Like circuit breakers in microservices.

Without BatchNorm: Stuck at 85% → With BatchNorm: 97%
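Adding BatchNorm is one extra module between the linear transform and the activation. A sketch of what it does to a layer's inputs (layer sizes are placeholders):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.randn(32, 784)          # a batch of flattened images
linear = nn.Linear(784, 128)
bn = nn.BatchNorm1d(128)

z = linear(x)                     # raw pre-activations: arbitrary scale per feature
z_normed = bn(z)                  # each feature normalized across the batch
# z_normed has ~zero mean and unit variance per feature,
# so the next layer sees stable input statistics every step
```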

3. Image Denoising with Autoencoders

The architecture: Noisy Image → [Encoder] → Bottleneck → [Decoder] → Clean Image

The bottleneck (32 dimensions) is key:

  • Too large (128): Memorizes noise
  • Too small (8): Loses detail
  • Just right (32): Learns what matters ✓

This is lossy compression with learned parameters. Like designing a caching layer - except the model learns the patterns.
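A sketch of the denoising autoencoder shape - the 32-dim bottleneck is the one from my experiments, but the other layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self, bottleneck=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(784, 128), nn.ReLU(),
            nn.Linear(128, bottleneck), nn.ReLU(),  # compress to the bottleneck
        )
        self.decoder = nn.Sequential(
            nn.Linear(bottleneck, 128), nn.ReLU(),
            nn.Linear(128, 784), nn.Sigmoid(),      # back to pixel range [0, 1]
        )

    def forward(self, x):
        return self.decoder(self.encoder(x)).view(-1, 1, 28, 28)

# Training pairs up noisy inputs with clean targets:
# loss = MSE(model(noisy_image), clean_image)
```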

Extra: In-painting (reconstructing obscured regions). Simple digits: 90% success, Complex digits: 60% success.

What Frustrated Me Most

Hyperparameter hell. Every decision affects everything - learning rate, layers, dropout, bottleneck size. I've spent years tuning JVM heaps and thread pools. This feels similar but with 10x more knobs.

Solution: Start with known-good defaults. Change one thing at a time. Keep notes.

The Debugging Moment

Problem: Autoencoder producing blurry reconstructions despite loss decreasing.

The fix: Changed optimizer from SGD to Adam.

Adam adapts learning rates per parameter; SGD uses the same rate for everything. Like auto-scaling different services based on their individual load patterns.
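The change itself is one line (the learning rates below are common defaults, not my tuned values):

```python
import torch
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in model

# Before: one global learning rate for every parameter
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# After: Adam tracks per-parameter gradient statistics and adapts each step size
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```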

Mistakes I Made

1. Tested on training data (again!)
Should know: Always test on unseen data.
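Holding out data is one call. A sketch with a stand-in dataset (with MNIST you'd just use the provided train/test split):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Fake dataset standing in for real (image, label) pairs
data = TensorDataset(torch.randn(1000, 784), torch.randint(0, 10, (1000,)))

# Hold out 20% the model never sees during training
train_set, test_set = random_split(data, [800, 200])
```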

2. Forgot model.eval()
Dropout was randomly disabling neurons during testing!
Training mode: 82% → Eval mode: 97%
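The fix is a single call that switches Dropout (and BatchNorm) into inference behavior:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(0.5))
x = torch.ones(1, 10)

model.train()                 # training mode: Dropout randomly zeroes activations
model.eval()                  # inference mode: Dropout becomes a no-op
with torch.no_grad():
    out1 = model(x)
    out2 = model(x)
# In eval mode, repeated forward passes give identical outputs
```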

3. Picked architecture randomly
5 layers? 256 neurons? Dropout 0.8? Model barely learned.
Fix: Started with proven architectures (LeNet), modified incrementally.

4. Didn't normalize input data
Raw pixels (0-255): unstable, loss exploding
Normalized (0-1): stable, converging ✓
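The fix is also one line; with torchvision's loaders, ToTensor already does this scaling for you. A sketch with raw tensors:

```python
import torch

raw = torch.randint(0, 256, (32, 1, 28, 28)).float()  # raw pixel values 0-255
x = raw / 255.0                                        # scale to [0, 1]
# Optionally standardize further, e.g. (x - x.mean()) / x.std()
```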

The Pattern Recognition

What I've learned leading teams applies here:

Start simple, add complexity only when needed
Basic: 92% → +dropout: 95% → +BatchNorm: 97% → +Conv: 99%

Understand trade-offs
More layers = more capacity = slower training = higher overfitting risk

Experiment systematically
Bottleneck sizes: 8 (blurry) → 16 (better) → 32 (clean ✓) → 64 (memorizing) → 128 (overfitting)

Connection to Distributed Systems

Encoder-Decoder = Data Pipeline
The bottleneck is like network bandwidth - compress to fit through it.

Parameter Sharing = Code Reuse
One "edge detector" works everywhere. Efficient and effective.

Overfitting = Over-optimization
I've seen systems optimized for one traffic pattern that broke when patterns changed.
Solution: regularization / graceful degradation.

Time Spent This Week

About 8-10 hours.

What I'm Taking Forward

Neural networks aren't a stepping stone to "real" AI. They ARE real AI.

Most production ML uses these techniques. LLMs get the hype. But understanding backpropagation, gradients, and optimization matters.

I could be building LLM wrappers right now. But I wouldn't understand:

  • How training actually works
  • Why models fail in specific ways
  • When to use what architecture
  • How to debug learning problems

Starting with fundamentals means I can build real intuition.

The approach that works:

Start with fundamentals. Build something small (like MNIST). Don't start with "I'm going to build ChatGPT." Master the basics. Understand debugging. Build intuition. Then scale up.

Same advice I give for learning any new tech stack.

What's Still Hard

  • Choosing architecture (Conv vs Dense? How many layers?)
  • Hyperparameter tuning (still trial and error)
  • Knowing when to stop (97% vs 99%?)

These feel like architectural decisions I make daily - but with less intuition.

Week 4 down. Built neural networks that actually learn.


Learning deep learning as a senior engineer? What surprised you most about the transition?
