B.Nikhil Tej
I Tried to Run VGG19 on a CPU… It Failed. So I Fixed It.

"Turning a 500MB deep learning model into something actually usable using pruning, quantization, and simple tricks"

Deep learning models look impressive when you read about them.

High accuracy, benchmark scores, state-of-the-art results.

But the moment you try to actually use one in a real system, things feel very different.

That’s exactly what happened when I started working with VGG19.


The problem I ran into

I picked VGG19 because it's simple and widely used. It felt like a safe choice.

But when I tried to run it on a CPU:

  • It was slow
  • It took time just to load
  • Memory usage was huge

At that point, it became obvious:

This model is powerful, but not practical.

Instead of switching to a different model, I wanted to understand something deeper:

Can I make this model usable instead of replacing it?


What VGG19 actually looks like

[Figure: VGG19 architecture — a layer-by-layer comparison diagram showing VGG19 alongside plain and residual 34-layer networks, highlighting how convolution blocks, pooling, and residual connections differ in structure and depth.]

VGG19 is built using:

  • 16 convolution layers
  • 5 max-pooling layers
  • 3 fully connected layers

Initially, I thought the convolution layers were the main reason for its size.

But after digging deeper, I realized something important:

Most of the parameters are actually in the fully connected layers.

That’s why the model ends up with:

  • ~143 million parameters
  • ~500MB size
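Those two numbers can be checked by hand from the standard VGG19 configuration alone. A quick back-of-the-envelope in plain Python (layer shapes are the published VGG19 spec, not anything project-specific):

```python
# Counting VGG19 parameters by hand from the architecture spec,
# to see where the ~143M parameters actually live.

# (in_channels, out_channels) for the 16 conv layers, all 3x3 kernels
conv_cfg = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512), (512, 512),
]
# weights (cin * cout * 3 * 3) plus one bias per output filter
conv_params = sum(cin * cout * 3 * 3 + cout for cin, cout in conv_cfg)

# (in_features, out_features) for the 3 fully connected layers;
# the first one takes the flattened 512 x 7 x 7 feature map
fc_cfg = [(512 * 7 * 7, 4096), (4096, 4096), (4096, 1000)]
fc_params = sum(fin * fout + fout for fin, fout in fc_cfg)

total = conv_params + fc_params
print(f"conv:  {conv_params / 1e6:.1f}M")  # ~20.0M
print(f"fc:    {fc_params / 1e6:.1f}M")    # ~123.6M
print(f"total: {total / 1e6:.1f}M")        # ~143.7M
```

The split is striking: the 16 convolution layers hold only ~20M parameters, while the 3 fully connected layers hold ~123M of the ~143M total — which is exactly why the size-reduction work later focuses on them.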

What I built

Instead of guessing what works, I built a small system to test things properly.

The idea was simple:

  • Upload an image
  • Run it through different optimized versions of the VGG19 model
  • Compare the results side-by-side

Here’s what my interface looks like in practice:

[Image: the comparison interface, with one card per model variant]

Each card represents a different version of the same model:

  • Baseline
  • Pruned
  • Quantized
  • Pruned + Quantized
  • Input scaled

This makes it easy to visually compare how each optimization affects performance.


1. Structured pruning — removing unnecessary filters

The first thing I tried was pruning.

Not random pruning, but structured pruning.

Instead of removing individual weights, I removed entire filters from convolution layers.

What I did

  • Focused on deeper layers (early layers learn general, low-level features the rest of the network depends on)
  • Removed about 10% of filters
  • Used L2 norm to identify less important filters
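The ranking step works like this: flatten each filter, take its L2 norm, and mark the weakest fraction for removal. Here is a simplified NumPy sketch of that idea — `filters_to_prune` is a hypothetical helper for illustration, not the project's actual code:

```python
# Sketch of L2-norm filter ranking for structured pruning.
# A conv layer's weight has shape (out_filters, in_channels, kH, kW);
# we rank filters by the L2 norm of their weights and drop the weakest 10%.
import numpy as np

def filters_to_prune(weight: np.ndarray, fraction: float = 0.10) -> np.ndarray:
    """Return indices of the `fraction` of filters with the smallest L2 norm."""
    norms = np.sqrt((weight ** 2).sum(axis=(1, 2, 3)))  # one norm per filter
    k = int(len(norms) * fraction)
    return np.argsort(norms)[:k]  # indices of the weakest filters

# Toy example: 20 random 3x3 filters over 16 input channels
rng = np.random.default_rng(0)
w = rng.normal(size=(20, 16, 3, 3))
drop = filters_to_prune(w)
print(drop)  # indices of the 2 weakest filters (10% of 20)
```

Because whole filters are removed rather than scattered individual weights, the resulting layer is genuinely smaller and faster on ordinary hardware, instead of just sparser.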

What I observed

  • The model became lighter
  • Inference got faster
  • Accuracy dropped slightly

To fix that, I added a short fine-tuning step.

That helped recover most of the performance.


2. Quantization — reducing precision

Next, I explored quantization.

Instead of changing the structure, this changes how weights are stored.

  • Before → 32-bit floating point
  • After → 8-bit integers

What stood out

Just quantizing the fully connected layers reduced a huge portion of the model size.

Fully connected layers are the main bottleneck.

It also made inference faster on CPU.
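To make the size reduction concrete, here is a simplified sketch of what symmetric int8 quantization does to one fully-connected weight matrix. This illustrates the idea only — real frameworks (e.g. PyTorch's dynamic quantization) handle scales and zero points with more care:

```python
# Sketch: map float32 weights onto 256 integer levels with a scale
# factor, then store the int8 values. Storage drops 4x per tensor.
import numpy as np

def quantize_int8(w: np.ndarray):
    scale = np.abs(w).max() / 127.0           # symmetric quantization
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

# A 4096x4096 matrix, the size of VGG19's middle fully connected layer
w = np.random.default_rng(1).normal(size=(4096, 4096)).astype(np.float32)
q, scale = quantize_int8(w)

print(f"float32 storage: {w.nbytes / 1e6:.1f} MB")  # ~67.1 MB
print(f"int8 storage:    {q.nbytes / 1e6:.1f} MB")  # ~16.8 MB
print(f"max abs error:   {np.abs(dequantize(q, scale) - w).max():.4f}")
```

A 4x storage reduction per tensor, at the cost of a small rounding error bounded by half the scale factor — which is why quantizing the parameter-heavy fully connected layers shrinks the model so dramatically.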


3. Combining both — what worked best

After trying both techniques separately, I combined them.

The order mattered:

  1. Prune the model
  2. Fine-tune it
  3. Apply quantization

This worked much better than using either technique on its own.

Final result:

  • Smaller model
  • Faster inference
  • Predictions still mostly consistent

4. Input scaling — the simplest trick

One of the simplest changes gave surprisingly good results.

Instead of modifying the model, I reduced the input size:

  • From 224×224
  • To 160×160

That’s almost half the number of pixels.

Since a convolution layer’s cost scales with the spatial area of its input, this reduced computation significantly.

No retraining was needed.
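The arithmetic behind "almost half" is simple enough to check directly:

```python
# Back-of-the-envelope: shrinking the input from 224x224 to 160x160
# roughly halves the work in every conv layer, since conv cost scales
# with the spatial area of its feature maps.
full = 224 * 224      # 50,176 pixels
scaled = 160 * 160    # 25,600 pixels
print(f"pixels: {full} -> {scaled}")
print(f"ratio:  {scaled / full:.2f}")  # ~0.51, i.e. ~49% less computation
```

The trade-off is resolution: smaller inputs can hurt accuracy on fine-grained images, so it is worth validating on your own data before relying on this trick.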


What the results showed

Looking at the interface output:

  • Baseline → slowest and largest
  • Pruning → moderate speed improvement
  • Quantization → major size reduction
  • Pruning + Quantization → best balance
  • Input scaling → highest speed boost

No single technique solves everything.

Each technique targets a different limitation:

  • Pruning → reduces computation
  • Quantization → reduces memory
  • Input scaling → reduces workload

What I learned

A few things stood out clearly:

  1. Models are often overbuilt
  2. Fully connected layers are the real bottleneck
  3. Simple changes matter
  4. Combining techniques works best

The trade-off

There’s always a balance:

  • Smaller model → faster
  • But → slight accuracy drop

The goal is to manage that trade-off.


Why this matters

This isn’t just about VGG19.

These ideas apply to:

  • Edge devices
  • Mobile applications
  • Real-time systems
  • IoT deployments

Anywhere performance and memory are limited.


Final thought

Working on this changed how I think about deep learning models.

It’s not just about building accurate models.

It’s about making them usable.


Code and Project

If you want to explore the implementation:

👉 https://github.com/Nikhil-tej108/VGG19


What I want to explore next

  • Knowledge distillation
  • Edge deployment
  • Real-time inference

If you've worked on similar optimization problems, I'd love to hear your approach.
