🎭 Compressing Human Faces with VAE vs VQ-VAE: A Deep Dive into Autoencoder Design

"Can neural networks really compress faces efficiently, without losing identity?"

In this post, I explore this question by building and comparing two popular generative compression architectures: Variational Autoencoder (VAE) and Vector Quantized VAE (VQ-VAE), trained on passport-style human face images.

🔗 GitHub Repository
📂 Dataset Source (Kaggle)


📦 Why Autoencoders for Image Compression?

Autoencoders learn to reconstruct input data from a compact representation (latent space). This enables lossy compression by:

  • Removing irrelevant pixel-level noise
  • Learning semantic structure (e.g., eyes, nose, face contour)
  • Outputting reconstructions that are visually close to original but much smaller in size

But not all autoencoders are created equal. Let's break down how VAE and VQ-VAE differ, and which one works best for face images.


🔧 Project Setup

  • Dataset: 3000+ frontal face images from Kaggle (balanced by lighting, expression, and gender)
  • All images resized to 64Γ—64 or 128Γ—128
  • Trained on CPU with PyTorch
  • Output format: JPEG (quality=85)
```bash
# Install dependencies
pip install -r requirements.txt
```
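
For reference, here is a minimal preprocessing sketch of the resize-and-save step described above. The folder names and resampling settings are placeholders, not the repo's actual pipeline:

```python
# Minimal preprocessing sketch: resize raw face images and re-save them as
# JPEG (quality=85). Folder names below are hypothetical placeholders.
from pathlib import Path
from PIL import Image

RAW_DIR = Path("data/raw_faces")   # hypothetical input folder
OUT_DIR = Path("data/faces_64")    # hypothetical output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

for img_path in RAW_DIR.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    img = img.resize((64, 64))                                    # or (128, 128)
    img.save(OUT_DIR / img_path.name, format="JPEG", quality=85)
```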

🧠 Architecture 1: Variational Autoencoder (VAE)

VAE is a probabilistic generative model that learns a continuous latent space:

  • Encoder outputs mean (ΞΌ) and log variance (logσ²)
  • Latent vector sampled as: z = ΞΌ + Οƒ * Ξ΅ where Ξ΅ ~ N(0,1)
  • Decoder reconstructs image from z
```python
h = encoder(x)                   # shared encoder features
mu = fc_mu(h)                    # mean of q(z|x)
logvar = fc_logvar(h)            # log-variance of q(z|x)
z = reparameterize(mu, logvar)   # sample z = mu + sigma * eps
x_hat = decoder(z)               # reconstruction
```

Loss = MSE reconstruction + KL divergence (to keep the latent distribution close to a standard Gaussian)
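
Putting both terms together, a minimal sketch of the reparameterization trick and the combined loss might look like this (the exact reduction and KL weighting used in the repo may differ):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, 1); keeps sampling differentiable
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term (MSE) plus KL divergence to the standard normal prior
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```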

✅ Pros:

  • Smooth latent space, good for interpolation
  • Easy to implement

❌ Cons:

  • Blurry outputs due to probabilistic sampling
  • Gaussian prior limits representation precision

📸 Sample Result (64×64, 50 epochs)

πŸ–ΌοΈ Original:     93.71 KB
πŸ” Reconstructed: 1.62 KB
πŸ“‰ Compression Rate: 57.84x
Enter fullscreen mode Exit fullscreen mode
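
For context, the compression rate reported here is simply the ratio of the two file sizes on disk; a tiny helper like the one below reproduces it (paths are placeholders):

```python
import os

def compression_rate(original_path: str, reconstructed_path: str) -> float:
    # Ratio of the original and reconstructed JPEG file sizes on disk
    return os.path.getsize(original_path) / os.path.getsize(reconstructed_path)

# e.g. 93.71 KB / 1.62 KB ≈ 57.84x
```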

🧠 Architecture 2: Vector Quantized VAE (VQ-VAE)

VQ-VAE replaces the continuous latent space with discrete codebook vectors:

  • Encoder outputs feature map β†’ quantized to nearest embedding
  • Decoder reconstructs image from quantized features
```python
z = encoder(x)                            # continuous feature map
quantized, vq_loss = vector_quantizer(z)  # snap each position to its nearest codebook vector
x_hat = decoder(quantized)                # reconstruction
```

Loss = MSE reconstruction + VQ loss (codebook and commitment terms)
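
The vector_quantizer call above hides the interesting part. Below is a minimal sketch of such a module, with nearest-neighbour lookup, the codebook/commitment losses, and straight-through gradients; the codebook size and commitment cost are placeholder values, not necessarily those used in the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    # Nearest-neighbour codebook lookup with commitment loss and
    # straight-through gradients (hyperparameters are placeholders).
    def __init__(self, num_embeddings=512, embedding_dim=64, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding.weight.data.uniform_(-1 / num_embeddings, 1 / num_embeddings)
        self.commitment_cost = commitment_cost

    def forward(self, z):                              # z: (B, C, H, W), C == embedding_dim
        z_perm = z.permute(0, 2, 3, 1).contiguous()    # (B, H, W, C)
        flat = z_perm.view(-1, z_perm.shape[-1])       # (B*H*W, C)
        # Squared distance to every codebook vector, then pick the nearest one
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.embedding.weight.t()
                 + self.embedding.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)
        quantized = self.embedding(indices).view_as(z_perm)
        # Codebook loss pulls embeddings toward encoder outputs;
        # commitment loss pulls encoder outputs toward their chosen code
        vq_loss = (F.mse_loss(quantized, z_perm.detach())
                   + self.commitment_cost * F.mse_loss(quantized.detach(), z_perm))
        # Straight-through estimator: copy decoder gradients past the quantization
        quantized = z_perm + (quantized - z_perm).detach()
        return quantized.permute(0, 3, 1, 2).contiguous(), vq_loss
```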

✅ Pros:

  • Sharper and more detailed reconstructions
  • Discrete representations better for downstream tasks

❌ Cons:

  • Slightly harder to train
  • Requires codebook tuning (size, commitment cost)

📸 Sample Result (128×128, 50 epochs)

πŸ–ΌοΈ Original:     93.71 KB
πŸ” Reconstructed: 3.66 KB
πŸ“‰ Compression Rate: 25.58x
Enter fullscreen mode Exit fullscreen mode

βš™οΈ Why These Architectures?

I chose VAE and VQ-VAE because they represent two fundamentally different approaches to learning compressed representations:

|              | VAE                       | VQ-VAE                   |
| ------------ | ------------------------- | ------------------------ |
| Latent Space | Continuous (Gaussian)     | Discrete (codebook)      |
| Output Style | Smooth, blurry            | Crisp, pixel-accurate    |
| Use Case     | Interpolation, generation | Compression, deployment  |

In practice, the difference was immediately visible: VQ-VAE produced sharper eyes, better skin texture, and preserved the facial layout more accurately.


📊 Comparison Results

| Model  | Resolution | Epochs | Output Size | Compression Rate | Visual Quality |
| ------ | ---------- | ------ | ----------- | ---------------- | -------------- |
| VAE    | 64×64      | 20     | 1.54 KB     | 60.85×           | ⭐⭐☆☆☆         |
| VAE    | 64×64      | 50     | 1.62 KB     | 57.84×           | ⭐⭐⭐☆☆         |
| VQ-VAE | 64×64      | 20     | 1.62 KB     | 57.98×           | ⭐⭐⭐⭐☆         |
| VQ-VAE | 128×128    | 50     | 3.66 KB     | 25.58×           | ⭐⭐⭐⭐⭐         |

πŸ–ΌοΈ Visual Comparison

VQ-VAE 128×128 – 50 Epochs

(image: vqvae_128_50ep)

VQ-VAE 64×64 – 20 Epochs

(image: vqvae_64_20ep)

VAE 64×64 – 20 Epochs

(image: vae_64_20ep)

VAE 64×64 – 50 Epochs

(image: vae_64_50ep)


📉 Loss Curves & Insights

VAE Training Loss

(image: vae_loss)

  • Converges smoothly after ~35 epochs
  • Most gain occurs early (first 20 epochs)

VQ-VAE Training Losses

(image: vqvae_loss)

  • Breakdown: total, reconstruction, and VQ commitment loss
  • VQ loss stabilizes quickly while reconstruction improves more gradually

🧠 Takeaways

  • VAE is easier to train and interpret but suffers from blur due to probabilistic sampling
  • VQ-VAE captures high-frequency structure better and preserves identity at higher compression
  • At 64x64, both models compress extremely well, but VQ-VAE outperforms visually
  • At 128x128, VQ-VAE dominates in realism and perceptual clarity

💻 Run the Code Yourself

```bash
git clone https://github.com/Ertugrulmutlu/VQVAE-and-VAE
cd VQVAE-and-VAE
pip install -r requirements.txt
python main.py
```

If you found this comparison helpful or insightful, consider ⭐ starring the GitHub repository, and feel free to reach out with feedback or questions!

– Ertuğrul Mutlu (GitHub)
