🎭 Compressing Human Faces with VAE vs VQ-VAE: A Deep Dive into Autoencoder Design

"Can neural networks really compress faces efficiently, without losing identity?"

In this post, I explore this question by building and comparing two popular generative compression architectures: Variational Autoencoder (VAE) and Vector Quantized VAE (VQ-VAE), trained on passport-style human face images.

🔗 GitHub Repository
📂 Dataset Source (Kaggle)


📦 Why Autoencoders for Image Compression?

Autoencoders learn to reconstruct input data from a compact representation (latent space). This enables lossy compression by:

  • Removing irrelevant pixel-level noise
  • Learning semantic structure (e.g., eyes, nose, face contour)
  • Outputting reconstructions that are visually close to original but much smaller in size

But not all autoencoders are created equal. Let's break down how VAE and VQ-VAE differ, and which one works best for face images.


🔧 Project Setup

  • Dataset: 3000+ frontal face images from Kaggle (balanced by lighting, expression, and gender)
  • All images resized to 64Γ—64 or 128Γ—128
  • Trained on CPU with PyTorch
  • Output format: JPEG (quality=85)
```bash
# Install dependencies
pip install -r requirements.txt
```
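
For reference, here is a minimal preprocessing sketch of the resize-and-save step described above. The folder names and resampling settings are placeholders, not the repo's actual pipeline:

```python
# Minimal preprocessing sketch: resize raw face images and re-save them as
# JPEG (quality=85). Folder names below are hypothetical placeholders.
from pathlib import Path
from PIL import Image

RAW_DIR = Path("data/raw_faces")   # hypothetical input folder
OUT_DIR = Path("data/faces_64")    # hypothetical output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

for img_path in RAW_DIR.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    img = img.resize((64, 64))                                    # or (128, 128)
    img.save(OUT_DIR / img_path.name, format="JPEG", quality=85)
```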

🧠 Architecture 1: Variational Autoencoder (VAE)

VAE is a probabilistic generative model that learns a continuous latent space:

  • Encoder outputs mean (ΞΌ) and log variance (logσ²)
  • Latent vector sampled as: z = ΞΌ + Οƒ * Ξ΅ where Ξ΅ ~ N(0,1)
  • Decoder reconstructs image from z
```python
h = encoder(x)                   # shared encoder features
mu = fc_mu(h)                    # mean of q(z|x)
logvar = fc_logvar(h)            # log-variance of q(z|x)
z = reparameterize(mu, logvar)   # sample z = mu + sigma * eps
x_hat = decoder(z)               # reconstruction
```

Loss = MSE reconstruction + KL divergence (to keep the latent distribution close to a standard Gaussian)
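
Putting both terms together, a minimal sketch of the reparameterization trick and the combined loss might look like this (the exact reduction and KL weighting used in the repo may differ):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, 1); keeps sampling differentiable
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term (MSE) plus KL divergence to the standard normal prior
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```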

✅ Pros:

  • Smooth latent space, good for interpolation
  • Easy to implement

❌ Cons:

  • Blurry outputs due to probabilistic sampling
  • Gaussian prior limits representation precision

📸 Sample Result (64×64, 50 epochs)

πŸ–ΌοΈ Original:     93.71 KB
πŸ” Reconstructed: 1.62 KB
πŸ“‰ Compression Rate: 57.84x
Enter fullscreen mode Exit fullscreen mode
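
For context, the compression rate reported here is simply the ratio of the two file sizes on disk; a tiny helper like the one below reproduces it (paths are placeholders):

```python
import os

def compression_rate(original_path: str, reconstructed_path: str) -> float:
    # Ratio of the original and reconstructed JPEG file sizes on disk
    return os.path.getsize(original_path) / os.path.getsize(reconstructed_path)

# e.g. 93.71 KB / 1.62 KB ≈ 57.84x
```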

🧠 Architecture 2: Vector Quantized VAE (VQ-VAE)

VQ-VAE replaces the continuous latent space with discrete codebook vectors:

  • Encoder outputs feature map β†’ quantized to nearest embedding
  • Decoder reconstructs image from quantized features
```python
z = encoder(x)                            # continuous feature map
quantized, vq_loss = vector_quantizer(z)  # snap each position to its nearest codebook vector
x_hat = decoder(quantized)                # reconstruction
```

Loss = MSE reconstruction + VQ loss (codebook and commitment terms)
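
The vector_quantizer call above hides the interesting part. Below is a minimal sketch of such a module, with nearest-neighbour lookup, the codebook/commitment losses, and straight-through gradients; the codebook size and commitment cost are placeholder values, not necessarily those used in the repo:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    # Nearest-neighbour codebook lookup with commitment loss and
    # straight-through gradients (hyperparameters are placeholders).
    def __init__(self, num_embeddings=512, embedding_dim=64, commitment_cost=0.25):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.embedding.weight.data.uniform_(-1 / num_embeddings, 1 / num_embeddings)
        self.commitment_cost = commitment_cost

    def forward(self, z):                              # z: (B, C, H, W), C == embedding_dim
        z_perm = z.permute(0, 2, 3, 1).contiguous()    # (B, H, W, C)
        flat = z_perm.view(-1, z_perm.shape[-1])       # (B*H*W, C)
        # Squared distance to every codebook vector, then pick the nearest one
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.embedding.weight.t()
                 + self.embedding.weight.pow(2).sum(1))
        indices = dists.argmin(dim=1)
        quantized = self.embedding(indices).view_as(z_perm)
        # Codebook loss pulls embeddings toward encoder outputs;
        # commitment loss pulls encoder outputs toward their chosen code
        vq_loss = (F.mse_loss(quantized, z_perm.detach())
                   + self.commitment_cost * F.mse_loss(quantized.detach(), z_perm))
        # Straight-through estimator: copy decoder gradients past the quantization
        quantized = z_perm + (quantized - z_perm).detach()
        return quantized.permute(0, 3, 1, 2).contiguous(), vq_loss
```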

✅ Pros:

  • Sharper and more detailed reconstructions
  • Discrete representations better for downstream tasks

❌ Cons:

  • Slightly harder to train
  • Requires codebook tuning (size, commitment cost)

📸 Sample Result (128×128, 50 epochs)

πŸ–ΌοΈ Original:     93.71 KB
πŸ” Reconstructed: 3.66 KB
πŸ“‰ Compression Rate: 25.58x
Enter fullscreen mode Exit fullscreen mode

βš™οΈ Why These Architectures?

I chose VAE and VQ-VAE because they represent two fundamentally different approaches to learning compressed representations:

|              | VAE                       | VQ-VAE                   |
| ------------ | ------------------------- | ------------------------ |
| Latent Space | Continuous (Gaussian)     | Discrete (codebook)      |
| Output Style | Smooth, blurry            | Crisp, pixel-accurate    |
| Use Case     | Interpolation, generation | Compression, deployment  |

In practice, the difference was immediately visible: VQ-VAE produced sharper eyes, better skin texture, and preserved the facial layout more accurately.


📊 Comparison Results

| Model  | Resolution | Epochs | Output Size | Compression Rate | Visual Quality |
| ------ | ---------- | ------ | ----------- | ---------------- | -------------- |
| VAE    | 64×64      | 20     | 1.54 KB     | 60.85×           | ⭐⭐☆☆☆         |
| VAE    | 64×64      | 50     | 1.62 KB     | 57.84×           | ⭐⭐⭐☆☆         |
| VQ-VAE | 64×64      | 20     | 1.62 KB     | 57.98×           | ⭐⭐⭐⭐☆         |
| VQ-VAE | 128×128    | 50     | 3.66 KB     | 25.58×           | ⭐⭐⭐⭐⭐         |

πŸ–ΌοΈ Visual Comparison

VQ-VAE 128×128 – 50 Epochs

(image: vqvae_128_50ep)

VQ-VAE 64×64 – 20 Epochs

(image: vqvae_64_20ep)

VAE 64×64 – 20 Epochs

(image: vae_64_20ep)

VAE 64×64 – 50 Epochs

(image: vae_64_50ep)


📉 Loss Curves & Insights

VAE Training Loss

(image: vae_loss)

  • Converges smoothly after ~35 epochs
  • Most gain occurs early (first 20 epochs)

VQ-VAE Training Losses

(image: vqvae_loss)

  • Breakdown: total, reconstruction, and VQ commitment loss
  • VQ loss stabilizes quickly while reconstruction improves more gradually

🧠 Takeaways

  • VAE is easier to train and interpret but suffers from blur due to probabilistic sampling
  • VQ-VAE captures high-frequency structure better and preserves identity at higher compression
  • At 64x64, both models compress extremely well, but VQ-VAE outperforms visually
  • At 128x128, VQ-VAE dominates in realism and perceptual clarity

💻 Run the Code Yourself

```bash
git clone https://github.com/Ertugrulmutlu/VQVAE-and-VAE
cd VQVAE-and-VAE
pip install -r requirements.txt
python main.py
```

If you found this comparison helpful or insightful, consider ⭐ starring the GitHub repository, and feel free to reach out with feedback or questions!

– Ertuğrul Mutlu (GitHub)
