"Can neural networks really compress faces efficiently, without losing identity?"
In this post, I explore this question by building and comparing two popular generative compression architectures, the Variational Autoencoder (VAE) and the Vector Quantized VAE (VQ-VAE), trained on passport-style human face images.
🔗 GitHub Repository
🔗 Dataset Source (Kaggle)
📦 Why Autoencoders for Image Compression?
Autoencoders learn to reconstruct input data from a compact representation (latent space). This enables lossy compression by:
- Removing irrelevant pixel-level noise
- Learning semantic structure (e.g., eyes, nose, face contour)
- Outputting reconstructions that are visually close to the original but much smaller in size
But not all autoencoders are created equal. Let's break down how VAE and VQ-VAE differ, and which one works best for face images.
🔧 Project Setup
- Dataset: 3000+ frontal face images from Kaggle (balanced by lighting, expression, and gender)
- All images resized to 64×64 or 128×128
- Trained on CPU with PyTorch
- Output format: JPEG (quality=85), sketched below
# Install dependencies
pip install -r requirements.txt
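For context, the reconstructions are written to disk as JPEGs at quality 85. Below is a minimal sketch of that step, assuming the decoder output is a (C, H, W) float tensor in [0, 1]; the function name is mine, not the repository's.

from PIL import Image
import torch

def save_reconstruction_jpeg(x_hat: torch.Tensor, path: str, quality: int = 85):
    # Convert a (C, H, W) float tensor in [0, 1] to a uint8 HWC array, then save as JPEG
    array = (x_hat.clamp(0, 1) * 255).byte().permute(1, 2, 0).cpu().numpy()
    Image.fromarray(array).save(path, format="JPEG", quality=quality)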
🧠 Architecture 1: Variational Autoencoder (VAE)
VAE is a probabilistic generative model that learns a continuous latent space:
- Encoder outputs mean (μ) and log variance (log σ²)
- Latent vector sampled as z = μ + σ · ε, where ε ~ N(0, 1)
- Decoder reconstructs the image from z
# VAE forward pass
mu = fc_mu(encoder(x))          # mean of the latent Gaussian
logvar = fc_logvar(encoder(x))  # log variance of the latent Gaussian
z = reparameterize(mu, logvar)  # sample z = mu + sigma * eps, eps ~ N(0, 1)
x_hat = decoder(z)              # reconstruct the image from z
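The reparameterize helper is the standard reparameterization trick. A minimal PyTorch sketch (the repository's version may differ in detail):

import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    # sigma = exp(log(sigma^2) / 2); sampling this way keeps gradients flowing to mu and logvar
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)  # eps ~ N(0, 1)
    return mu + std * eps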
Loss = MSE reconstruction + KL divergence (to pull the approximate posterior toward a standard Gaussian prior)
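In code, this is the standard ELBO-style objective. A sketch assuming a diagonal Gaussian posterior against an N(0, I) prior, which gives the closed-form KL term; the repository's exact weighting may differ:

import torch
import torch.nn.functional as F

def vae_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")  # pixel-wise reconstruction error
    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl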
✅ Pros:
- Smooth latent space, good for interpolation
- Easy to implement
❌ Cons:
- Blurry outputs due to probabilistic sampling
- Gaussian prior limits representation precision
📸 Sample Result (64×64, 50 epochs)
🖼️ Original: 93.71 KB
📉 Reconstructed: 1.62 KB
📊 Compression Rate: 57.84×
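The compression rate is simply the ratio of file sizes on disk. A quick sketch (paths are placeholders):

import os

def compression_rate(original_path: str, reconstructed_path: str) -> float:
    # e.g. 93.71 KB / 1.62 KB ≈ 57.84×
    return os.path.getsize(original_path) / os.path.getsize(reconstructed_path)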
🧠 Architecture 2: Vector Quantized VAE (VQ-VAE)
VQ-VAE replaces the continuous latent space with discrete codebook vectors:
- Encoder outputs a feature map → quantized to the nearest codebook embedding
- Decoder reconstructs image from quantized features
# VQ-VAE forward pass
z = encoder(x)                            # continuous feature map
quantized, vq_loss = vector_quantizer(z)  # snap each vector to its nearest codebook entry
x_hat = decoder(quantized)                # reconstruct from the quantized features
Loss = MSE reconstruction + VQ loss (codebook and commitment terms)
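To make the quantization step concrete, here is a compact quantizer in the spirit of the original VQ-VAE paper, with a straight-through gradient estimator and codebook plus commitment terms. This is a sketch under those standard assumptions; the repository's version may differ (e.g., EMA codebook updates), and the hyperparameters shown are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_embeddings: int = 512, embedding_dim: int = 64,
                 commitment_cost: float = 0.25):
        super().__init__()
        self.commitment_cost = commitment_cost
        self.codebook = nn.Embedding(num_embeddings, embedding_dim)
        self.codebook.weight.data.uniform_(-1 / num_embeddings, 1 / num_embeddings)

    def forward(self, z):
        # z: (B, C, H, W) -> flatten to (B*H*W, C) for nearest-neighbour lookup
        B, C, H, W = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, C)
        # Squared distance from each encoder vector to every codebook vector
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(1)
        quantized = self.codebook(indices).view(B, H, W, C).permute(0, 3, 1, 2)
        # Codebook loss pulls embeddings toward encoder outputs;
        # commitment loss keeps encoder outputs close to their chosen codes
        vq_loss = F.mse_loss(quantized, z.detach()) \
                  + self.commitment_cost * F.mse_loss(z, quantized.detach())
        # Straight-through estimator: copy gradients from quantized back to z
        quantized = z + (quantized - z).detach()
        return quantized, vq_loss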
✅ Pros:
- Sharper and more detailed reconstructions
- Discrete representations better for downstream tasks
❌ Cons:
- Slightly harder to train
- Requires codebook tuning (size, commitment cost)
📸 Sample Result (128×128, 50 epochs)
🖼️ Original: 93.71 KB
📉 Reconstructed: 3.66 KB
📊 Compression Rate: 25.58×
⚖️ Why These Architectures?
I chose VAE and VQ-VAE because they represent two fundamentally different approaches to learning compressed representations:
| | VAE | VQ-VAE |
| --- | --- | --- |
| Latent Space | Continuous (Gaussian) | Discrete (codebook) |
| Output Style | Smooth, blurry | Crisp, pixel-accurate |
| Use Case | Interpolation, generation | Compression, deployment |
In practice, the difference was immediately visible: VQ-VAE produced sharper eyes, better skin texture, and preserved the facial layout more accurately.
📊 Comparison Results
| Model | Resolution | Epochs | Output Size | Compression Rate | Visual Quality |
| --- | --- | --- | --- | --- | --- |
| VAE | 64×64 | 20 | 1.54 KB | 60.85× | Blurry |
| VAE | 64×64 | 50 | 1.62 KB | 57.84× | Blurry, slightly improved |
| VQ-VAE | 64×64 | 20 | 1.62 KB | 57.98× | Sharper |
| VQ-VAE | 128×128 | 50 | 3.66 KB | 25.58× | Sharpest, most realistic |
🖼️ Visual Comparison
VQ-VAE 128×128, 50 Epochs
VQ-VAE 64×64, 20 Epochs
VAE 64×64, 20 Epochs
VAE 64×64, 50 Epochs
📈 Loss Curves & Insights
VAE Training Loss
- Converges smoothly after ~35 epochs
- Most gain occurs early (first 20 epochs)
VQ-VAE Training Losses
- Breakdown: total, reconstruction, and VQ commitment loss
- VQ loss stabilizes quickly while reconstruction improves more gradually
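Tracking the three curves separately is straightforward if the training loop returns each term. A sketch, assuming the encoder/quantizer/decoder split from the earlier snippets (function and variable names are mine):

import torch.nn.functional as F

def train_epoch(encoder, decoder, quantizer, loader, optimizer):
    # Returns mean total / reconstruction / VQ losses for one epoch
    totals, recons, vqs = [], [], []
    for x, _ in loader:
        z = encoder(x)
        quantized, vq_loss = quantizer(z)
        x_hat = decoder(quantized)
        recon_loss = F.mse_loss(x_hat, x)
        loss = recon_loss + vq_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        totals.append(loss.item())
        recons.append(recon_loss.item())
        vqs.append(vq_loss.item())
    n = len(totals)
    return sum(totals) / n, sum(recons) / n, sum(vqs) / n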
🧠 Takeaways
- VAE is easier to train and interpret but suffers from blur due to probabilistic sampling
- VQ-VAE captures high-frequency structure better and preserves identity at higher compression
- At 64×64, both models compress extremely well, but VQ-VAE outperforms visually
- At 128×128, VQ-VAE dominates in realism and perceptual clarity
💻 Run the Code Yourself
git clone https://github.com/Ertugrulmutlu/VQVAE-and-VAE
cd VQVAE-and-VAE
pip install -r requirements.txt
python main.py
If you found this comparison helpful or insightful, consider ⭐ starring the GitHub repository, and feel free to reach out with feedback or questions!
⭐ GitHub: Ertuğrul Mutlu