Mitansh Gor
GEN-AI-3 : VAE

So far, we’ve seen how Autoencoders (AEs) can take data—like images, audio, or text—and compress it into a lower-dimensional space, only to reconstruct it again like a digital magician pulling a rabbit out of a hat. Pretty neat, right?


But there’s a catch: while AEs are great at learning compact representations, they’re not exactly dreamers. Their latent space—the core of their compressed understanding—isn’t built for imagination. Try sampling from it randomly, and you’re more likely to get noise than meaningful data.

Now, what if we wanted a model that not only compresses data but can also generate new, meaningful samples that look like they came from the original dataset? A model that understands probability, uncertainty, and can dream up new content with finesse?

🎉 Enter the Variational Autoencoder (VAE).

In this post, we’re going to unpack how VAEs work, why they’re a major leap forward from traditional autoencoders, and how they lay the groundwork for some of the most exciting generative models in AI today.

Let’s dive in.

The Problem with Regular Autoencoders 🧩

Traditional autoencoders compress data by learning a direct mapping from input to a latent vector, and then decompress it using a decoder. While effective for feature learning, they suffer from:


  • Disorganized Latent Space: Nearby latent points don’t necessarily produce similar outputs.
  • Poor Generative Ability: Sampling randomly from the latent space usually results in noisy, incoherent outputs.
  • No Uncertainty Modeling: The model doesn't capture how confident it is in the latent representation.

In short, we needed structure, smoothness, and the ability to reason probabilistically.


Variational Autoencoders

VAEs solve these issues by introducing two key ideas:

  1. Injecting randomness into the encoding process.
  2. Imposing constraints on the distribution of the latent space.

Instead of encoding an input to a fixed latent vector, the VAE encodes it to a probability distribution: typically a multivariate normal (Gaussian) centered at a point in the latent space, from which it then samples.

By doing so, the VAE doesn’t just compress data; it learns to imagine variations.



🎯 Constraints on the latent space

To bring structure to the latent space, we constrain how encodings are distributed:

🌀 Centering: Each encoded distribution should be centered as close to the origin (0, 0, ..., 0) as possible.
📏 Unit Variance: The spread (standard deviation) of each distribution should be close to 1.

The further the encoder deviates from these goals, the higher the loss during training.

Why? This forces all encoded samples to live near the same area in latent space, enabling smooth interpolation and consistent generation.
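
To make this concrete, here is a tiny numeric sketch of how the penalty grows as an encoding drifts from these targets. It assumes the closed-form KL divergence between N(μ, σ²) and the standard normal N(0, 1), which the loss section below introduces properly:

import numpy as np

def kl_to_standard_normal(mu, sigma):
    # KL( N(mu, sigma^2) || N(0, 1) ) for a single latent dimension
    return 0.5 * (mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

print(kl_to_standard_normal(0.0, 1.0))  # 0.0  -> exactly on target, no penalty
print(kl_to_standard_normal(3.0, 1.0))  # 4.5  -> centered too far from the origin
print(kl_to_standard_normal(0.0, 3.0))  # ~2.9 -> spread much wider than 1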


Changes in VAE Encoder

The encoder no longer outputs just a point, but rather, it parameterizes a distribution from which we can sample.

x → (μ, σ) → z ~ N(μ, σ²)


We do this because we want to generate new data by sampling from a continuous, meaningful latent space.

  • z_mean (μ): The center of the distribution where we want our encoding to be.

  • z_log_var (log(σ²)): The (log of) the variance, controlling the spread of that distribution.

Together, they define a normal distribution, from which we will later sample to get a latent variable z for decoding.

Now the question is: why log(σ²) and not just σ²?
There are two major reasons: the positivity constraint and numerical stability.

  1. Variance Must Be Positive: If we tried to output σ (or σ²) directly from a neural network, we would have to force the network to produce only positive values. But neural nets naturally output values from (−∞, ∞). Outputting log(σ²) instead, and recovering σ by exponentiation, guarantees σ > 0, always.
  2. Numerical Stability: We work in the log domain, where multiplications become additions. This avoids premature underflow for tiny variances and the unstable, tiny gradients we would get by optimizing σ directly.

Imagine we want to allow the encoder to learn variances from 0.0001 to 1000.
If we output σ² directly, the network must learn to span that huge dynamic range. But if we output log(σ²), the values only range from about
log(0.0001) = −9.2 to log(1000) = 6.9.
A much more manageable range!
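
Here is a minimal NumPy sketch of the two encoder heads. The weights below are random placeholders standing in for learned layers; the point is the shape of the outputs and the log-variance trick:

import numpy as np

rng = np.random.default_rng(0)
feature_dim, latent_dim = 64, 2

# Placeholder weights for the two output heads (learned in a real VAE)
W_mu, b_mu = rng.normal(size=(feature_dim, latent_dim)), np.zeros(latent_dim)
W_lv, b_lv = rng.normal(size=(feature_dim, latent_dim)), np.zeros(latent_dim)

h = rng.normal(size=feature_dim)       # features from earlier encoder layers

z_mean    = h @ W_mu + b_mu            # μ: where the encoding is centered
z_log_var = h @ W_lv + b_lv            # log(σ²): any real number is allowed
sigma     = np.exp(0.5 * z_log_var)    # σ = exp(log(σ²) / 2) is always positive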

But wait!
There is one problem.


Let's walk through it with an example.

Suppose the encoder outputs μ = 0.5 and σ = 1.0, and we sample directly:

import numpy as np

mu, sigma = 0.5, 1.0                 # values produced by the encoder
z = np.random.normal(mu, sigma)      # e.g. 0.23 or 1.42, different on every call

This operation picks a random number with no guarantee of what comes out. Worse, there is no way to tell how z would change if μ or σ changed, because the randomness hides the function's slope: the sampling step is not differentiable with respect to μ and σ.

And without a gradient, the network can’t learn.

How can we deal with it?

Reparameterization Trick


Here’s the twist:

Instead of sampling like this:
z ∼ N(μ, σ²)
we reparameterize it as:
z = μ + σ ⋅ ε, where ε ∼ N(0, 1)
Here:

  • ε = Random noise from a fixed distribution
  • μ, σ = Output from the encoder (learnable)
  • z = Latent vector to feed into the decoder

Why this works:

  • ε is independent of the network, so its randomness doesn’t interfere with gradient flow.
  • μ and σ now enter only through a deterministic operation (multiplication and addition), so gradients can be calculated.
Now, you can compute:

∂z/∂μ = 1
∂z/∂σ = ε


Backpropagation is happy again. 🎉

This tiny trick makes backpropagation work through the stochastic layer. Without it, training would collapse.
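
In code the trick is a one-liner. A minimal NumPy sketch, reusing the toy values from the example above:

import numpy as np

mu, sigma = 0.5, 1.0
eps = np.random.normal(0.0, 1.0)   # all the randomness lives here, outside the learnable path
z = mu + sigma * eps               # deterministic in mu and sigma: dz/dmu = 1, dz/dsigma = eps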



Changes in the VAE Decoder vs. the AE

The decoder no longer sees a fixed vector in the latent space.
Instead, it gets:
z = μ + σ ⋅ ε
The decoder must be able to take any nearby sample around μ and still reconstruct a very similar output.
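
Putting the pieces together, a hypothetical end-to-end forward pass looks roughly like this. Note that encode and decode here are toy stand-ins for the real encoder and decoder networks, not functions from this post:

import numpy as np

rng = np.random.default_rng(0)

def encode(x):                          # stand-in for the encoder network
    return np.zeros(2), np.zeros(2)     # (z_mean, z_log_var)

def decode(z):                          # stand-in for the decoder network
    return z                            # a real decoder maps z back to data space

x = rng.normal(size=10)                 # a toy "input"
z_mean, z_log_var = encode(x)
sigma = np.exp(0.5 * z_log_var)         # recover σ from log(σ²)
eps = rng.normal(size=z_mean.shape)     # fresh noise from N(0, 1)
z = z_mean + sigma * eps                # the sample the decoder actually sees
x_hat = decode(z)                       # nearby z values should decode to similar outputs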


VAE Loss Function: More Than Just Reconstruction


For autoencoders, we used only a reconstruction loss. That is no longer enough: we are now dealing with probability distributions, and we need to keep the latent distributions close to a standard normal, N(0, 1).

So we add a Kullback–Leibler (KL) divergence term:

Loss = Reconstruction Loss + β · KL Divergence

The KL term measures how much our learned distribution N(μ, σ²) deviates from the standard normal. A higher KL means our encoding is straying too far from it.

Think of it like this:
Reconstruction loss: "How well did we recreate the input?"
KL divergence: "How wild is our latent distribution? Should we calm it down?"
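
As a concrete sketch, here is what that combined loss could look like, using the standard closed-form KL between a diagonal Gaussian and N(0, 1). The mean-squared-error reconstruction term and the default β are illustrative choices, not necessarily what this post's code uses:

import numpy as np

def vae_loss(x, x_hat, z_mean, z_log_var, beta=1.0):
    # Reconstruction term: how well did we recreate the input? (MSE as an example)
    reconstruction = np.mean((x - x_hat) ** 2)
    # KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = -0.5 * np.sum(1.0 + z_log_var - z_mean**2 - np.exp(z_log_var))
    return reconstruction + beta * kl

# A perfectly standard-normal encoding (mu = 0, log_var = 0) contributes zero KL
print(vae_loss(np.zeros(4), np.zeros(4), np.zeros(2), np.zeros(2)))   # 0.0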


What does the β parameter do?

The β coefficient balances reconstruction and regularization.

  1. If β is too small, the KL divergence is ignored. The latent space becomes disorganized, similar to a vanilla AE. Good reconstructions, bad generation.

  2. If β is too large, the model prioritizes matching N(0,1) over reconstruction. All samples start looking the same—blurry outputs, poor expressiveness.

✅ Sweet spot: When β is balanced, we get coherent generation and meaningful reconstructions.


🧱 Disadvantages of VAEs

  • Tends to generate blurry images
  • KL term is tricky to balance with reconstruction loss
  • Not ideal for high-resolution data (GANs often outperform here)


TL;DR: VAEs are smarter but also harder to tame.


🧭 Wrap-up: When to Use What and Why It All Matters

  • Use Autoencoders when you need compression, denoising, or anomaly detection.
  • Use VAEs when you need controlled generation, diversity, and smooth interpolation.
  • Use GANs when you need photo-realism.

The future of generative models lies in hybrids — combining VAEs, GANs, and Diffusion models.
