You've heard the words. Neural networks. Deep learning AI. Transformers. Maybe you've even heard that deep learning is what powers ChatGPT, image generators, voice assistants, and self-driving cars. And now you're wondering: what actually is it?
This guide is the honest, comprehensive answer. Not watered down. Not padded with fluff. Just the core ideas, the key architectures, and the intuition you need to actually understand what's going on under the hood, explained clearly enough that a beginner can follow it, and thoroughly enough that it stays useful as you go deeper.
If you haven't read our post on machine learning basics yet, I'd recommend starting there. Deep learning builds directly on top of machine learning, and a few concepts from that post will make this one click much faster. That said, let's get into it.
What Is Deep Learning?
Deep learning is a branch of machine learning that uses artificial neural networks (systems loosely inspired by the structure of the human brain) to learn patterns from data.
The word "deep" refers to depth: the number of layers stacked inside the network. A shallow network might have one or two layers. A deep network might have dozens or even hundreds; modern large language models, for example, typically stack tens to over a hundred Transformer blocks. Each layer transforms the data slightly, learning increasingly abstract representations as you move through the stack.
Here's the key insight that separates deep learning from classical machine learning: traditional ML algorithms typically need humans to engineer features manually. You decide what inputs matter. In deep learning, the network learns its own features directly from raw data (pixels, waveforms, text characters) without being told what to look for.
That's what makes it so powerful. And that's what makes it so data-hungry.
"Deep learning is not magic. It's a very large function with millions of parameters, trained on enormous amounts of data, that has learned to approximate patterns no human could write by hand."
What Is Deep Learning vs Machine Learning?
It's a nested relationship, not a competition. All deep learning is machine learning, but not all machine learning is deep learning.
Classical machine learning (decision trees, random forests, gradient boosting) works well on structured, tabular data. It's interpretable, efficient, and still dominant across most real-world business applications.
Deep learning takes over when the data is unstructured: images, audio, video, raw text. These formats have too many dimensions and too much complexity for traditional algorithms to handle well. Neural networks, with their layered feature learning, are built exactly for this.
- Use classical ML for credit scoring, demand forecasting, fraud detection, churn prediction: structured tables, interpretability required
- Use deep learning for image recognition, speech, language understanding, video: unstructured data, pattern complexity at scale
The honest take: deep learning is not always better. It requires significantly more data and compute. When you have a clean tabular dataset and a clear prediction task, XGBoost will often beat a neural network and train in seconds rather than hours.
Neural Network Basics
Before we get into specific architectures, you need to understand the building blocks. Every deep learning model, regardless of how complex, is built from the same core components.
Neurons, Layers, and Activation Functions
A neuron is the basic unit. It takes a set of inputs, multiplies each by a learned weight, sums everything up, and passes the result through an activation function. That output feeds into the next layer.
The image here compares a biological neuron with an artificial one (on the right). You do not need to understand the formula; just note the following: three inputs arrive at the neuron (these are the output values of neurons in the previous layer), and each one has its own weight. Inside there is a scary-looking formula. Trust me, it is not that scary: it is basically a fancy way of saying "multiply each input by its weight and sum them all up!" There is just one more small thing, which I will describe in a moment: we apply an activation function to the result.
That's all!
So... what is an activation function? The activation function is what gives neural networks their power. Without it, stacking layers would be mathematically equivalent to having just one layer: you'd just be composing linear transformations, and a composition of linear transformations is itself linear. Activation functions introduce non-linearity, which is what allows the network to learn complex patterns. If you did not completely get it, that is okay; this is the most mathematical part of the explanation!
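To make the neuron concrete, here is a minimal sketch in plain Python. The input values, weights and function names are made up for illustration, and the bias term is a standard addition that the description above leaves out. The second half checks the claim that two stacked neurons without an activation collapse into a single linear one.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)

def relu(z):
    return max(0.0, z)

def identity(z):
    return z

# Three inputs arriving from the previous layer, each with its own weight
# (hypothetical values chosen by hand).
inputs = [1.0, 2.0, 3.0]
weights = [0.5, -1.0, 0.25]
bias = 0.1

linear_out = neuron(inputs, weights, bias, identity)  # 0.5 - 2.0 + 0.75 + 0.1 = -0.65
relu_out = neuron(inputs, weights, bias, relu)        # the negative sum is clipped to 0.0

# Without an activation, two stacked one-input neurons collapse into one:
# w2 * (w1 * x + b1) + b2 == (w2 * w1) * x + (w2 * b1 + b2)
def two_linear_layers(x, w1=2.0, b1=1.0, w2=3.0, b2=-4.0):
    return w2 * (w1 * x + b1) + b2

def one_equivalent_layer(x, w=6.0, b=-1.0):
    return w * x + b
```

Putting a ReLU between the two layers breaks that equivalence, which is exactly why activations matter.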
The three you'll see everywhere:
- ReLU (Rectified Linear Unit): outputs zero for negative inputs, passes positive inputs through unchanged. Simple, fast, and the default choice for hidden layers in most networks. The fact that something this simple works so well is one of the quiet surprises of deep learning.
- Sigmoid: squashes output to a value between 0 and 1. Used in binary classification output layers where you want a probability.
- Softmax: generalises sigmoid to multiple classes. Takes a vector of raw scores and converts them into probabilities that sum to 1. Used in the final layer of most multi-class classifiers.
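The three functions in the list are short enough to write out by hand. This is a sketch in plain Python rather than any framework's implementation; the branch in `sigmoid` and the max-subtraction in `softmax` are common numerical-stability tricks that do not change the result.

```python
import math

def relu(z):
    """Zero for negative inputs, unchanged for positive inputs."""
    return max(0.0, z)

def sigmoid(z):
    """Squash any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)          # avoids overflow for very negative z
    return e / (1.0 + e)

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)          # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # highest score gets the highest probability

# With two classes and the second score fixed at 0, softmax reduces to sigmoid.
two_class = softmax([1.3, 0.0])[0]
```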
So basically, you can think of an activation function as a filter that decides what value the neuron outputs after we have summed up the weighted values from the previous neurons. Different activation functions apply that filter in different ways. If you want to understand it a bit deeper, here is a breakdown of exactly how it works:
Feedforward Networks: How Information Flows
Now you know how an artificial neuron works, great! The next step is to stack neurons into layers, and layers into a network. The simplest kind is the feedforward neural network, where information travels in one direction only: input → hidden layers → output. No loops, no memory, no feedback. Each layer is fully connected to the next: every neuron in layer N connects to every neuron in layer N+1.
The feedforward network is the foundation that every other architecture builds on top of or departs from. (Yes, that includes ChatGPT, Claude and other Transformer-based models. This is where it all starts.)
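As a rough sketch of that one-way flow, here is a tiny feedforward pass in plain Python. The layer sizes and weights are hand-picked for illustration (a trained network would learn them), and `dense` and `forward` are names invented for this example.

```python
def relu(z):
    return max(0.0, z)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input."""
    return [
        activation(sum(x * w for x, w in zip(inputs, neuron_weights)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(x, layers):
    """Pass x through each layer in order: input -> hidden layers -> output."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Hypothetical hand-picked weights: 2 inputs -> 3 hidden neurons -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]], [0.0, 0.0, 0.0], relu),
    ([[1.0, 1.0, 1.0]], [0.5], lambda z: z),
]
output = forward([2.0, 1.0], layers)   # hidden = [1.0, 1.5, 0.0], output = [3.0]
```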
Backpropagation: How the Network Actually Learns
Training a neural network means finding the right weights. You do this by:
1. Making a prediction with the current weights
2. Measuring how wrong it was (the loss, remember this term)
3. Computing how much each weight contributed to that error
4. Nudging every weight slightly in the direction that reduces the loss
Step 3 is backpropagation, the algorithm that efficiently computes the gradient of the loss with respect to every weight in the network, propagating the error signal backward from the output layer to the input. Step 4 is gradient descent, the optimiser that uses those gradients to update the weights.
This loop (forward pass, compute loss, backward pass, update weights) repeats for thousands to millions of iterations during training. That's how a network goes from random noise to something that can recognise faces, translate languages, or generate code.
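Here is what that loop looks like for the smallest possible "network": a single linear neuron fitted to toy data. This is a sketch under simplifying assumptions. The data, learning rate and step count are made up, and with only one neuron the backward pass is a single application of the chain rule; backpropagation applies the same rule layer by layer in a deeper network.

```python
# Toy data generated from y = 2x + 1 (a hypothetical target for illustration).
data = [(x, 2.0 * x + 1.0) for x in [-2.0, -1.0, 0.0, 1.0, 2.0]]

w, b = 0.0, 0.0            # initial weights
learning_rate = 0.1
losses = []

for step in range(200):
    grad_w, grad_b, loss = 0.0, 0.0, 0.0
    for x, y in data:
        pred = w * x + b               # forward pass
        error = pred - y
        loss += error ** 2             # squared-error loss
        grad_w += 2 * error * x        # backward pass: d(loss)/dw via the chain rule
        grad_b += 2 * error            # d(loss)/db
    n = len(data)
    losses.append(loss / n)
    w -= learning_rate * grad_w / n    # gradient descent update
    b -= learning_rate * grad_b / n
```

After training, `w` and `b` end up very close to 2 and 1, and the recorded loss falls from its starting value towards zero.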
Important note: it is worth understanding how neurons, backpropagation, gradient descent and losses work. You will rarely need to do the math by hand in practice, but the theory helps you grasp what you are doing. Do not get stuck on it, but if you can learn it, I highly recommend it.
I included it in this post so you know it exists, but this part and the next two (weight initialisation and batch normalisation) can be skipped, since they are oriented more towards foundational knowledge than towards practice.
Give them a quick read; if you do not fully get it, keep going and don't worry!
Weight Initialisation
How you set the initial weights before training matters more than most tutorials admit. Start them all at zero and the network won't learn: every neuron computes the same thing and the gradients are identical. Start them too large and training becomes unstable. Smart initialisation schemes (Xavier, He initialisation) are designed to keep signal flowing cleanly through the network from the start.
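For reference, the two schemes mentioned above boil down to choosing the spread of the random weights from the layer's size. The sketch below uses the commonly quoted formulas (Xavier/Glorot uniform and He normal); the function names and layer sizes are illustrative assumptions, not a framework API.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, rng):
    """Xavier/Glorot uniform: U(-limit, limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)] for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """He normal: N(0, std^2) with std = sqrt(2 / fan_in), commonly paired with ReLU."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
w_xavier = xavier_uniform(fan_in=256, fan_out=128, rng=rng)
w_he = he_normal(fan_in=256, fan_out=128, rng=rng)
```

In both cases the weights shrink as the layer gets wider, which is what keeps the summed signal from blowing up or dying out.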
Batch Normalisation
As networks get deeper, a problem emerges: the distribution of activations shifts during training, making learning unstable and slow. Batch normalisation addresses this by normalising the inputs to each layer across a mini-batch, keeping activations in a stable range. It's one of those techniques that felt like a trick when it was introduced and turned out to be foundational: it's now standard in many deep architectures, particularly CNNs, while Transformers typically use the closely related layer normalisation.
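The core computation is small. Below is a sketch of the training-time forward pass for a single feature; `gamma` and `beta` stand for the learnable scale and shift, and the example values are made up. At inference time, real implementations use running averages of the mean and variance instead of the batch statistics, which this sketch leaves out.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

activations = [10.0, 12.0, 14.0, 16.0]   # one feature, four examples in the batch
normalised = batch_norm(activations)     # roughly zero mean and unit variance
```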
Key Architectures
Now for the fun part. If you made it this far, congratulations! Keep reading.
Deep learning is not one thing: it's a family of architectures, each designed for a different kind of data and a different kind of problem. Here are the ones worth knowing.
Feedforward Neural Networks (FFNN)
The simplest deep network. Fully connected layers, information flows in one direction, no special structure. This is the architecture that introduces every concept (neurons, activations, backpropagation) in its clearest form.
In practice, pure FFNNs are rarely used for complex tasks. Images have spatial structure that FFNNs ignore. Sequences have temporal dependencies that FFNNs can't capture. But understanding the FFNN deeply is non-negotiable before moving to anything else.
When you'd use it: tabular data, simple classification and regression tasks, as a component inside larger architectures.
Convolutional Neural Networks (CNNs)
CNNs are the architecture that put deep learning on the map. In 2012, a CNN called AlexNet won the ImageNet competition by a margin so large it ended the debate about whether deep learning worked. It did.
The key idea: instead of connecting every neuron to every pixel (computationally insane for large images), CNNs apply small filters that slide across the input, detecting local patterns. Early layers learn to detect edges and textures. Later layers combine those into shapes, objects, faces.
This design is efficient, spatially aware, and extraordinarily effective on anything that has grid-like structure: images, video frames, certain kinds of audio.
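To see the sliding-filter idea in action, here is a sketch of a 2D convolution in plain Python (no padding, stride 1). The image and the vertical-edge filter are hand-written for illustration; a real CNN learns its filter values during training. Strictly speaking this is cross-correlation, which is what deep learning frameworks usually compute under the name "convolution".

```python
def conv2d(image, kernel):
    """Slide a small filter over a 2D grid (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 4x4 image with a vertical edge: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical vertical-edge filter: responds where brightness rises left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)   # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The same four weights are reused at every position, which is why a convolutional layer needs far fewer parameters than a fully connected one.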
When you'd use it: image classification, object detection, medical imaging, video analysis, any problem where spatial patterns matter.
Recurrent Neural Networks (RNNs) and LSTMs
What if your data is a sequence (a sentence, a time series, an audio clip) where the order of elements matters?
FFNNs and CNNs don't have memory. They process each input independently. RNNs fix this by feeding the hidden state from the previous step into the current step, giving the network a form of short-term memory.
In theory, this lets RNNs capture long-range dependencies. In practice, they struggle: gradients either explode or vanish as they travel through many time steps, making it hard to learn patterns that span long sequences.
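A single-unit RNN is enough to show both the memory and the vanishing gradient. This sketch uses made-up scalar weights (`w_x`, `w_h`) and a tanh activation, which is a common choice but an assumption here. The last function multiplies the per-step derivatives to show how quickly the gradient shrinks over a long sequence.

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """One step: the new hidden state mixes the current input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0                      # initial hidden state: no memory yet
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)
        states.append(h)
    return states

def gradient_through_time(states, w_h):
    """d(h_T)/d(h_1): a product with one factor per time step."""
    grad = 1.0
    for h in states[1:]:
        grad *= w_h * (1.0 - h * h)   # derivative of tanh(z) is 1 - tanh(z)^2
    return grad

states_a = run_rnn([1.0, 0.0, 0.0])   # the first input still echoes in later states
states_b = run_rnn([0.0, 0.0, 0.0])   # nothing to remember: every state stays at 0.0

long_states = run_rnn([1.0] + [0.0] * 49)
grad = gradient_through_time(long_states, w_h=0.8)   # tiny after 50 steps
```

With `w_h` above 1 and small hidden states the same product would grow instead of shrink, which is the exploding-gradient side of the problem.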
LSTMs (Long Short-Term Memory networks) address this with a more sophisticated memory mechanism: gates that control what information to keep, what to forget, and what to output. LSTMs were the state of the art for language and sequence tasks for years before Transformers arrived.
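For the curious, here is a sketch of one step of a single-unit LSTM following the standard gate equations. The parameter dictionary and its values are hypothetical, chosen to make the gates' effect obvious: a large forget-gate bias keeps the memory, a very negative one erases it.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a single-unit LSTM. p maps each gate name to (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = p[name]
        return squash(w_x * x + w_h * h_prev + b)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much new information to write
    g = gate("candidate", math.tanh)   # the new information itself
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # updated long-term memory (cell state)
    h = o * math.tanh(c)               # updated hidden state / output
    return h, c

# Hypothetical parameters: keep the old memory, write almost nothing new.
keep_params = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 10.0),
}
h_kept, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=keep_params)

# Same cell, but with the forget gate closed: the old memory is wiped.
forget_params = dict(keep_params, forget=(0.0, 0.0, -10.0))
h_wiped, c_wiped = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, p=forget_params)
```

Because the cell state is updated by addition rather than repeated squashing, information (and gradient) can survive many more steps than in a plain RNN.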
The following is an illustration of its internal construction. This is what one LSTM cell looks like:
Do not worry, you will never have to implement this from scratch :)
Again, it is just good to know this exists for future use.
LSTMs are not obsolete: they're still used in production systems where efficiency matters and sequences are modest in length. But for most language tasks, Transformers have superseded them.
The following is an illustration of what an entire LSTM looks like:
When you'd use them: time series forecasting, speech recognition (in resource-constrained settings), sensor data, any ordered sequence where Transformers would be overkill.
Autoencoders and VAEs
An autoencoder is trained to compress an input into a smaller representation (the latent space) and then reconstruct it back to the original. The bottleneck forces the network to learn the most essential features of the data.
Variational Autoencoders (VAEs) extend this by learning a probability distribution over the latent space rather than a fixed point. This makes the latent space continuous and structured, which means you can sample from it to generate new data, not just reconstruct existing inputs.
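Two small pieces capture most of what makes a VAE different from a plain autoencoder. The sketch below assumes the standard formulation, which this post does not spell out: the encoder outputs a mean and a log-variance per latent dimension, sampling uses the so-called reparameterisation trick, and a KL-divergence term pulls the latent distribution towards a standard normal.

```python
import math
import random

def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)   # fixed seed so the sketch is reproducible

# Hypothetical encoder output for one input: a 2-dimensional latent distribution.
mu, log_var = [1.0, 0.0], [0.0, 0.0]
z = sample_latent(mu, log_var, rng)          # a random point near mu
penalty = kl_to_standard_normal(mu, log_var) # 0.5 for these values
```

To generate new data, you skip the encoder, draw `z` from a standard normal, and run only the decoder.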
VAEs were an early serious approach to generative modelling and introduced ideas (latent space, encoder-decoder structure) that appear throughout modern AI.
When you'd use them: anomaly detection, data compression, generative modelling, representation learning, synthetic data generation.
GANs (Generative Adversarial Networks)
GANs are one of the most creative ideas in all of deep learning. The setup: two networks trained in opposition.
The generator produces fake data (images, audio, whatever the domain). The discriminator tries to tell real from fake. As training progresses, the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes. Each improves in response to the other.
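The tug-of-war shows up directly in the two loss functions. This sketch assumes the discriminator outputs a probability that a sample is real, and uses the common non-saturating form of the generator loss; the function names are made up for illustration.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples should score near 1, fakes near 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    """Non-saturating generator loss: the generator wants its fakes to score near 1."""
    return -math.log(d_fake)

# d_real and d_fake are the discriminator's outputs on a real and a generated sample.
sharp_d = discriminator_loss(d_real=0.9, d_fake=0.1)    # discriminator doing well
fooled_d = discriminator_loss(d_real=0.6, d_fake=0.4)   # discriminator less sure

winning_g = generator_loss(d_fake=0.9)   # generator fooling the discriminator
losing_g = generator_loss(d_fake=0.1)    # generator easily caught
```

The same quantity, `d_fake`, pushes the two losses in opposite directions, which is where both the power and the training instability come from.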
When they work, GANs produce extraordinarily realistic outputs. They dominated image synthesis for years, and the photorealistic faces you may have seen on sites like "This Person Does Not Exist" are GAN-generated.
They're notoriously difficult to train: mode collapse, training instability, and sensitivity to hyperparameters make them frustrating in practice. Diffusion models have largely superseded them for image generation, but the adversarial training concept remains influential.
When you'd use them: image synthesis, data augmentation, style transfer, domain adaptation.
Diffusion Models
Diffusion models are the approach behind Stable Diffusion, DALL-E 2, and many of the state-of-the-art image generators you've seen. The idea is elegant and counterintuitive.
Training: take real images and gradually add Gaussian noise until they're pure static. Teach the network to reverse this process, to predict and remove the noise at each step.
Generation: start with pure random noise and run the learned denoising process repeatedly until a coherent image emerges.
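The noising half of the process needs no neural network at all, so it is easy to sketch. The code below assumes a DDPM-style formulation with a linear noise schedule; the schedule values are commonly used defaults, not something specified in this post, and the "image" is just a two-number list.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n for x, n in zip(x0, noise)]
    return x_t, noise   # the network is trained to predict `noise` from (x_t, t)

rng = random.Random(0)             # fixed seed so the sketch is reproducible
alpha_bars = make_alpha_bars(1000)
x0 = [1.0, -1.0]                   # stand-in for a real image
x_early, _ = add_noise(x0, 0, alpha_bars, rng)          # still almost the original
x_late, late_noise = add_noise(x0, 999, alpha_bars, rng)  # almost pure noise
```

Generation runs the other way: start from random noise and repeatedly subtract the noise the trained network predicts.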
Diffusion models produce higher quality, more diverse outputs than GANs and train more stably. They're computationally heavier at inference time (many denoising steps required), but the quality improvement has made the trade-off worth it for most applications.
When you'd use them: image generation, video generation, audio synthesis, any high-quality generative task.
Personal Note
The last three (VAEs, GANs and diffusion models) are great for generative tasks, including synthetic data generation. Currently, diffusion models dominate the space thanks to their output quality and stable training.
In 2023 we published a research study, a collaboration between Samsung Advanced Institute of Health Science and Technology (SAIHST), Samsung Medical Center (SMC), Yonsei Severance Hospital and Google Cloud (USA), comparing all three for synthetic data generation in healthcare settings. If you are interested in the topic, click here.
The Transformer
Everything changed in 2017 when a Google paper titled "Attention Is All You Need" introduced the Transformer architecture. GPT, BERT, Whisper and most other major models of the last several years are built on it or derive from it directly, and even image generators such as DALL-E and Stable Diffusion rely on its attention mechanism.
The core innovation: self-attention. Instead of processing a sequence step by step (like an RNN), the Transformer processes all positions simultaneously and lets each position directly attend to every other position. Any two positions are connected in a single step, which largely removes the long-range dependency problem that plagued RNNs (at the cost of compute that grows quadratically with sequence length) and, critically, allows full parallelisation during training.
Self-Attention and Multi-Head Attention
Self-attention allows the model to weigh how relevant each word (or token) is to every other word when building a representation. In the sentence "The bank by the river was steep," the word "bank" needs to attend strongly to "river" to resolve its meaning correctly. Self-attention learns to do this.
Multi-head attention runs several self-attention operations in parallel, each learning to attend to different kinds of relationships simultaneously. One head might track syntactic structure; another might track semantic similarity. The outputs are combined and projected forward.
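Stripped of the learned parts, self-attention is a few lines of arithmetic. The sketch below implements scaled dot-product attention in plain Python on made-up 2-dimensional token vectors. In a real Transformer the queries, keys and values are learned linear projections of the token embeddings; here the raw vectors play all three roles to keep the example short.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention: each query takes a weighted average of the values."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical token vectors: tokens 0 and 2 point in similar directions, token 1 does not.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
outputs, weights = attention(tokens, tokens, tokens)
# weights[0] shows token 0 attending more to token 2 than to token 1.
```

Multi-head attention would run this several times with different learned projections and concatenate the results.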
Positional Encoding
Transformers have no built-in sense of order: on its own, self-attention treats the input as an unordered set of tokens (shuffle the input and the outputs are simply shuffled the same way). Positional encoding fixes this by adding information about each token's position in the sequence before it enters the network. The model learns to use this position signal to understand order, proximity, and structure.
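One concrete scheme is the sinusoidal encoding from the original Transformer paper, sketched below. Many modern models use learned or rotary position schemes instead, so treat this as one example rather than the only way to do it.

```python
import math

def positional_encoding(num_positions, d_model):
    """Sinusoidal encoding: even dimensions use sine, odd dimensions use cosine."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(d_model):
            # Each pair of dimensions oscillates at its own frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = positional_encoding(num_positions=4, d_model=8)
# pe[pos] is added element-wise to the embedding of the token at position pos.
```

Every position gets a distinct vector, so two identical words at different positions no longer look the same to the model.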
Encoder vs. Decoder vs. Encoder-Decoder
Not all Transformers are the same. There are three architectural variants:
- Encoder-only (e.g. BERT): reads the full sequence bidirectionally, building rich contextual representations. Best for tasks that require understanding: classification, named entity recognition, semantic search.
- Decoder-only (e.g. GPT): generates tokens one at a time, each attending only to previous tokens. Best for generation: writing, code, conversation.
- Encoder-decoder (e.g. T5, original Transformer): encodes an input sequence, then decodes an output sequence. Best for transformation tasks: translation, summarisation, question answering.
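In code, the difference between "reads bidirectionally" and "attends only to previous tokens" comes down to an attention mask. This is a minimal sketch with a made-up helper name; real implementations usually apply the mask by setting the blocked attention scores to a very large negative number before the softmax.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when position i is allowed to attend to position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)] for i in range(seq_len)]

encoder_mask = attention_mask(3, causal=False)  # BERT-style: every token sees every token
decoder_mask = attention_mask(3, causal=True)   # GPT-style: each token sees itself and earlier tokens
# decoder_mask == [[True, False, False],
#                  [True, True,  False],
#                  [True, True,  True]]
```

An encoder-decoder model uses both: an unmasked encoder over the input and a causally masked decoder over the output.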
Understanding which variant you're working with (and why it was chosen) is one of the most practically useful things you can know when working with modern AI.
What Comes Next
You now have a map of the deep learning landscape: the building blocks, the key architectures, when to use each, and why they exist. That's the conceptual foundation.
The practical path from here:
- Get hands-on with PyTorch or TensorFlow: implement a simple FFNN, then a CNN on image data. Seeing the training loop in code cements everything.
- Work through a sequence task: build or use an LSTM on a real time series dataset.
- Study the Transformer in depth: read "Attention Is All You Need" after you've built intuition. It will make sense now in a way it wouldn't have before.
- Explore modern applications: fine-tune a pretrained model, experiment with diffusion pipelines, build something that uses what you've learned.
If you're wondering where deep learning fits in the bigger picture (how it relates to machine learning and where Generative AI comes in) check out our AI learning roadmap for the full view.
The architecture names will start feeling familiar quickly. Build things. Break them. Figure out why. That's the actual learning.
If you want to learn more, we have more content in our blog here!