MustafaLSailor
Deep Learning

Deep learning is a subfield of machine learning (and thus of artificial intelligence) built on algorithms that attempt to mimic the way the human brain learns to process information. This type of learning is accomplished using systems called artificial neural networks.

Deep learning models consist of artificial neural networks with many layers (depth). Each layer takes the information from the previous layer, performs some calculations on it and passes the results to the next layer. This process continues until the output of the network is reached.

Deep learning generally relies on large amounts of labeled data. For example, a deep learning model can be trained on millions of images, each with a label indicating what it depicts. Using this data, the model learns patterns and structures in the images and can use this information to classify images it has never seen before.

Deep learning is used in many application areas such as voice recognition, image recognition, natural language processing and bioinformatics. Additionally, models such as GPT, Falcon, FLAN-T5 and Stable Diffusion also use deep learning techniques.

Neuron


In the Artificial Neural Networks (ANN) model, a "neuron" is a computational unit. It mimics the function of biological neurons in the real brain. Each neuron receives a set of inputs, performs a specific calculation on those inputs, and produces an output.

The functioning of a neuron in an ANN is generally as follows:

The neuron receives input from other neurons. Each input is usually multiplied by a "weight". Weights are parameters that the model adjusts during the learning process; they determine the impact of each input on the result.

The neuron takes the sum of all weighted inputs and usually adds a “bias” term to it. Bias is another parameter that the model adjusts during the learning process.

Finally, the neuron passes this sum to an “activation function.” The activation function determines what the neuron's output will be. It is generally a non-linear function and allows the model to learn complex patterns.

Neurons that work in this way are arranged in layers and connected to each other. The outputs of neurons in one layer are used as inputs of neurons in the next layer. This structure enables ANN to perform "deep" learning.
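
As a concrete illustration, here is a minimal sketch of such a neuron in Python with NumPy; the input values, weights, bias, and the choice of sigmoid as the activation function are arbitrary assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs plus the bias term
    z = np.dot(inputs, weights) + bias
    # Non-linear activation produces the neuron's output
    return sigmoid(z)

# Example: a neuron with two inputs (values chosen arbitrarily)
x = np.array([0.5, -1.2])
w = np.array([0.8, 0.3])
b = 0.1
print(neuron(x, w, b))  # a value between 0 and 1
```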

Activation Function


The main task of the activation function is to determine the output of an artificial neural network neuron. Activation functions are generally non-linear, and thanks to this feature, the neural network can model complex patterns and relationships.

Activation functions control under what circumstances a neuron's total input will 'activate' the neuron (i.e. produce output) and under what circumstances it will remain 'inactive'. This is, in a sense, the neuron's 'firing' mechanism, which lets the network learn the appropriate output for a given set of inputs.

Additionally, activation functions give the artificial neural network the ability to solve nonlinear problems. Without activation functions, the entire network could only perform a linear transformation, no matter how many layers it had. In that case, the model could not learn complex data sets and complex relationships.
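
This point can be checked directly: stacking two linear layers with no activation in between collapses into a single linear transformation, because the weight matrices simply multiply together. The matrices below are arbitrary examples.

```python
import numpy as np

W1 = np.array([[1.0, 2.0], [3.0, 4.0]])
W2 = np.array([[0.5, -1.0], [2.0, 1.5]])
x = np.array([1.0, -2.0])

# Two "layers" with no activation function in between...
two_layers = W2 @ (W1 @ x)

# ...are equivalent to a single linear layer with weights W2 @ W1
one_layer = (W2 @ W1) @ x

print(np.allclose(two_layers, one_layer))  # True
```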

For example, the ReLU (Rectified Linear Unit) activation function passes positive input values as they are, while negative values are reset to zero. The sigmoid activation function converts any input into a value between 0 and 1. These and other activation functions help the network solve different types of problems.
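
Both functions can be written in a few lines of NumPy; the sample inputs are just illustrative values.

```python
import numpy as np

def relu(z):
    # Passes positive values through unchanged, zeroes out negatives
    return np.maximum(0, z)

def sigmoid(z):
    # Maps any input to a value between 0 and 1
    return 1 / (1 + np.exp(-z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))     # [0.  0.  0.  0.5 2. ]
print(sigmoid(z))  # values strictly between 0 and 1
```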

AND and OR Gates


AND and OR gates are simple structures that represent logical operations, and these logical operations can be used to model the basic functioning of a neuron.

AND Gate: If both inputs are true (1), the output is true (1); in all other cases the output is false (0). A neuron can work as an AND gate if appropriate weights and bias are set: for example, equal weights for both inputs, a bias that puts the threshold above either input alone, and a step function (a function that outputs 1 when its input exceeds a certain threshold and 0 otherwise) as the activation function.

OR Gate: When at least one of the two inputs is true (1), the output is true (1); only when both inputs are false (0) is the output false (0). A neuron can work as an OR gate if appropriate weights and bias are set: for example, the same weights as before with a lower threshold, again using a step function as the activation function. The sketch below implements both gates.
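
As a minimal illustration, the following sketch implements both gates with a single neuron and a step activation; the particular weights (1 and 1) and biases (-1.5 for AND, -0.5 for OR) are just one of many workable choices.

```python
def step(z):
    # Outputs 1 if the input exceeds 0, otherwise 0
    return 1 if z > 0 else 0

def gate(x1, x2, w1, w2, bias):
    # One neuron: weighted sum plus bias, passed through the step activation
    return step(w1 * x1 + w2 * x2 + bias)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    and_out = gate(x1, x2, 1, 1, -1.5)  # fires only when both inputs are 1
    or_out = gate(x1, x2, 1, 1, -0.5)   # fires when at least one input is 1
    print(f"{x1} AND {x2} = {and_out}   |   {x1} OR {x2} = {or_out}")
```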

These simple logical operations form the basis of more complex neural network structures. However, neural networks used to solve real-world problems generally use much more complex structures and activation functions.

XOR Problem

(Figure: the AND, OR, and XOR problems plotted in order.)

The XOR problem is a classic problem that shows the limits of the learning capabilities of artificial neural networks (ANN). XOR represents a logical operation and outputs true (1) if the two inputs are different and false (0) if they are the same.

For example, the XOR operation is:

0 XOR 0 = 0
0 XOR 1 = 1
1 XOR 0 = 1
1 XOR 1 = 0

This is a problem that cannot be learned by a single-layer ANN because the XOR operation is not linearly separable. That is, on the plane of input values, no single straight line can be drawn to separate the 1 outputs from the 0 outputs.

However, a multilayer ANN (or deep learning model) can deal with the XOR problem. This is usually achieved with a hidden layer and appropriate activation functions. The hidden layer can transform the input data into a higher dimensional space, making the problem linearly separable. This is an example of ANN's ability to learn non-linear problems.
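
As a minimal sketch of this idea: XOR(x1, x2) can be decomposed as (x1 OR x2) AND (x1 NAND x2), so a hidden layer of two step-activated neurons (one computing OR, one computing NAND) followed by an AND output neuron solves it. The weights below are hand-picked for illustration rather than learned.

```python
def step(z):
    return 1 if z > 0 else 0

def xor(x1, x2):
    # Hidden layer: one neuron computes OR, the other computes NAND
    h_or = step(x1 + x2 - 0.5)
    h_nand = step(-x1 - x2 + 1.5)
    # Output layer: AND of the two hidden activations
    return step(h_or + h_nand - 1.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(f"{x1} XOR {x2} = {xor(x1, x2)}")
```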


In short, a single-layer ANN cannot solve problems that are not linearly separable.

How does an ANN learn?


Artificial neural networks (ANNs) typically carry out the learning process over a series of iterations or 'epochs'. At each epoch, the network receives input data, calculates outputs through a process called feedforward, calculates an error by comparing the output to the true value, and updates the weights and biases to minimize this error. This update process is usually done using an algorithm called backpropagation.

Here are the stages of the learning process of ANN:

Feedforward: The network takes the input data and calculates the outputs of the neurons in each layer. This starts by multiplying each neuron's inputs by weights and summing the results. This sum is then passed through an activation function and the output of the neuron is obtained. This process is repeated across all layers of the network.

Error Calculation: The output of the network is compared to the expected output and an error is calculated. This is usually done using a loss function. The loss function measures how 'wrong' the network's prediction is.

Backpropagation: The error is differentiated with respect to the weights and biases. These gradients show how strongly each parameter of the network increases or decreases the error, and therefore how each should be changed.

Weight Update: In the last step, the weights and biases are updated to minimize the error. This is usually done using an optimization algorithm, the most commonly used being the stochastic gradient descent (SGD) algorithm.

This process is repeated for the specified number of epochs or until another stopping criterion is met. As a result, the network has 'learned' to produce the expected outputs against the input data.
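
To make these four stages concrete, here is a minimal NumPy sketch that trains a single sigmoid neuron on a toy problem (the OR truth table); the data, learning rate, and epoch count are arbitrary choices for the example.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: the OR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0.0, 1.0, 1.0, 1.0])

rng = np.random.default_rng(0)
w = rng.normal(size=2)  # random initial weights
b = 0.0                 # initial bias
lr = 0.5                # learning rate

for epoch in range(5000):
    # 1. Feedforward: compute the network's predictions
    y_hat = sigmoid(X @ w + b)
    # 2. Error calculation: mean squared error loss
    loss = np.mean((y_hat - y) ** 2)
    # 3. Backpropagation: gradient of the loss w.r.t. w and b (chain rule)
    grad_z = 2 * (y_hat - y) * y_hat * (1 - y_hat) / len(y)
    grad_w = X.T @ grad_z
    grad_b = grad_z.sum()
    # 4. Weight update: step in the negative gradient direction
    w -= lr * grad_w
    b -= lr * grad_b

print(np.round(sigmoid(X @ w + b)))  # should approach [0. 1. 1. 1.]
```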

Gradient Descent


Gradient Descent is an optimization algorithm used to optimize the parameters (weights and biases) in an artificial neural network (ANN). This algorithm aims to find the minimum of an error or loss function.

The basic idea of Gradient Descent is to calculate the derivative (or gradient) of the loss function and "descend" towards the function's minimum by taking steps in the negative direction of this gradient. The gradient points in the direction in which the function increases fastest, so moving in the negative direction decreases the function fastest.

The steps of the Gradient Descent algorithm are as follows:

Start with random initial values (weights and biases).
Calculate the loss function.
Calculate the gradient (derivative) of the loss function.
Update the weights and biases by taking a step in the negative direction of the gradient.
Repeat steps 2-4 for a specified number of iterations or until the loss function drops below a certain value.
At the end of this process, the parameters of the ANN are fitted to the data, and the network has "learned" to produce the expected outputs for the input data.
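
As a minimal illustration of these steps, here is gradient descent minimizing a simple one-dimensional loss, f(w) = (w - 3)^2; the function, learning rate, and starting point are arbitrary examples.

```python
def loss(w):
    return (w - 3) ** 2   # minimum at w = 3

def gradient(w):
    return 2 * (w - 3)    # derivative of the loss

w = 10.0    # 1. arbitrary starting value
lr = 0.1    # step size (learning rate)

for _ in range(100):
    g = gradient(w)       # 3. compute the gradient
    w -= lr * g           # 4. step in the negative gradient direction

print(w)  # close to 3, the minimum of the loss
```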

SGD, Batch GD, and Mini-Batch Gradient Descent

The terms Stochastic, Batch, and Mini-Batch refer to different variations of Gradient Descent. These variations differ in how much data is used for each weight update.

Stochastic Gradient Descent (SGD): In this method, the weights are updated for each training sample; that is, only one sample is used per update. This generally makes learning faster, but it can introduce more noise (i.e. noisier updates) because each update is based on a single sample.

Batch Gradient Descent: In this method, the entire training set is used for each update; that is, the weights are updated once per pass over all the data. This generally gives a more stable learning process with less noise, but the computational cost per update is higher and the whole data set must be held in memory.

Mini-Batch Gradient Descent: This method strikes a balance between SGD and Batch Gradient Descent. Each update uses a small subset (a "mini-batch") of the training set, combining much of the speed of SGD with the stability of Batch Gradient Descent. Mini-batch sizes usually range from 10 to 1000.

Which method to use often depends on the application and the characteristics of the data set.

Example

When Stochastic Gradient Descent (SGD) is used, the weights are updated for each sample in the training set at each epoch. So, if you have 1000 samples in your training set and set the number of epochs to 50, the model's weights will be updated 50,000 times in total (1000 updates per epoch × 50 epochs).

This is one of the reasons why SGD provides faster learning compared to other Gradient Descent variations. However, this rapid learning can often lead to more noise (i.e., more errors) because information from only a single sample is used at each step. For this reason, SGD is often preferred when working with large data sets or when a rapid prototype needs to be created.

In Batch Gradient Descent, by contrast, a single update is made over the entire training set at each epoch. So, if you have 1000 samples in your training set and set the epoch number to 50, the model's weights will be updated only 50 times in total.

This is one of the biggest advantages of Batch Gradient Descent because this method generally gives more stable results and introduces less noise (i.e. less error). However, this method usually learns slower and requires more memory because the information of the entire training set is used at each step. Therefore, Batch Gradient Descent is often impractical when working with very large data sets.

Instead, Mini-Batch Gradient Descent is often used. This method provides a balance between Stochastic and Batch Gradient Descent. In Mini-Batch Gradient Descent, multiple updates are made per epoch, one for each batch of the specified size. For example, if you set the batch size to 32 and your training set has 1000 samples, you get 1000/32 ≈ 31 full batches per epoch (plus one smaller final batch), as counted in the sketch below. This provides both fairly rapid learning and more stable results.
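
For concreteness, here is a small sketch that just counts the number of weight updates for each variant under the numbers used above (1000 samples, 50 epochs, batch size 32); this is bookkeeping only, not a training loop.

```python
n_samples = 1000
epochs = 50
batch_size = 32

# Stochastic GD: one update per sample per epoch
sgd_updates = n_samples * epochs                          # 50,000

# Batch GD: one update per epoch over the full set
batch_updates = epochs                                    # 50

# Mini-batch GD: ceil(1000 / 32) = 32 updates per epoch
# (31 full batches plus one smaller final batch)
minibatch_updates = -(-n_samples // batch_size) * epochs  # 1,600

print(sgd_updates, batch_updates, minibatch_updates)
```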

Forward Propagation and Backpropagation

Forward Propagation: Forward propagation refers to the flow of data from the input layer to the output layer in a neural network model. Each neuron takes the weighted sum of the inputs it receives and applies an activation function. This value is transferred to the next layer. This process continues until it reaches the last layer of the network. At the end of forward propagation, the estimated output of the model is obtained.

Backpropagation: Backpropagation forms the basis of the learning process of a neural network model. The error (loss) between the model's predicted output and the actual output is calculated, and this error is used to update the weights of each neuron by passing the network backwards. This process involves calculating derivatives and using the chain rule to determine how much error is contributed by each neuron. The backpropagation process is used to minimize the error rate and improve the performance of the model.

These two processes form the basis of the training cycle of a neural network model. With forward propagation, the model makes a prediction, with backpropagation, the model evaluates how good this prediction is and updates the weights based on this information. This process is repeated for a specified number of epochs or until a specific stopping criterion is met.
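
As a minimal sketch of one such cycle, here is a forward and backward pass through a single sigmoid neuron with the chain rule written out explicitly; the input, target, and initial parameters are arbitrary values.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

x, target = 1.5, 1.0  # one training sample (arbitrary values)
w, b = 0.2, 0.0       # initial parameters
lr = 0.1              # learning rate

# Forward propagation: input -> weighted sum -> activation -> loss
z = w * x + b
y_hat = sigmoid(z)
loss = (y_hat - target) ** 2

# Backpropagation: chain rule from the loss back to w and b
dloss_dyhat = 2 * (y_hat - target)   # d(loss)/d(y_hat)
dyhat_dz = y_hat * (1 - y_hat)       # derivative of the sigmoid
grad_w = dloss_dyhat * dyhat_dz * x  # dz/dw = x
grad_b = dloss_dyhat * dyhat_dz      # dz/db = 1

# Weight update: one gradient descent step
w -= lr * grad_w
b -= lr * grad_b
print(loss, w, b)
```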
