# The Perfect Activation

Pol Monroig Company
Love teaching robots

It might be too bold to call an activation function perfect, given that the No Free Lunch Theorem of machine learning states that there is no universally perfect machine learning algorithm. Nevertheless, misleading as the title may be, I will summarize the most widely used activation functions and describe their main differences.

# Linear (identity)

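The linear activation function is essentially no activation at all: the output equals the input.
Overhead: the fastest option, since there is nothing to compute.
Performance: bad, since it does not enable a non-linear transformation. A stack of linear layers is equivalent to a single linear layer.

Advantages:

• Differentiable at all points
• Fast execution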

Common issues:

• Does not provide any non-linear output.
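To see why a linear activation adds nothing, here is a minimal sketch in plain Python. It uses a one-dimensional "layer" and made-up weights (all values are hypothetical) to show that two stacked linear layers compute exactly the same function as one merged linear layer.

```python
def linear_layer(x, w, b):
    """A 1-D 'layer' with identity activation: just w * x + b."""
    return w * x + b

# Hypothetical weights and biases for two stacked layers.
w1, b1 = 2.0, 1.0
w2, b2 = -3.0, 0.5

def two_layers(x):
    return linear_layer(linear_layer(x, w1, b1), w2, b2)

# The same function written as ONE linear layer.
w_merged = w2 * w1
b_merged = w2 * b1 + b2

def one_layer(x):
    return linear_layer(x, w_merged, b_merged)

for x in (-2.0, 0.0, 3.5):
    assert abs(two_layers(x) - one_layer(x)) < 1e-12
```

No matter how many such layers are stacked, the result can always be merged this way, which is why a non-linear activation is needed between layers.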

# Sigmoid

The Sigmoid activation function is one of the oldest. It was originally designed to mimic neuron activations in the brain, and it has been shown to perform poorly in the hidden layers of artificial neural networks. Nevertheless, it is still commonly used at a classifier's output to turn raw scores into class probabilities.

Uses: commonly used in the output layer of a binary classifier, where we need a probability between 0 and 1.
Overhead: expensive compared to ReLU because of the exponential term.
Performance: bad on hidden layers, mostly used on output layers.

Advantages:

• Outputs are bounded between 0 and 1, so values won't explode.
• It is differentiable at every point.

Common issues:

• It saturates: for inputs of large magnitude the output flattens near 0 or 1 and the gradient becomes close to zero, which slows learning.
• Outputs are always positive (zero-centered functions help convergence).

Code:

# PyTorch (module; call the instance on a tensor)
torch.nn.Sigmoid()
# TensorFlow (function; pass the input tensor x)
tf.keras.activations.sigmoid(x)
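To make the saturation issue concrete, here is a small sketch of the Sigmoid and its derivative in plain Python (not the framework implementations). The helper names `sigmoid` and `sigmoid_grad` are my own, and the sample inputs are arbitrary.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)

# The gradient is largest at x = 0 and shrinks quickly as |x| grows.
for x in (-10.0, -2.0, 0.0, 2.0, 10.0):
    print(f"x={x:6.1f}  sigmoid={sigmoid(x):.5f}  grad={sigmoid_grad(x):.5f}")
```

The gradient peaks at 0.25 for x = 0, so even in the best case each Sigmoid layer scales the backpropagated signal down, and far from zero it almost vanishes.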

# Softmax

A generalization of the Sigmoid function to more than two classes: it transforms a vector of scores into one probability per class.
Uses: used in the output layer of a multiclass neural network.
Performance: bad on hidden layers, mostly used on output layers.

Advantages:

• Unlike independent Sigmoid outputs, its outputs are each between 0 and 1 and also sum to 1, so they form a probability distribution over the classes.

Common issues:

• Same as Sigmoid.

Code:

# PyTorch (dim is the axis holding the class scores)
torch.nn.Softmax(dim=...)
# TensorFlow (function; pass the input tensor x)
tf.keras.activations.softmax(x)
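As an illustration of how Softmax relates to Sigmoid, here is a minimal plain-Python sketch. The function names and the sample scores are assumptions made for the example. It subtracts the maximum score before exponentiating, a common trick for numerical stability.

```python
import math

def softmax(logits):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-12   # outputs form a probability distribution

# With two classes, softmax reduces to a sigmoid of the score difference.
a, b = 1.3, -0.4
assert abs(softmax([a, b])[0] - sigmoid(a - b)) < 1e-12
```

The second check is the sense in which Softmax generalizes Sigmoid: with only two classes, the probability of the first class is the Sigmoid of the difference between the two scores.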

# Hyperbolic Tangent

The Tanh function has the same S shape as Sigmoid. In fact, it is a rescaled and shifted Sigmoid, tanh(x) = 2·sigmoid(2x) − 1, and it works better in most cases.
Uses: generally used in hidden layers, since its outputs lie between -1 and 1 and are centered around zero, which makes learning faster.
Overhead: expensive compared to ReLU, since it uses exponential terms.
Performance: similar to Sigmoid but with some added benefits.

Advantages:

• Outputs are bounded between -1 and 1, so values won't explode.
• It is differentiable at every point.
• It is zero-centered, unlike Sigmoid.

Common issues:

• Like Sigmoid, it saturates for inputs of large magnitude, so gradients can vanish.

Code:

# PyTorch (module; call the instance on a tensor)
torch.nn.Tanh()
# TensorFlow (function; pass the input tensor x)
tf.keras.activations.tanh(x)
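The relationship between Tanh and Sigmoid can be checked numerically. The sketch below is plain Python and the helper name `tanh_from_sigmoid` is my own. It rebuilds Tanh from a Sigmoid that is stretched to the range (-1, 1) and evaluated at 2x, then compares it with `math.tanh`.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_from_sigmoid(x):
    """tanh expressed as a rescaled, shifted sigmoid: 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

for x in (-3.0, -0.5, 0.0, 0.5, 3.0):
    assert abs(tanh_from_sigmoid(x) - math.tanh(x)) < 1e-12

# Zero-centered: tanh(0) is 0, while sigmoid(0) is 0.5.
print(math.tanh(0.0), sigmoid(0.0))
```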

# ReLU

ReLU, short for rectified linear unit, is one of the most commonly used activations, both for its computational efficiency and its great performance. It outputs max(0, x). Multiple variations have been created to address its flaws.
Uses: a strong default for hidden layers, as it usually performs better than Tanh and Sigmoid and is cheaper to compute.
Performance: great performance, recommended for most cases.

Advantages:

• Adds non-linearity to the network.
• Mitigates the vanishing gradient problem: the gradient is exactly 1 for every positive input.
• Does not saturate for positive inputs.

Common issues:

• It suffers from the dying ReLU problem: a unit whose input is negative for every example outputs 0, receives no gradient and stops learning.
• Not differentiable at x = 0.

Code:

# PyTorch (module; call the instance on a tensor)
torch.nn.ReLU()
# TensorFlow (function; pass the input tensor x)
tf.keras.activations.relu(x)
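Here is a rough sketch of the dying ReLU problem in plain Python. The single unit, its weight, its bias and the inputs are all hypothetical values chosen so that the pre-activation is negative for every input. The gradient at x = 0 is set to 0 by convention, which is an assumption of this sketch.

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU; at x = 0 we pick 0 by convention (a subgradient)."""
    return 1.0 if x > 0 else 0.0

# A hypothetical single unit: pre-activation = w * x + b.
w, b = 0.5, -10.0          # large negative bias, chosen to illustrate a "dead" unit
inputs = [-1.0, 0.0, 2.0, 4.0]

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
# Gradient of the output w.r.t. w is relu_grad(z) * x.
grads_w = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

assert outputs == [0.0, 0.0, 0.0, 0.0]
assert all(g == 0.0 for g in grads_w)   # no gradient signal, so w never updates
```

Because every gradient is zero, gradient descent cannot move the weight back into a useful range, and the unit stays at zero output.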

# Leaky ReLU

ReLU suffers from the dying ReLU problem, because every negative input is mapped to 0 and its gradient is 0 as well. Leaky ReLU reduces the problem by replacing the 0 output with the input multiplied by a very small slope.
Uses: used in hidden layers.
Performance: great performance if the hyperparameter is chosen correctly.

Advantages:

• Similar to ReLU, and it mitigates dying ReLU because negative inputs keep a small gradient.

Common issues:

• New hyperparameter to tune.

Code:

# PyTorch
torch.nn.LeakyReLU(negative_slope=...)
# TensorFlow (the argument is named negative_slope in Keras 3)
tf.keras.layers.LeakyReLU(alpha=...)
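A minimal plain-Python sketch of Leaky ReLU and its gradient follows. The slope of 0.01 is only an illustrative choice for the hyperparameter, and the function names are my own.

```python
def leaky_relu(x, negative_slope=0.01):
    """Leaky ReLU: identity for x > 0, a small linear slope for x <= 0."""
    return x if x > 0 else negative_slope * x

def leaky_relu_grad(x, negative_slope=0.01):
    """Gradient: 1 for positive inputs, the small slope otherwise."""
    return 1.0 if x > 0 else negative_slope

z = -10.0
print(leaky_relu(z))        # a small negative value instead of 0
print(leaky_relu_grad(z))   # nonzero, so the unit can still learn
```

Since the gradient for negative inputs is small but never zero, a unit that drifts into the negative region can still recover.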

# Parametric ReLU

It takes the same idea as Leaky ReLU, but instead of predefining the slope as a hyperparameter, it makes the slope a parameter that is learned during training.
Uses: used in hidden layers.
Overhead: a new parameter must be learned for each PReLU in the network.
Performance: similar to Leaky ReLU; the learned slope can help, but it is not guaranteed to.

Advantages:

• Removes the need to tune a hyperparameter by hand.

Common issues:

• The learned parameter is not guaranteed to be optimal, and learning it adds overhead, so you might as well try a few values yourself with Leaky ReLU.

Code:

# PyTorch (the constructor takes no input tensor; call the instance on x)
torch.nn.PReLU()
# TensorFlow
tf.keras.layers.PReLU()
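To illustrate what "learning the slope" means, here is a toy sketch in plain Python. It is not how the frameworks implement PReLU. The starting slope, the single training point, the target and the learning rate are all hypothetical, and the slope is updated with plain gradient descent on a squared error.

```python
def prelu(x, a):
    """PReLU: like Leaky ReLU, but the slope `a` is a trainable parameter."""
    return x if x > 0 else a * x

def prelu_grad_a(x):
    """d prelu / d a: only non-positive inputs contribute."""
    return 0.0 if x > 0 else x

# Toy setup (all values hypothetical): fit `a` so that prelu(x, a) matches
# a target on one negative input.
a = 0.25
x, target = -2.0, -1.0      # the target corresponds to a slope of 0.5
lr = 0.1

for _ in range(50):
    error = prelu(x, a) - target
    a -= lr * 2.0 * error * prelu_grad_a(x)   # gradient of error**2 w.r.t. a

assert abs(a - 0.5) < 1e-9
```

Note that positive inputs never change the slope, since their gradient with respect to `a` is zero.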

# ELU

The ELU (exponential linear unit) was introduced as another alternative to fix the issues that you can encounter with ReLU.
Uses: used in hidden layers.
Overhead: computationally expensive, since it uses an exponential term.
Performance: similar to ReLU in hidden layers, at a higher computational cost.

Advantages:

• Similar to ReLU.
• Produces negative outputs.
• Bends smoothly, unlike Leaky ReLU.
• Differentiable at x = 0 (with the default alpha = 1).

Common issues:

Code:

# PyTorch (module; call the instance on a tensor)
torch.nn.ELU()
# TensorFlow (function; pass the input tensor x)
tf.keras.activations.elu(x)
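The properties listed for ELU can be checked with a short plain-Python sketch. The function names are my own, and `alpha` is the scale of the negative branch, set to 1 here as an assumption.

```python
import math

def elu(x, alpha=1.0):
    """ELU: identity for x > 0, alpha * (exp(x) - 1) for x <= 0."""
    return x if x > 0 else alpha * math.expm1(x)

def elu_grad(x, alpha=1.0):
    """Gradient: 1 for positive inputs, alpha * exp(x) otherwise."""
    return 1.0 if x > 0 else alpha * math.exp(x)

# Negative inputs give negative outputs that level off at -alpha.
assert -1.0 < elu(-5.0) < 0.0
# With alpha = 1 the slope is continuous at 0 (1 on both sides) ...
assert elu_grad(0.0) == 1.0
# ... but with another alpha the function has a kink there, like Leaky ReLU.
assert elu_grad(0.0, alpha=0.5) == 0.5
```

So the smooth bend at x = 0 depends on the choice of alpha, while the negative outputs are present for any positive alpha.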

# Other alternatives

There are too many activation functions to cover them all in a single post. Here are some others:

• SELU
• GELU
• CELU
• Swish
• Mish
• Softplus

Note: if a name ends with LU ("linear unit"), the function is usually a variation on ReLU.

# Summary

So, with so many choices, which activation should we use? As a rule of thumb, start with ReLU in the hidden layers, as it offers great performance with minimal computational overhead. After that (if you have enough computing power) you might want to try some of the more complex variations of ReLU or similar alternatives. I would never recommend using Sigmoid, Tanh or Softmax for any hidden layer. Sigmoid and Softmax should be used whenever we want probability outputs for a classification task. Finally, with the current progress and research in deep learning and AI, new and better functions will surely appear, so keep an eye out.
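The rule of thumb can be sketched as a tiny forward pass in plain Python: ReLU in the hidden layer and Sigmoid at the output of a binary classifier. The `dense` helper, the layer sizes and all the weights are hypothetical and hand-picked for the example. A real model would use a framework and learn its weights.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W @ inputs + b), in plain Python."""
    return [
        activation(sum(w * v for w, v in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical, hand-picked parameters for a 2 -> 3 -> 1 binary classifier.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]]
b1 = [0.0, -0.5, 0.1]
W2 = [[1.0, -2.0, 0.5]]
b2 = [0.2]

def predict(x):
    hidden = dense(x, W1, b1, relu)            # ReLU in the hidden layer
    return dense(hidden, W2, b2, sigmoid)[0]   # Sigmoid turns the score into a probability

p = predict([1.0, 2.0])
assert 0.0 < p < 1.0
```

Swapping `relu` for another hidden activation only requires passing a different function to `dense`, which makes this kind of experiment cheap to run.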

Remember to always try and experiment; you never know which function will work better for a specific task.