activation functions, part 2: tanh, relu, leaky relu for hidden layers in dl
in neural networks, one crucial component is often overlooked: the activation function. in the first part, we explored the sigmoid function, which is mostly used in output layers for classification. this time, we will focus on two of the most widely used activation functions in hidden layers: tanh and ReLU. these functions play a key role in helping neural networks learn complex patterns, making them essential for deep learning models across various tasks.
a reminder of why the gradient matters and why it vanishes with the sigmoid
before going into the details of the functions, we need to understand why some activations cause problems in training neural networks, especially in deep architectures.
neural networks are trained with gradient descent: we adjust the weights little by little to reduce the error and improve predictions. this means we calculate how the error changes depending on the weights and nudge them in the right direction.
concretely, we compute the derivative of the loss function with respect to the weights to determine in which direction to adjust them. the problem is that certain activation functions, like sigmoid and tanh, have very small derivatives for extreme input values (large positive or negative inputs, where the output saturates near 0 or 1), as you may remember.
the sigmoid is useful for producing probabilities, but not for learning in deep hidden layers (hello, disappearing gradient 👋): this is the vanishing gradient problem.
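to make this concrete, here is a minimal sketch of one gradient descent step on a single weight (a toy linear model with a squared-error loss; the data and learning rate are made up for illustration):

```python
# one gradient descent step on a single weight
# (toy example: linear model y = w * x with a squared-error loss;
#  the example values and the learning rate are made up)
x, y_true = 2.0, 10.0          # one training example
w = 1.0                        # initial weight
lr = 0.1                       # learning rate (hyperparameter, chosen arbitrarily)

y_pred = w * x                 # forward pass
loss = (y_pred - y_true) ** 2  # squared error
grad = 2 * (y_pred - y_true) * x   # d(loss)/dw via the chain rule

w = w - lr * grad              # adjust the weight "a little in the right direction"
print(loss, grad, w)
```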
the sigmoid function is defined by:
formula of the sigmoid :
σ(x) = 1 / (1 + e⁻ˣ)
the problem with the sigmoid is that it compresses extreme values too much.
look at its derivative (the slope of its curve): it becomes very small for extreme input values, where the output is close to 0 or 1.
formula of the derivative :
σ'(x) = σ(x) (1 − σ(x))
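to see how small this derivative actually gets, here is a quick numpy sketch (not a framework implementation, just the two formulas above evaluated at a few points):

```python
import numpy as np

def sigmoid(x):
    # sigmoid(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)), maximal at x = 0 (value 0.25)
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [-10.0, -5.0, 0.0, 5.0, 10.0]:
    print(x, sigmoid(x), sigmoid_derivative(x))
# at x = 0 the derivative is 0.25; at x = ±10 it is ~4.5e-5,
# so a saturated neuron passes almost no gradient backwards
```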
in a deep network, we must propagate this gradient through several layers via backpropagation. if each layer reduces the gradient intensity a bit more (because the sigmoid “squashes” values towards 0 and 1), then after several layers, the gradient becomes almost zero.
consequence:
- the first layers learn almost nothing,
- optimization is very slow,
- the network can stagnate and never converge.
that’s why the sigmoid is rarely used in deep networks, except for specific tasks like binary classification (in the last layer).
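to see the effect of depth concretely, here is a deliberately simplified toy simulation: we multiply a gradient by the best-case sigmoid derivative (0.25) once per layer. real backpropagation also multiplies by the weights, but the geometric shrinking is the point:

```python
# toy illustration of the vanishing gradient: during backpropagation the
# gradient is multiplied by the activation derivative at every layer.
# the sigmoid derivative is at most 0.25, so even in the best case the
# gradient shrinks geometrically with depth (weights are ignored here
# to keep the sketch simple).
grad = 1.0
max_sigmoid_derivative = 0.25

for layer in range(1, 21):
    grad *= max_sigmoid_derivative
    if layer in (1, 5, 10, 20):
        print(f"after {layer} layers: gradient factor ~ {grad:.2e}")
# after 10 layers the factor is ~9.5e-07, after 20 layers ~9.1e-13:
# the first layers receive essentially no learning signal
```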
tanh: an improved version of the sigmoid
tanh is the same idea as sigmoid… but better.
the tanh function (hyperbolic tangent) is similar to the sigmoid but transforms input values into a range between -1 and 1, instead of 0 and 1.
formula of tanh :
tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
formula of the derivative :
tanh’(x) = 1 − tanh²(x)
concretely, its outputs are centered around 0 (instead of lying between 0 and 1). this keeps the activations better normalized, which helps make learning more stable.
indeed, when the mean of activations is close to zero, weight updates are more efficient. but as soon as we reach extreme values (very large or very small), the hyperbolic tangent tends toward -1 or 1, meaning its derivative becomes close to zero.
as a result, the problem persists, even if it is less severe than with the sigmoid.
tanh is therefore a better alternative to the sigmoid but is not ideal for deep networks.
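here is a small numpy comparison of the two, assuming a handful of hand-picked inputs: tanh outputs are centered around 0, but its derivative still collapses for large |x|:

```python
import numpy as np

# compare tanh and sigmoid on the same inputs: tanh outputs are centered
# around 0, but its derivative (1 - tanh(x)^2) still collapses toward 0
# for large |x|, so saturation remains an issue.
x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])

sigmoid = 1.0 / (1.0 + np.exp(-x))
tanh = np.tanh(x)
tanh_derivative = 1.0 - tanh ** 2   # tanh'(x) = 1 - tanh^2(x)

print("x:        ", x)
print("sigmoid(x):", sigmoid)           # squashed into (0, 1)
print("tanh(x):   ", tanh)              # squashed into (-1, 1), centered on 0
print("tanh'(x):  ", tanh_derivative)   # ~8.2e-09 at x = ±10
```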
relu: the revolution of deep learning
if you look at modern models, relu (rectified linear unit) is everywhere. it is the most used activation function in deep learning.
it is incredibly simple:
formula of relu :
relu(x) = max(0,x)
in other words:
- if x>0, then relu(x) = x
- if x≤0, then relu(x) = 0
if your input is negative, relu returns 0. otherwise, it returns the same value. no complicated fractions, no exponentials.
its derivative is just as simple:
formula of the derivative :
relu'(x) = 1 if x > 0, 0 otherwise (the derivative at exactly 0 is undefined; 0 is used by convention)
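here is a minimal numpy sketch of relu and its gradient (deep learning frameworks of course ship their own versions; this is just to show how simple the operation is):

```python
import numpy as np

def relu(x):
    # relu(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_derivative(x):
    # relu'(x) = 1 if x > 0, else 0 (the derivative at exactly 0 is
    # undefined; by convention we return 0 there)
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0. 0. 0. 0.5 3.]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
# for positive inputs the gradient stays at 1, so it does not shrink
# layer after layer the way sigmoid/tanh gradients do
```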
limits of relu
obviously, there is also a downside: neurons can “die.”
if a neuron always receives negative values, it will always return zero and will never learn again.
this is called the dying relu problem.
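here is a toy illustration of a dead neuron, assuming a single neuron whose pre-activation w*x + b is negative for every input (for example because of a large negative bias; all the values are made up):

```python
import numpy as np

# toy illustration of a "dead" relu neuron: if the pre-activation
# w*x + b is negative for every input in the data, the output is always 0
# and so is the gradient, so the weights stop being updated.
rng = np.random.default_rng(0)
inputs = rng.normal(size=1000)     # some standard-normal input values

w, b = 0.5, -10.0                  # large negative bias (made-up values)
pre_activation = w * inputs + b    # always well below 0 here

output = np.maximum(0.0, pre_activation)
gradient_mask = (pre_activation > 0).astype(float)

print(output.max())          # 0.0  -> the neuron never fires
print(gradient_mask.sum())   # 0.0  -> no gradient ever flows back
```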
to avoid this issue, there are some variants that we will cover in detail in the next article. some examples (sketched in the code after their formulas below):
- leaky relu: instead of returning 0 for negative inputs, a small fraction of the value (typically 0.01x) is allowed to pass through, so neurons with negative inputs can still learn, even slowly.
- prelu or parametric relu: instead of having a fixed coefficient like 0.01 in leaky relu, this coefficient is learned during training.
formulas of their derivatives :
leaky relu'(x) = 1 if x > 0, 0.01 if x ≤ 0
prelu'(x) = 1 if x > 0, α if x ≤ 0 (where α is learned during training)
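and as a preview, a hedged numpy sketch of leaky relu, plus a prelu-style call where the negative slope α is just a hand-picked value (in a real framework it would be a learned parameter updated by gradient descent):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # leaky relu: keep x for positives, let a small fraction (alpha * x)
    # through for negatives so the gradient is never exactly 0
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    # derivative: 1 for x > 0, alpha otherwise
    return np.where(x > 0, 1.0, alpha)

# prelu uses the same formula, but alpha is learned during training;
# here it is just a hand-picked value for illustration.
alpha_learned = 0.1

x = np.array([-4.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))                        # [-0.04 -0.01  0.    2.  ]
print(leaky_relu_derivative(x))             # [0.01 0.01 0.01 1.  ]
print(leaky_relu(x, alpha=alpha_learned))   # prelu-style with alpha = 0.1
```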
conclusion on relu
- it makes the network more sparse (many activations are exactly zero), which sometimes helps generalization.
- it does not saturate: unlike the sigmoid and tanh, relu does not squash positive values. as a result, the gradient remains significant, allowing much faster learning.
- it is very efficient: no need to calculate exponentials or fractions. relu is just a comparison between a number and zero.
- it avoids the vanishing gradient problem for positive values. as long as a neuron receives positive inputs, it keeps learning.
which to choose between tanh and relu?
👉 if you have a simple or shallow network (2–3 layers), tanh can be a good choice because it handles negative values better.
👉 if you are doing deep learning (cnn, transformers, etc.), relu is clearly the standard because it avoids the vanishing gradient problem and is faster and more efficient.
but be careful: depending on the nature of the dataset, variants of relu (like prelu or leaky relu) can be better alternatives if you want to avoid some neurons dying. we will explore them in the next article.
that’s it! now you know why relu dominates deep learning and how tanh is still useful for some cases.
Sirine Amrane