activation functions, part 3: linear and softplus for regression output layer in dl
in deep learning, choosing the right activation function for the output layer is essential. the linear activation is the standard choice for regression because it allows unrestricted predictions, making it ideal for financial forecasts and other continuous outputs. softplus, a smoother alternative to relu, ensures strictly positive outputs while avoiding abrupt cutoffs, which makes it useful for predicting quantities like volatility or energy consumption.
1. no activation function (linear activation)
it is THE standard choice for dl models applied to regression (LSTMs, CNNs, MLPs…). in practice, a linear activation means that no transformation is applied to the output, which is equivalent to not using any activation function in the output layer.
formula:
f(x)=x
example:
- if the last layer of your network is a neuron with a linear activation, then its output is simply: y = Wx + b.
without any additional activation, the model can predict any real value.
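to make this concrete, here is a minimal pytorch sketch of a regression model whose output layer is a plain linear layer (the feature count, layer sizes, and dummy data are illustrative assumptions):

```python
import torch
import torch.nn as nn

# hypothetical regression mlp: hidden layers use relu,
# the output layer is a plain nn.Linear with no activation,
# so predictions can be any real value (positive, negative, or zero)
model = nn.Sequential(
    nn.Linear(10, 64),   # 10 input features (illustrative)
    nn.ReLU(),
    nn.Linear(64, 1),    # output layer: y = Wx + b, nothing applied on top
)

x = torch.randn(4, 10)       # a batch of 4 dummy samples
print(model(x).squeeze())    # unbounded real-valued predictions
```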
when to use it?
- standard regression where the target variable is continuous and unbounded, meaning it can take any real value (e.g., prices, temperature, financial values…).
- deep learning models for time series regression (e.g., economic forecasts).
why use it?
as explained above, a linear activation in the output layer imposes no constraints: the predictions remain in an unrestricted range, so the output value is not artificially limited to a specific interval and can take any real value (positive, negative, or zero).
this is in contrast to other activation functions such as sigmoid, which restricts outputs to the interval (0, 1), or relu, which forces outputs to be non-negative. with a linear activation there are no such limits, avoiding unnecessary transformations of the data.
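a quick pytorch sketch of that contrast (the sample values are illustrative):

```python
import torch

z = torch.tensor([-3.0, -0.5, 0.0, 2.0])  # example pre-activations

print(z)                  # linear / identity: unchanged, any real value
print(torch.sigmoid(z))   # squashed into (0, 1)
print(torch.relu(z))      # negative values clipped to 0
```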
2. softplus activation
it is often used as an alternative to relu because it avoids the non-differentiable kink at x = 0 while keeping the output strictly positive. unlike relu, which truncates all negative values to zero, softplus provides a smoother transition for negative inputs, which can help gradient propagation in a neural network.
formula:
f(x) = ln(1+e^x)
example:
you want to predict the volatility of a financial asset (which is always positive).
if you used a linear activation, the model could predict negative values, which would not make sense.
by using softplus, you ensure that the prediction remains positive while avoiding the abrupt cutoffs of relu.
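a minimal pytorch sketch of such a positive-only output layer (the feature count, layer sizes, and dummy data are illustrative assumptions):

```python
import torch
import torch.nn as nn

# hypothetical volatility-style regression head: softplus on the output
# guarantees strictly positive predictions with a smooth transition
model = nn.Sequential(
    nn.Linear(20, 32),   # 20 input features (illustrative)
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Softplus(),       # f(x) = ln(1 + e^x) > 0 for every x
)

x = torch.randn(4, 20)       # a batch of 4 dummy samples
print(model(x).squeeze())    # every prediction is strictly positive
```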
when to use it?
- when the output must be strictly positive.
- when you want to avoid relu's hard cutoff (and zero gradient) for negative values.
- in models where smooth, well-behaved gradients matter for more stable convergence.
why use it?
✅ always positive output: if the target variable cannot be negative (e.g., volumes, prices, energy consumption).
✅ avoids the dead neuron problem: unlike relu, where neurons can be completely deactivated for x < 0, softplus keeps a nonzero gradient.
✅ well-defined gradients everywhere: the derivative of softplus is sigmoid(x), which improves training stability in certain models (see the sketch below).
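a small pytorch sketch (the input values are illustrative) comparing the gradients relu and softplus produce for negative inputs:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-4.0, 2.0, 7, requires_grad=True)  # includes negative inputs

# relu: zero output and zero gradient for x < 0 -> "dead" region
F.relu(x).sum().backward()
print(x.grad)          # 0 for every negative input

x.grad = None          # reset gradients before the second pass

# softplus: derivative is sigmoid(x), small but nonzero for x < 0
F.softplus(x).sum().backward()
print(x.grad)          # strictly positive everywhere
```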
conclusion
the linear activation is the best choice for standard regression in dl as it imposes no constraints on predictions. softplus is useful when outputs must be strictly positive, providing smoother gradients than relu. while relu remains computationally cheaper, softplus offers stability in specific cases that require positive continuous outputs.
Sirine Amrane