loss functions in ml and dl, part 1: huber loss, quantile loss, tweedie loss, log-cosh loss
in regression models in ml or dl, the choice of the loss function directly influences the stability and accuracy of predictions. a poor selection can make a model too sensitive to outliers, too slow to converge, or unable to capture underlying trends in the data.
some functions, like mse (mean squared error), are ideal when errors follow a normal distribution, but they become problematic in the presence of outliers. others, such as huber loss or log-cosh loss, attempt to combine the advantages of mse and mae for better robustness to outliers. in specific cases, quantile loss allows estimating different percentiles of a distribution instead of the mean, and tweedie loss is particularly effective for modeling asymmetric distributions.
this article analyzes these loss functions in detail, comparing their impact on model training and specifying their optimal application domains. the goal is to understand when and why to use each function to improve the performance of a machine learning model.
⚠️ important: these are loss functions, not performance metrics. they are used to optimize a model during training, not to evaluate it after training the way rmse or r² are. some quantities, such as mae and mse, can serve both as validation metrics and as loss functions, but that is not the focus here: everything below is meant for model training.
huber loss — a hybrid between mse and mae to handle outliers
ideal use cases:
- regression tasks with noise (outliers)
- models requiring robustness to high variance
- systems with measurement errors (sensors, finance, etc.)
how it works:
- combines the advantages of mse (accuracy for small errors) and mae (robustness to outliers).
- the δ (delta) parameter controls the transition between mse and mae.
impact of the δ (delta) parameter:
- a small δ (e.g., δ = 1) makes the loss behave more like mae: errors quickly fall into the linear regime, so outliers have less influence, but gradients carry less information about moderate errors.
- a large δ (e.g., δ = 10) makes the loss behave more like mse: a wider range of errors is penalized quadratically, which speeds up convergence but increases sensitivity to outliers.
=> for errors |y − ŷ| below δ, the loss is quadratic (like mse), giving smooth, well-scaled gradients for small errors.
=> for errors above δ, the loss grows linearly (like mae), which prevents outliers from dominating the learning process.
💡 tip: δ is often optimized through cross-validation.
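to make this concrete, here is a minimal numpy sketch of huber loss (the 0.5 factor and the per-batch mean are common conventions, not the only possible ones):

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """huber loss: quadratic for |error| <= delta, linear beyond."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2                  # mse-like branch (small errors)
    linear = delta * (abs_error - 0.5 * delta)    # mae-like branch (large errors)
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

# example: the last error (20) only contributes linearly, not quadratically
y_true = np.array([3.0, 5.0, 2.0, 30.0])
y_pred = np.array([2.5, 5.5, 2.0, 10.0])
print(huber_loss(y_true, y_pred, delta=1.0))
```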
log-cosh loss — a smoother alternative to huber
log-cosh is a loss function similar to huber but with a smoother transition and no critical threshold.
why use it?
- when you need to limit the impact of outliers while keeping a well-defined gradient
- to avoid huber’s threshold effect
- in deep learning, where smoother gradients improve optimization
it behaves like mse for small errors and mae for large errors, but with a more gradual transition, improving training stability.
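a minimal numpy sketch, written with a numerically stable reformulation of log(cosh(x)) as |x| + log(1 + exp(-2|x|)) - log(2) to avoid overflow on large errors (an implementation choice, not part of the definition):

```python
import numpy as np

def log_cosh_loss(y_true, y_pred):
    """log-cosh loss: ~ 0.5 * error**2 for small errors, ~ |error| - log(2) for large ones."""
    error = y_pred - y_true
    abs_error = np.abs(error)
    # stable evaluation of log(cosh(error)); naive np.log(np.cosh(error)) overflows for large errors
    return np.mean(abs_error + np.log1p(np.exp(-2.0 * abs_error)) - np.log(2.0))

y_true = np.array([3.0, 5.0, 2.0, 30.0])
y_pred = np.array([2.5, 5.5, 2.0, 10.0])
print(log_cosh_loss(y_true, y_pred))
```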
quantile loss (pinball loss) — capturing quantiles of a distribution
unlike standard loss functions that minimize a single global error, quantile loss makes it possible to estimate specific quantiles of the conditional distribution of the target variable (a minimal sketch follows the use cases below).
ideal use cases:
- uncertainty estimation (quantile regression)
- estimating distribution tails (e.g., extreme risks in finance with τ = 0.05 or τ = 0.95)
- applications in meteorology, risk management, and insurance
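for reference, a minimal numpy sketch of the pinball loss; the quantile level tau is a free parameter (tau = 0.9 below is just an example, and tau = 0.5 recovers a loss proportional to mae):

```python
import numpy as np

def quantile_loss(y_true, y_pred, tau=0.9):
    """pinball loss: under-predictions are weighted by tau, over-predictions by (1 - tau)."""
    error = y_true - y_pred
    return np.mean(np.maximum(tau * error, (tau - 1.0) * error))

y_true = np.array([10.0, 12.0, 9.0, 11.0])
y_pred = np.array([9.0, 13.0, 9.5, 10.0])
print(quantile_loss(y_true, y_pred, tau=0.9))   # under-prediction costs 9x more than over-prediction
```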
tweedie loss — modeling skewed distributions
tweedie loss is designed to handle highly asymmetric (skewed) distributions, in particular compound poisson-gamma distributions, often found in insurance and accounting (e.g., modeling claim amounts). a minimal sketch follows the list of use cases below.
ideal use cases:
- when working with asymmetric or count-based data (e.g., insurance claims, demand modeling)
- in actuarial science and insurance to estimate indemnity amounts
- when data contains a large number of zeros (zero-inflated data)
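a minimal numpy sketch of tweedie loss in the negative log-likelihood form commonly used by gradient-boosting libraries, assuming a power parameter 1 < p < 2 (compound poisson-gamma), non-negative targets, and strictly positive predictions; terms that do not depend on the prediction are dropped:

```python
import numpy as np

def tweedie_loss(y_true, y_pred, p=1.5):
    """tweedie negative log-likelihood (up to a constant), for 1 < p < 2."""
    # y_true >= 0 (many exact zeros are allowed), y_pred > 0 (often obtained as exp(raw_score))
    return np.mean(-y_true * np.power(y_pred, 1.0 - p) / (1.0 - p)
                   + np.power(y_pred, 2.0 - p) / (2.0 - p))

y_true = np.array([0.0, 0.0, 120.0, 0.0, 45.0])   # zero-inflated claim amounts (toy values)
y_pred = np.array([5.0, 3.0, 80.0, 4.0, 50.0])
print(tweedie_loss(y_true, y_pred, p=1.5))
```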
can the choice of loss function be automated?
yes, but with limitations:
- in machine learning, automl (automated machine learning) can test multiple loss functions on a dataset and select the best-performing one.
- in deep learning, you can build a pipeline that experiments with different loss functions and picks the one with the best validation performance (see the sketch after this list).
- however, it is crucial to define a clear evaluation criterion (mae, mse, rmse, mape…) to compare results.
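as an illustration, a minimal sketch of such a selection loop with scikit-learn's GradientBoostingRegressor on a synthetic dataset, using validation mae as the comparison criterion (the loss names below match recent scikit-learn versions; swap in your own data and metric):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# synthetic regression data, just to make the loop runnable end to end
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

results = {}
for loss in ["squared_error", "absolute_error", "huber", "quantile"]:
    model = GradientBoostingRegressor(loss=loss, random_state=0)
    model.fit(X_train, y_train)
    results[loss] = mean_absolute_error(y_val, model.predict(X_val))

best = min(results, key=results.get)
print(results)
print("best training loss according to validation mae:", best)
```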
final table
criteria:
- a) sensitivity to small errors
this refers to how the loss function reacts to small prediction errors.
high sensitivity means that even a small error causes a large variation in the loss function (e.g., mse, which heavily penalizes deviations).
low sensitivity means the loss function is less affected by small errors (e.g., mae, which reacts less aggressively to deviations).
- b) outlier robustness:
this refers to a loss function’s ability to not be overly influenced by extreme values (outliers).
a non-robust function (like mse) is strongly affected by extreme values because errors are squared.
a robust function (like mae, huber, or quantile) limits the impact of outliers by reducing the penalty associated with large errors, as the short numerical comparison below illustrates.
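a quick way to see both criteria on the same residuals (the numbers below are made up; the last residual simulates an outlier):

```python
import numpy as np

errors = np.array([0.5, -0.3, 0.8, -0.2, 25.0])   # last residual is an outlier

mse = np.mean(errors ** 2)                         # dominated by the single outlier (25**2 = 625)
mae = np.mean(np.abs(errors))                      # the outlier only contributes linearly
delta = 1.0
huber = np.mean(np.where(np.abs(errors) <= delta,
                         0.5 * errors ** 2,
                         delta * (np.abs(errors) - 0.5 * delta)))

print(f"mse={mse:.2f}  mae={mae:.2f}  huber={huber:.2f}")
```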
conclusion
choosing the right loss function remains a critical decision that depends on the nature of the problem, the data distribution, and the specific constraints of the model. mse remains an effective default option, but alternatives like mae, huber, and log-cosh provide better robustness to outliers. quantile loss is essential for uncertainty estimation, while tweedie loss is a powerful choice for skewed, zero-inflated data.
Sirine Amrane