How do you beat overfitting in Machine Learning? Part 2: L1 and L2 Regularization

Sirine Amrane
7 min read · Jan 24, 2025


Overfitting occurs when a machine learning model becomes too specialized in the training data, capturing not only the underlying patterns but also the noise and irrelevant details. This results in a model that performs exceptionally well on the training dataset but fails to generalize to new, unseen data. To combat overfitting, we use various techniques to simplify the model and improve its ability to generalize. In this article, we’ll explore two fundamental methods of regularization: L1 (Lasso) and L2 (Ridge). We’ll also introduce Elastic Net, a hybrid approach that combines the best of both worlds.

L1 (Lasso) and L2 (Ridge) regularizations are fundamental techniques in machine learning and artificial intelligence for reducing overfitting. These methods are applied across a wide range of models, including classical algorithms such as linear regression, logistic regression, and Support Vector Machines (SVM), as well as deep learning architectures. However, in deep learning models, which manage a significantly higher number of parameters, these regularization techniques are often combined with other methods.

L1 and L2 regularizations are also employed to control the weights of connections between neurons in neural networks. In deep learning, they are frequently paired with additional techniques like dropout, discussed in the previous article, which randomly “deactivates” neurons during training to improve generalization.

The Solution: L1 and L2 Regularization

To prevent your model from “memorizing” the training data excessively, you need to impose some discipline.

How? By adding a penalty term to the model’s loss function. This penalty depends on the size of the model’s weights wⱼ.

The classic loss function

As a reminder, when training a model, the objective is to minimize a loss function. There are several possible choices for this function; one common example is the Mean Squared Error (MSE):

  • MSE (Mean Squared Error): the least-squares criterion, used to minimize the average squared error between the predictions and the actual values
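
In standard notation:

\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2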

where:

  • yᵢ is the actual value for observation i,
  • ŷᵢ is the predicted value for observation i,
  • n is the total number of observations.

This function measures how well your model predicts the output. However, if left unchecked, the model might minimize the loss at all costs, even by inflating certain weights. That’s where regularization comes in.

Adding Regularization

L1 Regularization (Lasso)

A penalty proportional to the sum of the absolute values of the weights is added:
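
\lambda \sum_{j=1}^{p} |w_j|

where p is the number of weights and λ (lambda) controls the strength of the penalty (more on λ below).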

  • Effect: L1 regularization forces some weights to become exactly zero (sparsity), effectively eliminating certain features.
  • Advantage: It reduces complexity and performs automatic feature selection.
  • Limitation: It may struggle when features are highly correlated, as it tends to arbitrarily keep only one feature from a correlated group.

L2 Regularization (Ridge)

A penalty proportional to the sum of the squared weights is added:
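
\lambda \sum_{j=1}^{p} w_j^2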

  • Effect: L2 regularization reduces the weights but does not set them to zero, meaning all features are retained.
  • Advantage: Ideal when all features are relevant and helps prevent excessively large weights.
  • Limitation: Does not perform feature selection as strictly as L1.

Here are the regularization formulas, written out with the MSE as the base loss:
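
L_{\text{Lasso}}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} |w_j|

L_{\text{Ridge}}(w) = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p} w_j^2

The same penalties can be added to any loss function; the MSE is used here only because it is the example above.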

Why Does Regularization Work Against Overfitting?

When you add L1 or L2 regularization to your loss function, these penalties constrain the model. It can no longer minimize the error by inflating weights, which forces the model to find a simpler solution that generalizes better.

  • In simple terms: A regularized model avoids “cheating” by amplifying weights for specific details in the training data.
  • Result: Less reliance on specific data points, leading to better generalization.
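
To make this concrete, here is a minimal NumPy sketch of how the penalty changes the quantity being minimized. The data and the value of lambda are made up for illustration; real training would adjust the weights to minimize these penalized losses.

import numpy as np

# Toy data: 4 observations, 3 features
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 0.3],
              [3.0, 4.0, 0.8],
              [4.0, 3.0, 0.1]])
y = np.array([3.0, 3.5, 7.2, 7.1])

weights = np.array([1.2, 0.8, 5.0])  # candidate weights; the last one is suspiciously large
lam = 0.5                            # regularization strength (lambda)

predictions = X @ weights
mse = np.mean((y - predictions) ** 2)

l1_loss = mse + lam * np.sum(np.abs(weights))  # Lasso-style objective
l2_loss = mse + lam * np.sum(weights ** 2)     # Ridge-style objective

print(mse, l1_loss, l2_loss)  # the large weight now has a visible cost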

3. Visual Impact of L1 and L2

Imagine a plot where each axis represents a weight (the classic illustration uses two weights). The model tries to minimize the loss function in this space.

  • With L1: The penalty creates a diamond-shaped constraint region. The optimum often lands on a corner of the diamond, which sits on an axis, making some weights exactly zero.
  • With L2: The penalty creates a circular (spherical) constraint region. The optimum typically lies where the loss contours touch the circle, away from the axes, leading to smaller but non-zero weights.

4. Practical Example: Predicting House Prices

Imagine a regression model predicting house prices using features like:

  • Surface area (a),
  • Number of rooms (b),
  • Proximity to the city center (c),
  • Proximity to schools (d),
  • Year of construction (e).

Without Regularization

The model might assign very high weights to certain irrelevant variables (e.g., a specific year that only applies to training data).

With L1

L1 regularization could eliminate unimportant variables, such as c and d, and focus on the most impactful features like a and b.

With L2

L2 regularization would reduce the weights for all variables, keeping c and d but with less influence.
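
To see both behaviors side by side, here is a small scikit-learn sketch on synthetic data. The five columns stand in for the features a through e, and only the first two actually drive the price; the coefficients you get will vary with the random seed and the chosen penalties.

import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))                 # columns stand in for a, b, c, d, e
price = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=n)  # only a and b matter

X = StandardScaler().fit_transform(X)       # same scale for every feature

lasso = Lasso(alpha=0.1).fit(X, price)
ridge = Ridge(alpha=10.0).fit(X, price)

print("Lasso coefficients:", lasso.coef_)   # weights for c, d, e pushed to (or near) zero
print("Ridge coefficients:", ridge.coef_)   # every weight kept, but shrunk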

5. Limitations and Precautions

=> Properly adjusting the hyperparameter λ:

λ controls the strength of regularization. When you add regularization (L1 or L2) to your loss function, λ is the weight you assign to this penalty.

  • If λ is too low, the regularization will have no effect.
  • If λ is too high, the model will become underfitted (it will not be complex enough to capture real patterns).
  • Simply put: The larger λ, the stronger the regularization. Your model will be penalized more heavily for having large weights.

=> Data Quality: Regularization is not a magic wand. If your data is noisy or poorly prepared, it will not be enough to solve the issues.

=> Noisy Data:

As a reminder, noisy data contains irrelevant or erroneous information that does not reflect true patterns.
Example: You want to predict house prices, but:

  • One house has an unusually high price (a data entry error).
  • Another house is incorrectly recorded as having a surface area of “5000 m²” when it is actually 50 m².

These anomalies introduce noise: your model might learn these errors as if they were actual patterns, degrading its performance.

=> Poorly Prepared Data:

As a reminder, this means your data has not been properly cleaned or normalized before training the model. Here are some examples of poor preparation:

  • Different Scales: If surface area is in m² (large values) and the number of rooms is between 1 and 5, the model might give more importance to the surface area just because of its scale.
  • Missing Values: Some houses lack information about the construction year. If certain values are missing, this can cause technical errors or be interpreted as 0, which completely skews calculations.
  • Irrelevant Data: If you include useless columns (e.g., wall color), your model may get distracted and perform worse.
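
Poor preparation is usually handled before (or inside) the modeling pipeline. Here is a hedged sketch with scikit-learn, assuming a small feature matrix with a missing value; the numbers are invented for the example.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Toy data: surface area in m² and number of rooms, with one missing value
X = np.array([[120.0, 4.0],
              [ 50.0, 2.0],
              [np.nan, 3.0],
              [ 80.0, 3.0]])
y = np.array([300_000, 150_000, 220_000, 210_000])

model = make_pipeline(
    SimpleImputer(strategy="median"),  # fill missing values instead of letting them break training
    StandardScaler(),                  # put m² and room counts on comparable scales
    Ridge(alpha=1.0),                  # regularized model on clean, scaled inputs
)
model.fit(X, y)
print(model.predict([[100.0, 3.0]]))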

How does it work concretely?

Let’s take an example with L2 (Ridge):

  • If λ = 0: there is no regularization, and the model behaves exactly as it would without it.
  • If λ is very large: the regularization is so strong that the model shrinks all weights toward zero, even at the cost of precision.
  • If λ is well-chosen: the weights are small enough to avoid overfitting but not so small that the model stops learning useful patterns.

Practical Tip: We often choose λ using cross-validation to find the best balance between underfitting (not complex enough) and overfitting (too complex).
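
Both points can be seen in a few lines with scikit-learn, where λ is exposed as the alpha parameter; the candidate values below are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, RidgeCV

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Effect of lambda (alpha): larger values shrink the weights more
for alpha in [0.001, 1.0, 100.0, 10_000.0]:
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(f"alpha={alpha:>8}: largest |weight| = {np.max(np.abs(coefs)):.2f}")

# Choosing lambda by cross-validation
ridge_cv = RidgeCV(alphas=[0.01, 0.1, 1.0, 10.0, 100.0], cv=5).fit(X, y)
print("alpha chosen by cross-validation:", ridge_cv.alpha_)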

6. The Alliance of L1 and L2: Elastic Net

Elastic Net is a regularization method that combines the strengths of L1 (Lasso), namely strict feature selection, with those of L2 (Ridge), namely stability.

It is a hybrid solution that seeks to take the best of both worlds.
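
In its usual formulation, the Elastic Net penalty simply adds the two terms, each with its own strength:

L_{\text{ElasticNet}}(w) = \text{MSE} + \lambda_1 \sum_{j=1}^{p} |w_j| + \lambda_2 \sum_{j=1}^{p} w_j^2

Libraries often re-parameterize this as a single overall strength plus a mixing ratio between the L1 and L2 parts.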

Why use Elastic Net?

Elastic Net is particularly useful in the following cases:

  • Correlated Variables: Lasso alone (L1) may not perform well if several explanatory variables are highly correlated (e.g., distance to a school and distance to a city center). Elastic Net keeps these variables while regularizing them.
  • Feature Selection: Lasso alone (L1) can select a small number of relevant variables but may ignore some that make weak contributions. Elastic Net retains relevant variables thanks to L2 while reducing noise through L1.
  • Improving Performance: L1 selects the relevant variables (by setting some weights to 0), and L2 reduces the weights of the selected variables to avoid extreme values (see the sketch below).
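
Here is a minimal scikit-learn sketch on synthetic data with two highly correlated features. The l1_ratio parameter sets the mix between the L1 and L2 parts, and the values used are purely illustrative.

import numpy as np
from sklearn.linear_model import Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # x2 is nearly a copy of x1 (highly correlated)
x3 = rng.normal(size=n)                    # irrelevant feature
X = StandardScaler().fit_transform(np.column_stack([x1, x2, x3]))
y = 2.0 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)  # 50% L1, 50% L2

print("Lasso coefficients:      ", lasso.coef_)  # often keeps mostly one of the correlated pair
print("Elastic Net coefficients:", enet.coef_)   # tends to spread the weight across x1 and x2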

7. Conclusion

Regularizing a model with L1 or L2 is like putting an overindulgent athlete on a diet: it makes the model more efficient and less reliant on unnecessary excesses. Here’s a summary of their strengths. Regularization with L1 and L2 is very simple to implement, works effectively across various contexts and models, and reduces overfitting while preserving the model’s ability to learn from the data.

L1 (Lasso):

  • Reduces complexity.
  • Performs feature selection.

L2 (Ridge):

  • Reduces variance.
  • Stabilizes weights.

Elastic Net:

  • Combines L1 and L2 to handle variable correlations and overcome the limitations of each method.
  • Through its L1 component, encourages sparsity (some weights become exactly 0 => ideal for feature selection).
  • Through its L2 component, reduces the magnitude of the remaining weights without forcing them to 0 => ideal for handling correlated variables.

By mastering these techniques, you can build machine learning models that are both powerful and robust, capable of generalizing well across a wide variety of datasets.
