optimization algorithms in ml (gradient descent) and dl (stochastic gradient descent, adam)

Sirine Amrane
Feb 3, 2025


1. what is an optimization algorithm?

an optimization algorithm is a mathematical method used to find the best possible values for a set of variables by minimizing (or maximizing) a given objective function.

  • in machine learning, it optimizes model parameters to reduce errors,
  • while in deep learning, it adjusts millions of weights and biases through gradient-based methods like SGD or Adam (which we will cover below) to improve neural network performance

2. are optimization algorithms used in dl or in ml?

optimization algorithms are used in both ml and dl, but in different ways.

in ml

  • used to minimize a cost function in models like linear regression, svm, and xgboost.
  • examples: gradient descent, newton’s method, l-bfgs.

in dl

  • essential for training neural networks through backpropagation.
  • examples: sgd (stochastic gradient descent), adam, rmsprop.

📌 key difference:

  • in ml, optimization methods are often analytical or based on simple gradients.
  • in dl, they are more complex as they optimize millions of parameters in deep neural networks.

💡 conclusion: optimization algorithms exist in both, but they are essential and more advanced in deep learning.

example: training a neural network

when a deep learning model learns, it adjusts its weights to minimize a cost function, which measures the discrepancy between its predictions and actual values. optimization algorithms guide this weight update process so that the error gradually decreases, improving the model’s performance.

real-world example: portfolio allocation in finance

an investor seeks to allocate capital among different stocks to achieve the best return on investment while limiting risk. an optimization algorithm can test various allocations and identify the optimal combination based on predefined criteria.

how to compare optimization algorithms?

engineers compare optimization algorithms based on three key factors:

  1. convergence speed (often expressed as an order of magnitude, o(.)): how quickly, in terms of the number of iterations, the algorithm reaches an optimal solution.
  2. stability — the impact of hyperparameters (especially the learning rate, the most critical hyperparameter).
  3. robustness — performance in the presence of noise and local optima.

3. fundamentals: what is a gradient?

in mathematics, deep learning and machine learning, the gradient is a measure of a function’s slope. it indicates in which direction and with what intensity a variable should be modified to increase or decrease the function’s value.

example: if a function represents a mountain, the gradient is the direction where the slope is steepest.

mathematical definition:

  • for a function f(x) with a single variable, the gradient is simply the derivative, which indicates the slope at a given point.
  • for a function f(x, y, …) with multiple variables, the gradient is a vector containing all partial derivatives.
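
to make this concrete, here is a minimal sketch using pytorch's autograd (the same library used in the code examples later in this article); the function f(x, y) = x² + 3y and the evaluation point are illustrative assumptions:

import torch

# illustrative function of two variables: f(x, y) = x^2 + 3y
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(1.0, requires_grad=True)
f = x ** 2 + 3 * y
f.backward()   # fills in all the partial derivatives
print(x.grad)  # ∂f/∂x = 2x → tensor(4.)
print(y.grad)  # ∂f/∂y = 3  → tensor(3.)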

visual example:

imagine a hill:

  • if you are at the top, the gradient is close to zero (no slope).
  • if you are on a steep slope, the gradient is large (steep descent).
  • if you want to go downhill, you must move in the opposite direction of the gradient.

machine learning part

1. the most basic optimization algorithm for ml: gradient descent (gd)

gradient descent (gd) is the primary algorithm used in ml to minimize a cost function by gradually adjusting variables in the opposite direction of the gradient. its goal is to adjust weights to minimize the chosen cost function (such as the error between predictions and actual values).

a) key points:

  • uses all data to compute parameter updates at each iteration.
  • follows a stable trajectory but can be very slow for large datasets.

b) analogy:

imagine you want to determine if a restaurant is good. you interview all customers before forming an opinion (very accurate but slow).

c) process:

  1. initialize random values for the parameters (e.g., neural network weights).
  2. compute the gradient of the cost function with respect to the parameters.
  3. update the parameters following this rule:

θt+1 = θt − α ∇f(θt)

where:

  • θt is the current parameter
  • α is the learning rate
  • ∇f(θt) is the gradient

4. repeat the process over the entire dataset until the cost function is minimized (it converges).
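
the whole loop fits in a few lines. here is a minimal sketch of steps 1-4 in plain python, where the quadratic cost f(θ) = (θ − 3)² and the learning rate are illustrative assumptions (the same quadratic reappears in example 1 below):

# batch gradient descent on f(θ) = (θ - 3)^2, whose gradient is 2(θ - 3)
theta = 5.0   # 1. initialize the parameter (arbitrary starting point)
alpha = 0.1   # learning rate
for t in range(50):
    grad = 2 * (theta - 3)        # 2. gradient of the cost function at θt
    theta = theta - alpha * grad  # 3. update rule: θt+1 = θt - α ∇f(θt)
print(theta)  # 4. after enough iterations, θ converges toward the minimum at 3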

deep learning part

1. an improved version of gd for dl: stochastic gradient descent (sgd)

stochastic gradient descent (sgd) is a variation of classic gradient descent that updates model parameters after each data sample instead of waiting to compute the gradient on the entire dataset. this speeds up optimization, especially for large datasets, but at the cost of higher variance in updates.

unlike standard gradient descent, which follows a smoother trajectory toward the global minimum, sgd follows a more irregular path but can escape local minima more easily — an advantage in training deep neural networks. however, this instability can also slow convergence, which is why variants like momentum, rmsprop, and adam were developed to improve robustness and efficiency.

a) key points:

  • updates parameters after each data sample, without waiting to process the whole dataset.
  • much faster, but updates are noisy and unstable.

b) analogy:

you ask only one restaurant customer for their opinion and decide immediately if the restaurant is good or not (fast but not always reliable).

c) process:

  1. choose a single sample or mini-batch instead of the entire dataset.
  2. compute the gradient of the cost function for this sample.
  3. immediately update the parameters after each sample or mini-batch, allowing multiple updates per iteration.
  4. repeat the process across many samples from the dataset
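
a minimal sketch of this per-sample loop for a one-parameter linear model y ≈ w·x; the synthetic dataset, learning rate, and epoch count are illustrative assumptions:

import numpy as np

# synthetic data: y = 4x plus noise (true weight = 4)
rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 4 * X + rng.normal(scale=0.1, size=100)

w, lr = 0.0, 0.01
for epoch in range(5):
    for i in rng.permutation(len(X)):        # 1. pick one sample at a time
        grad = 2 * (w * X[i] - y[i]) * X[i]  # 2. gradient of (w·x - y)^2 for this sample
        w -= lr * grad                       # 3. update immediately after each sample
print(w)  # 4. after several passes, w approaches 4 despite the noisy updates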

2. the most widely used algorithm for dl: adam

adam is the most widely used optimization algorithm for training deep learning models.
it improves classical gradient descent by dynamically adjusting the learning rate for each model parameter.

a) key points:

adam = sgd + momentum + rmsprop
adam combines two ideas to make optimization more efficient and stable:

  1. momentum — keeps an average of past gradients to smooth parameter updates and prevent excessive oscillations.
  2. rmsprop — dynamically adjusts the learning rate based on the variance of the gradients for each parameter, preventing it from being too high or too low.
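
in update-rule form (standard textbook formulations, written in the same notation as the gd rule above):

  • momentum: vt = β vt−1 + ∇f(θt), then θt+1 = θt − α vt
  • rmsprop: st = ρ st−1 + (1 − ρ) ∇f(θt)², then θt+1 = θt − α ∇f(θt) / (√st + ε)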

b) analogy:

you consider a running average of past reviews and adjust your judgment based on the reliability of the sources.

c) process:

  1. compute the gradient of the cost function.
  2. adam maintains moving averages of both: the first moment (mean) of the gradients, and the second moment (variance) of the gradients.
  3. apply bias corrections to compensate for initially biased estimates.
  4. update parameters using the estimated moments and an adaptive learning rate for each parameter.
  5. repeat the process with adaptive updates, allowing adam to converge faster and more efficiently in most cases.
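
a minimal from-scratch sketch of these five steps on the same quadratic f(θ) = (θ − 3)²; the hyperparameters below are the commonly used defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8), taken here as illustrative assumptions:

import math

theta, lr = 5.0, 0.1
beta1, beta2, eps = 0.9, 0.999, 1e-8
m, v = 0.0, 0.0                                 # first and second moment estimates
for t in range(1, 201):
    grad = 2 * (theta - 3)                      # 1. gradient of the cost function
    m = beta1 * m + (1 - beta1) * grad          # 2. moving average of gradients (mean)
    v = beta2 * v + (1 - beta2) * grad ** 2     # 2. moving average of squared gradients (variance)
    m_hat = m / (1 - beta1 ** t)                # 3. bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)                # 3. bias correction for the second moment
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)  # 4. adaptive per-parameter update
print(theta)  # 5. converges to ≈ 3, the minimum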

3. several tests

example 1: testing on an easy (quadratic) function with different optimization algorithms (here adam, sgd, and gd) and visualizing which one converges the fastest
- we define a simple quadratic loss function, f(x) = (x - 3)²
- we train with the 3 algorithms: adam, sgd, and gd
- we plot their convergence curves to see which one converges the fastest
- we compare their performance by plotting the loss over the iterations

import torch
import matplotlib.pyplot as plt

# loss function to minimize: f(x) = (x - 3)^2
def loss_function(x):
    return (x - 3) ** 2

# train with a given optimizer
def train_optimizer(optimizer_name, learning_rate=0.1, epochs=100):
    x = torch.tensor([5.0], requires_grad=True)  # initialization
    optimizers = {
        "GD": torch.optim.SGD([x], lr=learning_rate),
        "SGD": torch.optim.SGD([x], lr=learning_rate, momentum=0),  # classic version, no momentum
        "Adam": torch.optim.Adam([x], lr=learning_rate)
    }

    optimizer = optimizers[optimizer_name]
    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_function(x)
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# parameters
epochs = 50
learning_rate = 0.1

# train with each optimizer
losses_gd = train_optimizer("GD", learning_rate, epochs)
losses_sgd = train_optimizer("SGD", learning_rate, epochs)
losses_adam = train_optimizer("Adam", learning_rate, epochs)

# plot the convergence curves
plt.figure(figsize=(8, 5))
plt.plot(losses_gd, label="GD (batch gradient descent)", linestyle="--")
plt.plot(losses_sgd, label="SGD (stochastic gradient descent)", linestyle="-.")
plt.plot(losses_adam, label="Adam", linestyle="-")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Comparison of optimization algorithms")
plt.legend()
plt.grid()
plt.show()

conclusion: GD and SGD were as fast as Adam → no clear advantage on a simple function; GD and SGD are more than enough here, and Adam brings nothing extra.

example 2: testing on a more difficult function, with local minima: f(x) = sin(5x) + 0.5x²

import torch
import matplotlib.pyplot as plt

# function with oscillations and complex curvature (hard for GD/SGD)
def loss_function(x):
    return torch.sin(5 * x) + 0.5 * x ** 2

# train with a given optimizer
def train_optimizer(optimizer_name, learning_rate, epochs, initial_x, use_scheduler=False):
    x = torch.tensor([initial_x], requires_grad=True)
    optimizers = {
        "GD": torch.optim.SGD([x], lr=learning_rate),
        "SGD": torch.optim.SGD([x], lr=learning_rate, momentum=0),
        "Adam": torch.optim.Adam([x], lr=0.1, betas=(0.9, 0.999))  # Adam keeps a high lr
    }

    optimizer = optimizers[optimizer_name]
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.8) if use_scheduler else None

    losses = []
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_function(x)
        loss.backward()
        optimizer.step()
        if scheduler:
            scheduler.step()
        losses.append(loss.item())

    return losses

# new parameters
epochs = 200
learning_rates = {"GD": 0.01, "SGD": 0.01}  # lower lr to avoid divergence
initial_x = 5.0  # start far away to see Adam's effect

# train with each optimizer
losses_gd = train_optimizer("GD", learning_rates["GD"], epochs, initial_x)
losses_sgd = train_optimizer("SGD", learning_rates["SGD"], epochs, initial_x)
losses_adam = train_optimizer("Adam", 0.1, epochs, initial_x, use_scheduler=True)  # Adam with a large lr

# plot the convergence curves
plt.figure(figsize=(8, 5))
plt.plot(losses_gd, label="GD (batch gradient descent)", linestyle="--")
plt.plot(losses_sgd, label="SGD (stochastic gradient descent)", linestyle="-.")
plt.plot(losses_adam, label="Adam", linestyle="-")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Comparison of algorithms on a harder function")
plt.legend()
plt.grid()
plt.show()

conclusion:

  • GD and SGD got stuck in a local minimum (loss ≈ 10), unable to escape,
  • whereas adam adapted and found a better solution: instead of stopping at the first minimum, it explored further, and its learning rate was gradually reduced (via the StepLR scheduler) so it would not oscillate indefinitely. it ended with a much lower loss (≈ 2) versus ≈ 10 for GD/SGD, managing to get past the traps of the local minima.

in summary:
- GD: update after computing the gradient on the entire dataset.
- SGD: update after computing the gradient on a single sample or mini-batch.
- Adam: combination of SGD with adaptive moments for better performance in complex environments.

4. combining multiple algorithms for better performance

engineers don’t rely on just one optimization algorithm; they combine them intelligently.

example :

adam + lbfgs: use adam for early iterations, then switch to lbfgs to refine convergence.
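
a minimal sketch of this hand-off in pytorch, reusing the quadratic from example 1; the split of 30 adam steps followed by one lbfgs refinement call is an illustrative assumption:

import torch

x = torch.tensor([5.0], requires_grad=True)
loss_fn = lambda x: (x - 3) ** 2

# phase 1: adam for fast, robust early progress
adam = torch.optim.Adam([x], lr=0.1)
for _ in range(30):
    adam.zero_grad()
    loss = loss_fn(x)
    loss.backward()
    adam.step()

# phase 2: lbfgs to refine convergence (requires a closure that recomputes the loss)
lbfgs = torch.optim.LBFGS([x], max_iter=20)
def closure():
    lbfgs.zero_grad()
    loss = loss_fn(x)
    loss.backward()
    return loss
lbfgs.step(closure)
print(x.item())  # ≈ 3.0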

conclusion

optimization algorithms are everywhere in deep learning, machine learning and many other fields. you have to choose based on context:

  • adam for complex problems (deep learning)
  • GD and SGD for easy problems (linear models)

mastering these tools and deeply understanding how they work can make all the difference in optimizing models with precision.

Sirine Amrane
