reinforcement learning, part 1: introduction

Sirine Amrane
3 min read · Feb 9, 2025


reinforcement learning (rl) is one of the three main types of machine learning (ml). it allows an agent to learn to make optimal decisions through interaction with an environment. unlike supervised and unsupervised learning, rl is driven by rewards and penalties, a learning paradigm inspired by behavioral conditioning in psychology.

this article provides an introduction to rl, comparing it to traditional machine learning and deep learning, and explaining its fundamental mechanisms, applications, and challenges.

1. how reinforcement learning works

rl relies on continuous interaction between an agent and an environment. the goal is to learn a policy that maximizes the cumulative reward collected over the long term.

1.1. key components of rl

  • agent: the entity that makes decisions.
  • environment: the framework in which the agent operates.
  • actions (a): possible choices made by the agent.
  • state (s): representation of the environment at a given time.
  • reward (r): feedback signal indicating the quality of an action.
  • policy (π): strategy that defines the action to be taken in each state.
  • value function (v(s)): estimates the quality of a given state.
  • action-value function (q(s, a)): estimates the quality of an action in a given state.

rl is often modeled as a markov decision process (mdp), defined by a tuple (s, a, p, r), where p(s′∣s, a) is the probability of transitioning to state s′ after taking action a in state s, and r is the reward function. a discount factor γ is usually added to balance immediate rewards against future ones.
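to make this concrete, here is a minimal python sketch of the agent-environment loop on a toy two-state mdp. everything in it (the transition table, the rewards, and the random policy) is an illustrative assumption, not a standard benchmark:

```python
import random

# a toy mdp with two states {0, 1} and two actions {0, 1};
# the transition and reward tables below are made up for illustration
transitions = {  # p(s' | s, a) stored as {(s, a): [(s', probability), ...]}
    (0, 0): [(0, 0.9), (1, 0.1)],
    (0, 1): [(1, 0.8), (0, 0.2)],
    (1, 0): [(0, 0.5), (1, 0.5)],
    (1, 1): [(1, 1.0)],
}
rewards = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}  # r(s, a)

def step(state, action):
    """sample s' from p(s' | s, a) and return (s', r)."""
    next_states, probs = zip(*transitions[(state, action)])
    next_state = random.choices(next_states, weights=probs)[0]
    return next_state, rewards[(state, action)]

# the core rl loop: observe the state, act, receive a reward, repeat
state, total_reward = 0, 0.0
for t in range(10):
    action = random.choice([0, 1])  # a placeholder random policy π
    state, reward = step(state, action)
    total_reward += reward
print("return after 10 steps:", total_reward)
```

a real rl algorithm replaces the random policy with one that improves from the rewards it receives, which is exactly what the model categories in section 2 do.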

reminder: the three types of machine learning

machine learning (ml) consists of three main types:

  • supervised learning → the model is trained on labeled data (e.g., image classification, price prediction).
  • unsupervised learning → the model identifies hidden patterns in unlabeled data (e.g., clustering, dimensionality reduction).
  • reinforcement learning (rl) → an agent learns through trial and error by interacting with an environment to maximize cumulative rewards.

how rl differs from other ml types

  • unlike supervised learning, rl does not rely on static datasets but actively interacts with its environment, receiving reward-based feedback. it also considers the long-term consequences of actions rather than optimizing for immediate outputs.
  • unlike unsupervised learning, rl does not just find patterns: it explores its environment autonomously to improve its decision-making.

2. the three categories of rl models

a) value-based models

these models approximate a value function v(s) or q(s, a), then act by picking the action with the highest estimated value.
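q-learning is the classic example of this family. the sketch below runs it on a hypothetical 5-state chain (the environment, learning rate α, discount γ, and exploration rate ε are all made-up illustrations): after each step, q(s, a) is nudged toward r + γ·max q(s′, ·).

```python
import random
from collections import defaultdict

# hypothetical 5-state chain: action 1 moves right toward the goal
# (state 4, reward 1), action 0 moves left; purely illustrative
def step(state, action):
    next_state = max(0, min(4, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward, next_state == 4  # (s', r, done)

q = defaultdict(float)              # tabular q(s, a), initialized to 0
alpha, gamma, epsilon = 0.1, 0.9, 0.2

for episode in range(500):
    state, done = 0, False
    while not done:
        # ε-greedy: explore sometimes, otherwise act greedily w.r.t. q
        if random.random() < epsilon:
            action = random.choice([0, 1])
        else:
            action = max((0, 1), key=lambda a: q[(state, a)])
        next_state, reward, done = step(state, action)
        # the q-learning update: move q(s, a) toward r + γ·max_a' q(s', a')
        best_next = max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# v(s) ≈ max_a q(s, a): values grow as states get closer to the goal
print({s: round(max(q[(s, a)] for a in (0, 1)), 2) for s in range(5)})
```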

b) policy-based models

these models learn the optimal policy directly instead of estimating a value function.
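reinforce is the simplest member of this family. in the sketch below (a hypothetical two-armed bandit; the arm rewards and learning rate are made up), the policy is a softmax over two learnable preferences, and each preference moves along the gradient of log π(a) weighted by the received reward; no value function is estimated anywhere.

```python
import math
import random

prefs = [0.0, 0.0]          # policy parameters θ, one preference per arm
true_rewards = [0.2, 1.0]   # illustrative expected reward of each arm
lr = 0.1

def policy():
    """π(a) as a softmax over the preferences."""
    exp = [math.exp(p) for p in prefs]
    total = sum(exp)
    return [e / total for e in exp]

for _ in range(2000):
    probs = policy()
    action = random.choices([0, 1], weights=probs)[0]
    reward = random.gauss(true_rewards[action], 0.1)  # noisy sampled reward
    # reinforce update: θ ← θ + lr · r · ∇ log π(a);
    # for a softmax, ∂ log π(a) / ∂θ_b = 1{b = a} − π(b)
    for a in (0, 1):
        grad_log = (1.0 if a == action else 0.0) - probs[a]
        prefs[a] += lr * reward * grad_log

print(policy())  # should strongly prefer the better arm (index 1)
```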

c) multi-agent rl and adversarial learning

these advanced models are used in complex environments where multiple agents interact, cooperating or competing with one another.
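as a taste of what "multiple agents" looks like in code, here is a toy sketch of two independent reinforce learners playing matching pennies, a zero-sum adversarial game (the game and hyperparameters are illustrative assumptions). the two policies typically cycle around the 50/50 equilibrium instead of converging cleanly, which hints at why multi-agent rl is harder than the single-agent case.

```python
import math
import random

# matching pennies: agent 0 wins when the choices match, agent 1 when they differ
prefs = [[0.0, 0.0], [0.0, 0.0]]  # per-agent softmax policy parameters
lr = 0.05

def softmax(p):
    e = [math.exp(x) for x in p]
    s = sum(e)
    return [x / s for x in e]

for _ in range(5000):
    probs = [softmax(prefs[0]), softmax(prefs[1])]
    acts = [random.choices([0, 1], weights=pr)[0] for pr in probs]
    r0 = 1.0 if acts[0] == acts[1] else -1.0  # zero-sum rewards
    rewards = [r0, -r0]
    # each agent runs its own reinforce update from its own reward
    for i in (0, 1):
        for a in (0, 1):
            grad_log = (1.0 if a == acts[i] else 0.0) - probs[i][a]
            prefs[i][a] += lr * rewards[i] * grad_log

print([softmax(p) for p in prefs])  # both hover around [0.5, 0.5]
```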

3. when to use reinforcement learning (rl) and when it is unnecessary

when to use rl?

the environment is dynamic and uncertain
→ if your problem requires continuous adaptation, such as financial markets, games, real-time cybersecurity, or robotics.

the problem involves dependent sequences of actions
→ if a decision at one moment affects future decisions (e.g., trading, multi-stage cyber attacks, industrial planning).

there is a notion of delayed rewards
→ if feedback is not immediate (e.g., an attack algorithm that remains undetected for days before succeeding).

rules are not explicit or evolve over time
→ if the problem is ill-defined, constantly evolving, or strategic (e.g., responding to zero-day attacks, adversarial ai in finance).

when not to use rl?

the problem is static and well-defined
→ if it can be solved by a simple supervised algorithm (e.g., classifying email as spam or non-spam).

supervised learning is sufficient
→ if you have a large labeled dataset, a standard supervised model (cnn, xgboost, transformer) will be faster and more efficient than rl.

learning costs are too high
→ rl often requires millions of iterations and complex simulations before it learns anything useful (e.g., an rl model for high-frequency trading can take weeks to train on gpus).

the problem does not involve sequential decision-making
→ if each prediction is independent of previous decisions, rl is unnecessary (e.g., facial recognition, standard nlp tasks).

4. conclusion

reinforcement learning is a powerful approach that enables agents to learn through interaction and to optimize decision-making in dynamic environments. unlike traditional machine learning, rl focuses on sequential decision-making and long-term rewards, making it particularly effective in fields like finance, cybersecurity, and robotics.

as rl continues to evolve, advancements in policy optimization, multi-agent learning, and adversarial training will push its applications even further. in the next part, we will explore the core algorithms that drive rl, from value-based methods like deep q-networks to policy-gradient techniques.
