Linear Regression and Logistic Regression: Introduction
Linear and logistic regressions are two of the most widely used supervised learning techniques in data analysis and machine learning. Although they share some similarities, they are designed to solve different problems and rely on distinct mathematical assumptions. In this article, we will explore their principles, applications, and key differences.
What is Linear Regression?
Linear regression is a statistical method used to model the relationship between a dependent (or target) variable and one or more independent (or explanatory) variables. It is based on the assumption that this relationship is linear, meaning it can be represented by a straight line in a two-dimensional space (or a hyperplane in higher dimensions).
- It outputs a continuous value.
- Example: A temperature, an age, a height, a duration, a price, or a salary.
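To make this concrete, here is a minimal sketch using scikit-learn; the house sizes and prices below are invented purely for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: house size in m² (explanatory variable) and price in euros (target)
X = np.array([[50], [80], [100], [120], [150]])
y = np.array([150_000, 240_000, 310_000, 360_000, 450_000])

model = LinearRegression()
model.fit(X, y)

# The output is a continuous value: a predicted price for a 90 m² house
print(model.predict([[90]]))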
What is Logistic Regression?
Logistic regression, despite its name, is a method adapted to classification problems. It is used to predict a categorical (often binary) dependent variable based on one or more independent variables.
Unlike linear regression, which outputs a continuous value, logistic regression predicts probabilities of belonging to a category.
- It outputs a probability (between 0 and 1), which is then converted into a categorical value.
- Example: For an email classification problem as “spam” or “non-spam,” logistic regression could output a probability of 0.82. This means there is an 82% chance the email is spam. If the threshold is set at 0.5, the email would be classified as “spam.”
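As a rough sketch of this spam example (the single feature and the numbers are invented; a real spam filter would use many text features):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: one feature (number of suspicious keywords) and a label (1 = spam, 0 = non-spam)
X = np.array([[0], [1], [2], [5], [7], [9]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# predict_proba returns a probability between 0 and 1 for each class
proba_spam = clf.predict_proba([[6]])[0, 1]
print(proba_spam)
print("spam" if proba_spam > 0.5 else "non-spam")  # 0.5 threshold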
Target Variable
The target variable is what you want to predict or explain in your model.
- For linear regression:
It is a prediction problem, so the target variable is the output or result.
It is influenced by the explanatory (or independent) variables, such as the size or location of a house.
- For logistic regression:
It is a classification problem, so the target variable is a “yes” or “no” (e.g., whether a transaction is fraudulent or not).
The model predicts a probability, which is then converted into a class (e.g., fraudulent if probability > 50%).
Coefficient
In regression (linear or logistic), a coefficient represents the effect that an explanatory (independent) variable has on the target (dependent) variable. It is a parameter that the model learns during training.
The coefficients are not “manually chosen” by the user. They are automatically calculated by the learning model using a mathematical optimization process based on the data.
However, the user can influence this process indirectly by:
- selecting explanatory variables.
- choosing the regularization method (ridge, lasso, elastic net, as discussed in the previous articles).
In linear regression:
- The coefficients are calculated to minimize the sum of the squared errors (error = actual y minus predicted y).
- The coefficient indicates how a one-unit change in the explanatory variable affects the target variable.
Example: If the coefficient for the size of a house is 2000, it means that an increase of 1 m² in area leads to an increase of €2000 in the price of the house.
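A minimal sketch of reading this coefficient from a fitted model; the toy data below are built so that price = 2000 × size, purely for illustration:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: house size in m² and price in euros (price = 2000 * size)
X = np.array([[50], [80], [100], [120]])
y = np.array([100_000, 160_000, 200_000, 240_000])

model = LinearRegression().fit(X, y)

# coef_[0] is the learned coefficient: the price change in euros per additional m²
print(model.coef_[0])      # 2000.0
print(model.intercept_)    # ≈ 0 for this toy data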
In logistic regression:
- The coefficients are calculated by maximizing the log-likelihood so that the predicted probabilities best match the observed classes, using optimization techniques (gradient descent, etc.).
- The coefficients influence the log-odds (the logarithm of the odds) of belonging to a class.
Example: If the coefficient of a variable is 1.51, a one-unit increase in this variable multiplies the odds of belonging to the class by e^1.51 ≈ 4.5.
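A quick sketch of turning a fitted logistic coefficient into this multiplicative effect on the odds (the data are invented):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented binary data with a single explanatory variable
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

coef = clf.coef_[0, 0]
# exp(coef) is the factor by which the odds are multiplied
# for a one-unit increase in the explanatory variable
print(coef, np.exp(coef))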
The coefficients are therefore indicators of the strength and direction of the relationship between each explanatory variable and the target.
Handling multiclass problems:
Linear regression is not designed for classification, so it does not naturally handle multiclass problems.
Logistic regression can be extended to handle multiclass problems through two techniques: multinomial logistic regression and One-vs-Rest (OvR).
1) Multinomial Logistic Regression:
- This method trains a single model for all classes.
- The model directly predicts the probabilities of each class simultaneously using a softmax function.
- The probabilities are calculated for all classes at the same time, and the sum of the probabilities is always equal to 1.
Example: You want to classify a fruit into three categories: Apple, Orange, and Banana.
The model directly predicts:
- P(Apple | x) = 0.6
- P(Orange | x) = 0.3
- P(Banana | x) = 0.1
The fruit will be classified as Apple (maximum probability).
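Here is a small sketch of the softmax step itself, with made-up scores for the three classes:

import numpy as np

# Invented raw scores (logits) produced by a multinomial model for one fruit
scores = np.array([1.5, 0.8, -0.3])  # Apple, Orange, Banana

# Softmax turns the scores into probabilities that sum to 1
probs = np.exp(scores) / np.exp(scores).sum()
print(probs)         # roughly [0.6, 0.3, 0.1]
print(probs.sum())   # 1.0
print(["Apple", "Orange", "Banana"][int(np.argmax(probs))])  # Apple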
2) OvR (One-vs-Rest):
In this approach, each class gets its own binary model that treats it as positive and all other classes as negative, and the class whose model outputs the highest probability is selected. For example, to classify an animal as Cat, Dog, or Bird, the model for Cat treats Cat as positive and all other classes as negative.
- This method trains a binary model for each class, focusing on a specific task. The predictions of these models are then combined to determine the final class.
- Each model focuses on one specific class and treats all other classes as a single “non-class” category.
- At the end, the probabilities from all models are compared, and the class with the highest probability is selected.
Example:
You want to classify a fruit into three classes/categories: Apple, Orange, and Banana.
- Model 1: Is it an Apple (yes or no)? Predicted probability P = 0.2
- Model 2: Is it an Orange (yes or no)? Predicted probability P = 0.7
- Model 3: Is it a Banana (yes or no)? Predicted probability P = 0.1
The fruit will be classified as Orange (the highest probability among the models).
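A minimal sketch of OvR with scikit-learn's OneVsRestClassifier; the fruit features (weight and diameter) are invented:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Invented features: weight in grams and diameter in cm, with fruit labels
X = np.array([[150, 7], [160, 8], [120, 6], [130, 7], [180, 4], [170, 3]])
y = np.array(["Apple", "Apple", "Orange", "Orange", "Banana", "Banana"])

# One binary logistic model is trained per class
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# The class whose model gives the highest probability wins
print(ovr.predict_proba([[140, 7]]))
print(ovr.predict([[140, 7]]))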
Limitations and Considerations of Linear and Logistic Regressions
1) Assumptions
Assumptions are underlying conditions that each regression model considers to be true. These assumptions ensure that the results produced by the model are reliable and interpretable. If these assumptions are violated, predictions or classifications may be inaccurate.
Assumptions of Linear Regression:
- Assumes a linear relationship: the relationship between the target variable (y) and the explanatory variables (x) is linear.
- Assumes errors (residuals) follow a normal distribution: the differences between actual and predicted values are normally distributed.
- Assumes homoscedasticity (constant variance of errors): the variance of the residuals (differences between predictions and actual values) remains constant across all values of x.
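One informal way to check these assumptions is to inspect the residuals of a fitted model; here is a rough sketch on synthetic data:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data; in practice, use your own X and y
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Residuals should be roughly centered on 0, and their spread
# should not grow with x (homoscedasticity)
print(residuals.mean())
low, high = residuals[X[:, 0] < 5], residuals[X[:, 0] >= 5]
print(low.std(), high.std())  # similar values suggest constant variance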
Assumptions of Logistic Regression:
- Assumes that the data are independent of each other.
- Assumes the absence of multicollinearity.
- Is sensitive to imbalanced classes (a practical limitation rather than a formal assumption): if one class strongly dominates, the model may largely ignore the minority class.
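For the class-imbalance point in particular, one common mitigation in scikit-learn is the class_weight parameter; a sketch on synthetic data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, heavily imbalanced data: 95 negatives, 5 positives
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(95, 2)), rng.normal(3, 1, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)

# class_weight="balanced" reweights examples so the minority class is not ignored
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print(clf.predict(rng.normal(3, 1, size=(3, 2))))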
2) Sensitivity to Outliers:
Both methods are sensitive to outliers, but the impact is more pronounced in linear regression.
In linear regression:
- Example: A house with an abnormally high price (for example, a luxury villa in an average neighborhood) can distort the slope of the regression line, as the least squares method will try to minimize the error for this point.
- Impact: The model will try to minimize the error for this very expensive house by adjusting the slope and the intercept of the line. This can make the predictions for all “normal” houses less accurate because the line will be disproportionately influenced by this outlier.
Outliers bias the model coefficients and reduce its ability to generalize correctly to other data. This is why it is often necessary to detect and handle outliers (e.g., by removing or transforming them).
In logistic regression:
- Example: An observation that the model would place with a probability very close to 0 or 1, but whose label is erroneous, can distort the model’s fit. Suppose we are trying to predict whether a transaction is fraudulent (binary); if a transaction has extreme values (e.g., an unusual amount of 10 million euros for an ordinary customer), it can strongly influence the estimated coefficients.
- Impact: The model may adjust its probabilities for this single transaction at the expense of other transactions, thereby distorting the overall predictions.
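A simple, commonly used way to flag such points before fitting is an interquartile-range (IQR) rule; a sketch with invented house prices:

import numpy as np

# Invented house prices, with one luxury villa acting as an outlier
prices = np.array([180_000, 210_000, 195_000, 220_000, 205_000, 2_500_000])

q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside [lower, upper] are candidate outliers to inspect, remove, or transform
print(prices[(prices < lower) | (prices > upper)])  # [2500000]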
3) Multicollinearity:
Multicollinearity occurs when two or more explanatory (independent) variables in a model are strongly correlated with each other. This makes it difficult for the model to isolate the individual effect of each variable on the target variable, thus disturbing the results of both regression and logistic models.
- For logistic regression, it reduces the model’s ability to predict and correctly distinguish the classes.
- For linear regression, it decreases the accuracy of the estimates and obscures the interpretation of the coefficients.
Example:
Imagine you are trying to predict the price of a house. Two of your explanatory variables are:
- The living area,
- The total lot size.
These two variables are often strongly correlated, as a large lot size generally implies a large living area. Multicollinearity complicates the interpretation of coefficients because the model has difficulty determining which of the two variables is truly responsible for the price increase.
- Unstable coefficients: When explanatory variables are correlated, the model cannot distinguish which variable is responsible for the effect on the target variable. For example, for the variables “living area” and “total lot size,” which are strongly correlated, the model could assign a high weight to “living area” and a low weight to “total lot size.” The coefficients lose their reliability, making interpretation difficult.
- Difficult interpretation: It is challenging to understand the individual effect of each variable on the target.
- Biased predictions: Prediction accuracy may decrease if multicollinearity is very strong.
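A quick diagnostic is to look at the correlation between explanatory variables (variance inflation factors are a more complete option); a sketch with invented, deliberately correlated data:

import numpy as np

# Invented data: living area (m²) and lot size (m²), deliberately correlated
rng = np.random.default_rng(2)
living_area = rng.uniform(50, 200, size=100)
lot_size = 2.5 * living_area + rng.normal(0, 20, size=100)

# A correlation close to 1 (or -1) signals potential multicollinearity
print(np.corrcoef(living_area, lot_size)[0, 1])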
Conclusion
Linear and logistic regressions are powerful tools for modeling data and solving different types of problems. While linear regression excels at predicting continuous values, logistic regression is essential for classification tasks. Mastering these techniques is crucial for anyone wishing to explore data analysis, statistics, or machine learning.
If you are a beginner, start with simple cases (such as univariate linear regression, which models a linear relationship between a single explanatory variable and a single target variable), then progress to more complex models. By experimenting with these approaches, you will be better equipped to tackle a variety of problems in fields ranging from finance to healthcare.
Sirine Amrane