From AUC-ROC to Optimal Threshold Selection

Sirine Amrane
5 min read · Jan 28, 2025


AUC-ROC is a critical metric for evaluating the performance of binary (and sometimes multiclass) classifiers, particularly in domains with imbalanced datasets. In this article, we’ll explore how to interpret AUC-ROC and how to move from it to an optimal classification threshold, with practical strategies.

1) Why AUC-ROC matters

AUC-ROC evaluates a model’s ability to distinguish between positive and negative classes across all possible decision thresholds. The ROC curve plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at each threshold:

- TPR = TP / (TP + FN)

- FPR = FP / (FP + TN)
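
To make the formulas concrete, here is a minimal sketch (the labels and predictions are made up for illustration) that derives TPR and FPR from a confusion matrix with scikit-learn:

import numpy as np
from sklearn.metrics import confusion_matrix

# Toy ground truth and predictions (illustrative values only)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
tpr = tp / (tp + fn)  # 3 / (3 + 1) = 0.75
fpr = fp / (fp + tn)  # 2 / (2 + 4) ≈ 0.33
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")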

AUC represents the area under the ROC curve. It ranges as follows:

  • 1.0: A perfect model.
  • 0.5: A random model (no ability to discriminate between classes).
  • < 0.5: An inverted model (worse than random guessing).

2) How to compute and interpret AUC-ROC
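
Here is a minimal sketch of how to compute and plot the curve with scikit-learn; the simulated labels and scores are assumptions, and any classifier’s predict_proba output would slot in the same way:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated ground truth and predicted scores (illustrative only)
np.random.seed(42)
y_true = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])  # Imbalanced labels
y_scores = np.clip(np.random.normal(loc=0.2 + 0.4 * y_true, scale=0.2), 0, 1)

# TPR/FPR pairs across all thresholds, plus the area under the curve
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
auc = roc_auc_score(y_true, y_scores)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC curve (AUC = {auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--', label='Random classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.grid(alpha=0.3)
plt.show()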

Output: the ROC curve

Analysis:

The ROC curve is plotted with the gray line representing a random classifier (AUC = 0.5).
The calculated AUC score is displayed in the legend to validate the model’s performance.

3) The limitations of AUC-ROC and why we need a threshold

The classification threshold determines the probability above which a model considers an instance as belonging to the positive class. By default, many models use a threshold of 0.5, but this choice is not always optimal, especially for imbalanced datasets or in contexts where the cost of errors varies.

AUC-ROC measures how well a model separates classes across all thresholds, but it doesn’t tell us which threshold to use for classification. In real-world applications, we must choose a specific threshold to turn probabilities into decisions, as the sketch below illustrates.
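
As a minimal sketch of what this means in practice (the simulated probabilities below are an assumption; a real model’s predict_proba output would be used instead), the same scores yield very different trade-offs depending on the cutoff:

import numpy as np
from sklearn.metrics import precision_score, recall_score

np.random.seed(0)
y_true = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])  # Imbalanced labels
y_proba = np.clip(np.random.normal(loc=0.2 + 0.4 * y_true, scale=0.2), 0, 1)

# Turning probabilities into decisions requires picking a cutoff
for threshold in [0.3, 0.5, 0.7]:
    y_pred = (y_proba >= threshold).astype(int)
    p = precision_score(y_true, y_pred, zero_division=0)
    r = recall_score(y_true, y_pred, zero_division=0)
    print(f"threshold={threshold:.1f} -> precision={p:.2f}, recall={r:.2f}")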

How do we find the optimal threshold?

Since 0.5 isn’t always optimal, we need to adjust the threshold using metrics like the F1-score, the G-Mean, and cost-based optimization to align with real-world needs.

The optimal threshold depends on the problem’s context and the priorities between:
a) True Positives (TP)
b) False Positives (FP)
c) False Negatives (FN).

Optimal results to seek:

  • Maximize the F1-score (a trade-off between precision and recall).
  • Maximize the G-Mean (the geometric mean of the true positive rate and true negative rate, √(TPR × TNR)).
  • Minimize the total cost (weighting the critical errors, false negatives and false positives, by their real-world cost).

4) Finding the optimal classification threshold

Steps of the approach: a dense grid of thresholds evaluated with all the metrics

  • Generate a dense grid of thresholds: 1000 values between 0 and 1
  • Calculate metrics for each threshold: F1-Score, Precision, Recall
  • Calculate the G-Mean (geometric mean between TPR (sensitivity) and TNR (specificity))
  • Calculate the total cost of errors (weighting FN and FP with real costs)
  • Identify the optimal threshold for each criterion

Example of complete multi-metric automation with a dense grid of thresholds:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

# Simulated example data: ground truth and predicted probabilities
np.random.seed(42)
y_true = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])  # Imbalanced data
y_pred_proba = np.random.rand(1000)  # Random probabilities between 0 and 1

# Dense grid of thresholds
dense_thresholds = np.linspace(0, 1, 1000)

# Score containers
f1_scores = []
precision_scores = []
recall_scores = []
g_means = []
costs = []

# Error costs (e.g., finance, medicine)
cost_fn = 500  # Cost of a false negative
cost_fp = 5    # Cost of a false positive

# Compute the metrics for each threshold
for threshold in dense_thresholds:
    y_pred = (y_pred_proba >= threshold).astype(int)

    # Standard metrics
    f1 = f1_score(y_true, y_pred, zero_division=1)
    precision = precision_score(y_true, y_pred, zero_division=1)
    recall = recall_score(y_true, y_pred, zero_division=1)

    # Confusion-matrix counts for the G-mean and the cost
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0  # Sensitivity
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0  # Specificity

    g_mean = np.sqrt(tpr * tnr) if tpr * tnr > 0 else 0

    # Total cost of the errors
    total_cost = (fn * cost_fn) + (fp * cost_fp)

    # Store the values
    f1_scores.append(f1)
    precision_scores.append(precision)
    recall_scores.append(recall)
    g_means.append(g_mean)
    costs.append(total_cost)

# Select the optimal threshold for each criterion
best_f1_threshold = dense_thresholds[np.argmax(f1_scores)]
best_gmean_threshold = dense_thresholds[np.argmax(g_means)]
best_cost_threshold = dense_thresholds[np.argmin(costs)]

# Plot the metrics as a function of the threshold
plt.figure(figsize=(10, 6))
plt.plot(dense_thresholds, f1_scores, label='F1-score', color='orange')
plt.plot(dense_thresholds, precision_scores, label='Precision', color='blue')
plt.plot(dense_thresholds, recall_scores, label='Recall', color='red')
plt.plot(dense_thresholds, g_means, label='G-mean', color='green')

# Mark the optimal thresholds
plt.axvline(x=best_f1_threshold, color='orange', linestyle='--', label=f'Optimal F1-score = {best_f1_threshold:.2f}')
plt.axvline(x=best_gmean_threshold, color='green', linestyle='--', label=f'Optimal G-mean = {best_gmean_threshold:.2f}')
plt.axvline(x=best_cost_threshold, color='purple', linestyle='--', label=f'Optimal Cost = {best_cost_threshold:.2f}')

plt.xlabel('Threshold')
plt.ylabel('Score')
plt.title('Optimal Threshold Selection')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Print the optimal thresholds
print(f"Best F1 threshold:     {best_f1_threshold:.2f}")
print(f"Best G-mean threshold: {best_gmean_threshold:.2f}")
print(f"Best cost threshold:   {best_cost_threshold:.2f}")

Output: metric curves (F1-score, Precision, Recall, G-mean) as a function of the threshold, with the optimal thresholds marked by dashed lines

Key takeaways from the results:

=> If you’re looking for a balanced approach, take 0.56 (F1-score).

=> If your goal is to minimize critical errors, take 0.01 (cost-based).
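
To close the loop, here is a minimal sketch of how a chosen threshold replaces the default 0.5 cutoff at prediction time; the dataset, the logistic regression model, and the 0.56 value are illustrative assumptions, not part of the pipeline above:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data and model (assumptions for the sketch)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

chosen_threshold = 0.56  # e.g., the F1-optimal threshold found on a validation set
proba = model.predict_proba(X_test)[:, 1]
y_pred = (proba >= chosen_threshold).astype(int)  # replaces the default 0.5 cutoff of model.predict()
print(y_pred[:10])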

Conclusion

AUC-ROC is essential for evaluating a model’s ability to distinguish between classes, but real-world applications require selecting an optimal threshold. Metrics like F1-score balance precision and recall, G-Mean handles imbalanced datasets, and cost-based optimization minimizes critical errors.

By combining AUC-ROC with threshold optimization, you can align your model’s performance with specific goals, ensuring effective and practical decision-making in any application.
