auc-roc, log loss, f1-score, mcc in classification: which performance metric should you choose?

Sirine Amrane
3 min read · Feb 1, 2025


binary and multi-class classification are at the heart of many real-world applications, from fraud detection to medical diagnosis and cybersecurity. choosing the right evaluation metric can make a huge difference in how you interpret your model’s performance.

some metrics focus on probability calibration, while others handle imbalanced datasets better. the right choice depends on the problem at hand, and we’ll break it down step by step before summarizing everything in a table.

1. log loss (cross-entropy loss or logarithmic loss)

log loss measures how confident a classification model is in its probabilistic predictions. it calculates the gap between predicted probabilities and actual outcomes. unlike accuracy, which ignores confidence levels, log loss heavily penalizes incorrect predictions when the model is overly confident.

goal:

  • the lower the log loss, the better: it means the model assigns high probability to the correct class and is rarely confident when it is wrong.

when to use it?

  • during model training to optimize probability outputs.

when not to use it?

  • for reporting or comparing final models: a raw log-loss value is hard to interpret on its own and says little about the decision threshold you will actually deploy.
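to make the overconfidence penalty concrete, here’s a minimal sketch with scikit-learn; the labels and probabilities are toy values invented for illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1, 0])

# confident and correct everywhere
p_calibrated = np.array([0.90, 0.10, 0.80, 0.85, 0.20])

# identical except for one confident mistake on the last sample (true class 0)
p_overconfident = np.array([0.90, 0.10, 0.80, 0.85, 0.95])

print(log_loss(y_true, p_calibrated))     # low: high probability on the true class
print(log_loss(y_true, p_overconfident))  # much higher: -log(0.05) dominates the mean

# log loss is the mean negative log-likelihood of the true class:
# -(1/N) * sum(y * log(p) + (1 - y) * log(1 - p))
manual = -np.mean(y_true * np.log(p_calibrated) + (1 - y_true) * np.log(1 - p_calibrated))
print(manual)  # matches sklearn's value
```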

2. roc-auc (area under the receiver operating characteristic curve)

roc-auc measures how well a model separates positive and negative classes. the roc curve plots the true positive rate (tpr) against the false positive rate (fpr) across different decision thresholds.

goal:

  • auc = 1 → perfect model, completely separates the classes
  • auc = 0.5 → random model
  • auc < 0.5 → worse than random classification

when to use it?

  • moderately imbalanced datasets
  • when you need to find an optimal decision threshold (e.g., 0.5, 0.7…)
  • when you want to compare multiple probability-based models without worrying about a fixed threshold

when not to use it?

  • highly imbalanced datasets. roc-auc can be misleading, so pr-auc (precision-recall auc) is often a better choice.
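here’s a minimal sketch, again with scikit-learn and made-up scores: auc depends only on how the model ranks positives above negatives, and roc_curve exposes the tpr/fpr trade-off you’d inspect to pick a threshold:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.90, 0.55, 0.70])

# threshold-free: auc measures ranking quality, not a single decision point
print(roc_auc_score(y_true, scores))

# roc_curve enumerates the tpr/fpr trade-off at each candidate threshold,
# which is what you inspect when choosing an operating point
fpr, tpr, thresholds = roc_curve(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  fpr={f:.2f}  tpr={t:.2f}")
```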

3. f1-score

f1-score is the harmonic mean of precision and recall. it’s useful when both false positives and false negatives are costly, like in fraud detection or spam filtering.

goal:

  • f1 = 1 → perfect balance between precision and recall
  • f1 = 0 → no correct positive predictions (precision or recall is zero)

when to use it?

  • when the dataset is moderately imbalanced
  • when both precision and recall matter equally
  • when false positives and false negatives are equally problematic

when not to use it?

  • when probability-based evaluation is needed (e.g., if you want to adjust a decision threshold).
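a minimal sketch of the harmonic-mean relationship, with toy hard labels (note that f1 is computed on thresholded predictions, not probabilities):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hard labels: one false negative, one false positive

p = precision_score(y_true, y_pred)  # tp / (tp + fp) = 3 / 4
r = recall_score(y_true, y_pred)     # tp / (tp + fn) = 3 / 4

# the harmonic mean drags f1 toward the weaker of the two scores
print(2 * p * r / (p + r))
print(f1_score(y_true, y_pred))  # same value, 0.75
```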

4. matthews correlation coefficient (MCC)

mcc is a lesser-known but powerful metric, especially for imbalanced datasets. it considers all elements of the confusion matrix (tp, tn, fp, fn) and gives a more balanced evaluation.

goal:

  • mcc = 1 → perfect classification across all four confusion-matrix cells (tp, tn, fp, fn)
  • mcc = 0 → no better than random guessing
  • mcc < 0 → worse than random classification

when to use it?

  • when the dataset is extremely imbalanced and accuracy alone is misleading

when not to use it?

  • when the dataset is well-balanced and accuracy is already a good enough measure
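a minimal sketch of why mcc catches what accuracy misses, on a made-up, heavily imbalanced dataset:

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

# 95 negatives, 5 positives: a trivial model that always predicts the majority class
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))     # 0.95, looks impressive
print(matthews_corrcoef(y_true, y_pred))  # 0.0: the model has no predictive power

# mcc uses all four confusion-matrix cells:
# (tp*tn - fp*fn) / sqrt((tp+fp)(tp+fn)(tn+fp)(tn+fn))
```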

recap

so, which one should you use?

| metric | use it for | avoid it when |
| --- | --- | --- |
| log loss | optimizing probability outputs during training | you only need hard labels, or you want an interpretable final score |
| roc-auc | threshold-free comparison of probability-based models; moderately imbalanced data | the dataset is highly imbalanced (prefer pr-auc) |
| f1-score | moderately imbalanced data where false positives and false negatives cost the same | you need probability-based evaluation or threshold tuning |
| mcc | extremely imbalanced data where accuracy is misleading | the dataset is well-balanced and accuracy already suffices |

the right metric depends on what you’re optimizing for. if your model outputs probabilities, log loss and roc-auc are helpful. if you’re dealing with an imbalanced dataset, f1-score or mcc will give you a better picture.

choosing the wrong metric can make your model look better than it actually is. if you’re working on fraud detection, medical diagnostics, or cybersecurity, understanding these differences is essential. no single metric is perfect for every situation, so pick the one that aligns with your goals.

hope this helps make things clearer!

Sirine Amrane
