Model Evaluation Metrics: A Summary
machine learning
supervised learning
evaluation
Key Concepts: TP, FP, FN, TN
For a given class (say “Road”):
- True positives (TP): pixels predicted as positive that are actually positive
- False positives (FP): pixels predicted as positive that are actually negative
- True negatives (TN): pixels predicted as negative that are actually negative
- False negatives (FN): pixels predicted as negative that are actually positive
| Term | Meaning | Example (for class = Road) |
|---|---|---|
| TP (True Positive) | Pixel is truly Road, predicted Road | Road pixel labeled Road ✅ |
| FP (False Positive) | Pixel is not Road, but predicted Road | Building pixel labeled Road ❌ |
| FN (False Negative) | Pixel is Road, but predicted something else | Road pixel labeled Vegetation ❌ |
| TN (True Negative) | Pixel is not Road, predicted not Road | Vegetation pixel labeled Vegetation ✅ |
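A minimal sketch of how these counts are obtained in practice, assuming NumPy and two small, made-up label masks (0 = background, 1 = road, 2 = building):

```python
import numpy as np

y_true = np.array([[0, 0, 1],
                   [1, 1, 2],
                   [2, 2, 2]])   # ground-truth mask
y_pred = np.array([[0, 1, 1],
                   [1, 0, 2],
                   [2, 2, 1]])   # predicted mask

cls = 1  # evaluate the "road" class
tp = np.sum((y_pred == cls) & (y_true == cls))   # predicted road, truly road
fp = np.sum((y_pred == cls) & (y_true != cls))   # predicted road, not road
fn = np.sum((y_pred != cls) & (y_true == cls))   # missed road pixels
tn = np.sum((y_pred != cls) & (y_true != cls))   # correctly left as non-road
print(tp, fp, fn, tn)
```

All of the metrics below are computed from these four counts.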
Metrics Explained (with intuition)
Accuracy
Meaning: Fraction of correctly classified pixels over all pixels.
Formula: Accuracy= (TP+TN)/(TP+TN+FP+FN)
Interpretation:
- How many pixels the model got right overall.
- Can be misleading if some classes dominate (e.g. background pixels).
Layman Example: If 95% of your image is background and the model always predicts “background,” accuracy will be 95%, even if it misses all buildings or roads.
| Measures | Best When | Sensitive To |
|---|---|---|
| Overall correctness | Balanced datasets | Dominant background |
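The pitfall from the example above, in a few lines of Python with illustrative counts (95 background pixels, 5 road pixels, a model that always predicts background):

```python
tp, tn = 0, 95   # never predicts road, but gets every background pixel right
fp, fn = 0, 5    # misses every road pixel
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.95, despite detecting no roads at all
```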
Precision
Meaning: Of all pixels predicted as this class, how many are actually correct? Precision penalizes false positives.
Formula: Precision=TP/(TP+FP)
Interpretation:
- High precision means few false alarms.
- The model predicts a class only when it is confident.
Layman Example: If you predict 100 “road” pixels and 90 are truly roads, precision = 0.9.
| Measures | Best When | Sensitive To |
|---|---|---|
| How precise predictions are | You want few false positives (false alarms) | False Positives (FP) |
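The same arithmetic as the example above (100 pixels predicted as road, 90 of them truly road):

```python
tp, fp = 90, 10
precision = tp / (tp + fp)
print(precision)  # 0.9
```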
Recall (Sensitivity)
Meaning: Of all true pixels of this class, how many did the model correctly find? Recall penalizes false negatives.
Formula: Recall=TP/(TP+FN)
Interpretation:
- High recall means most real objects are detected.
- May include more incorrect predictions.
Layman Example: If there are 100 true “road” pixels and you correctly predict 90 of them, recall = 0.9.
| Measures | Best When | Sensitive To |
|---|---|---|
| How complete predictions are | You want full detection | False Negatives (FN) |
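And the mirror-image calculation for the recall example above (100 true road pixels, 90 of them found):

```python
tp, fn = 90, 10
recall = tp / (tp + fn)
print(recall)  # 0.9
```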
F1 Score
Meaning: The harmonic mean of precision and recall; a balanced measure combining both.
Formula: F1 = 2×(Precision×Recall)/(Precision+Recall)
Interpretation:
- High only when both precision and recall are high.
- Drops sharply if either precision or recall is low.
- Useful as a single overall metric when you need both correctness and completeness.
Layman Example:
- Precision = 0.9, Recall = 0.9 → F1 = 0.9
- Precision = 0.9, Recall = 0.5 → F1 = 0.64 — big drop.
| Measures | Best When | Sensitive To |
|---|---|---|
| Balance of precision/recall | General performance | Both FP + FN |
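A short check of the two precision/recall pairs from the example above:

```python
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.9), 2))  # 0.9
print(round(f1(0.9, 0.5), 2))  # 0.64, the sharp drop described above
```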
IoU (Intersection over Union)
Meaning: How much overlap is there between predicted and actual pixels for a class?
Formula: IoU = TP/(TP+FP+FN)
Interpretation:
- Compares overlap between predicted mask and ground truth mask.
- IoU is the overlap area ÷ combined area.
- Higher IoU → better overlap → more accurate mask.
Layman Example: If your model’s “road” mask overlaps 80% with the true “road” mask, IoU = 0.8.
| Measures | Best When | Sensitive To |
|---|---|---|
| Overlap area | General segmentation quality assessment | Missing overlap (FP and FN) |
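IoU from the same per-class counts, with illustrative numbers chosen to match the 0.8 example above:

```python
tp, fp, fn = 80, 10, 10
iou = tp / (tp + fp + fn)
print(iou)  # 0.8
```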
mIoU (Mean Intersection over Union)
Meaning: The average IoU across all classes.
Formula: mIoU = (1/N) × Σ IoU_i, where N is the number of classes and IoU_i is the IoU for class i
Interpretation:
- Compute IoU separately for each class (road, building, vegetation, etc.).
- Take the average across all classes.
- Standard benchmark metric in semantic segmentation.
Layman Example: If IoU for road = 0.8, vegetation = 0.7, building = 0.6 → mIoU = 0.7.
| Measures | Best When | Sensitive To |
|---|---|---|
| Average overlap across classes | Multi-class balance | Class imbalance |
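Averaging the per-class IoUs from the example above:

```python
ious = {"road": 0.8, "vegetation": 0.7, "building": 0.6}
miou = sum(ious.values()) / len(ious)
print(round(miou, 2))  # 0.7
```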
Dice Coefficient (a.k.a. F1 Score for segmentation)
Meaning: Measures how well the predicted mask overlaps the true mask, giving more weight to overlap than IoU.
Formula: Dice=2TP/(2TP+FP+FN)
Interpretation:
- Equivalent to the F1 score at the pixel level.
- Counts the overlap (TP) twice, so it penalizes small errors less harshly than IoU.
- Especially effective for small or imbalanced objects.
Layman Example:
- If predicted and true masks mostly overlap, Dice is close to 1.
- If they barely overlap, Dice approaches 0.
| Measures | Best When | Sensitive To |
|---|---|---|
| Weighted overlap | Small object detection | Small-class errors |
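Comparing Dice and IoU on the same counts makes the difference in weighting visible: because Dice counts the overlap (TP) twice, it is always at least as high as IoU for the same prediction.

```python
tp, fp, fn = 80, 10, 10
iou = tp / (tp + fp + fn)            # 0.80
dice = 2 * tp / (2 * tp + fp + fn)   # ~0.89
print(round(iou, 2), round(dice, 2))
```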
A Deeper Look at Metrics
| Measure / Plot | Guidance | Explanation |
|---|---|---|
| Discrimination | | |
| AUROC | ✅ | Quantifies discrimination, which is a key component of statistical model performance. |
| AUPRC, PAUROC | ❌ | These measures attempt to move beyond a statistical assessment, but violate decision-analytic principles. |
| ROC curve, PR curve | - | These plots provide limited additional information over AUROC. |
| Calibration | | |
| O:E ratio | - | An interpretable measure, but only a partial assessment of calibration; note that for internal validation, the O:E ratio is often (close to) 1. |
| Calibration intercept, calibration slope | - | These measures are harder to interpret and provide only a partial assessment of calibration; for internal validation, the calibration slope can be used as an indication of overfitting. |
| ECI, ICI, ECE | - | These measures summarize the smoothed (or grouped in case of ECE) calibration plot, concealing the nature and direction of miscalibration. |
| Calibration plot/reliability diagram | ✅ | This is by far the most insightful approach to assess calibration, in particular when smoothing rather than grouping is used; for internal validation, a plot is preferred but merely reporting the calibration slope is acceptable; for external validation a calibration plot is strongly recommended, with indications of uncertainty, e.g. by 95% confidence intervals. |
| Overall | | |
| Loglikelihood, Brier, R2 measures | - | These proper measures are fine, yet it makes sense to conduct a separate evaluation of discrimination and calibration. Such measures are more convenient when comparing models, which was not the key focus of this work. |
| Discrimination slope, MAPE | ❌ | These measures are improper, which means that incorrect models can have better values for these measures than the correct model. |
| Risk distribution plots | ✅ | Displaying the distribution of the risk estimates for each outcome category provides valuable insights into a model’s behavior. |
| Classification | | |
| Classification accuracy, balanced accuracy, Youden index, DOR, kappa, F1, MCC | ❌ | These measures are improper at clinically relevant decision thresholds; in addition, some measures are hard to interpret. |
| Sensitivity/recall and specificity | - | While improper on their own, they can be reported descriptively if reported together. However, they are largely theoretical measures, as they condition on the outcome being predicted. |
| PPV/precision and NPV | - | While improper on their own, they can be reported descriptively if reported together. PPV and NPV are more practical measures because they condition on the classification. |
| Classification plots | - | Classification plots could be presented descriptively, showing either sensitivity and specificity or PPV and NPV by threshold. |
| Clinical Utility | | |
| NB or standardized NB (with a decision curve), EC (with a cost curve) | ✅ | Important measures to quantify to what extent better decisions are made. Decision curves of NB allow one to show potential clinical utility at various clinically relevant decision thresholds relative to default decisions (and competing models). |
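A minimal sketch of net benefit (NB) at a decision threshold, the quantity traced by a decision curve. The risk scores and outcomes below are made up for illustration; in practice you would sweep the clinically relevant threshold range.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])          # hypothetical outcomes
risk   = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6,
                   0.3, 0.8, 0.2, 0.5])                      # hypothetical predicted risks

def net_benefit(y, p, threshold):
    n = len(y)
    treat = p >= threshold                   # classify as positive at this threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    w = threshold / (1 - threshold)          # harm-to-benefit weight implied by the threshold
    return tp / n - (fp / n) * w

# Evaluate over a few thresholds to trace a (crude) decision curve.
for t in (0.1, 0.3, 0.5):
    print(t, round(net_benefit(y_true, risk, t), 3))
```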