Model Evaluation Metrics: A Summary
machine learning
supervised learning
evaluation
Key Concepts: TP, FP, FN, TN
For a given class (say “Road”):
- True positives (TP): pixels predicted as positive that are actually positive
- False positives (FP): pixels predicted as positive that are actually negative
- True negatives (TN): pixels predicted as negative that are actually negative
- False negatives (FN): pixels predicted as negative that are actually positive
| Term | Meaning | Example (for class = Road) |
|---|---|---|
| TP (True Positive) | Pixel is truly Road, predicted Road | Road pixel labeled Road ✅ |
| FP (False Positive) | Pixel is not Road, but predicted Road | Building pixel labeled Road ❌ |
| FN (False Negative) | Pixel is Road, but predicted something else | Road pixel labeled Vegetation ❌ |
| TN (True Negative) | Pixel is not Road, predicted not Road | Vegetation pixel labeled Vegetation ✅ |
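A minimal sketch of how these counts are obtained in practice, assuming NumPy and two small, made-up label masks (0 = background, 1 = road, 2 = building):

```python
import numpy as np

y_true = np.array([[0, 0, 1],
                   [1, 1, 2],
                   [2, 2, 2]])   # ground-truth mask
y_pred = np.array([[0, 1, 1],
                   [1, 0, 2],
                   [2, 2, 1]])   # predicted mask

cls = 1  # evaluate the "road" class
tp = np.sum((y_pred == cls) & (y_true == cls))   # predicted road, truly road
fp = np.sum((y_pred == cls) & (y_true != cls))   # predicted road, not road
fn = np.sum((y_pred != cls) & (y_true == cls))   # missed road pixels
tn = np.sum((y_pred != cls) & (y_true != cls))   # correctly left as non-road
print(tp, fp, fn, tn)
```

All of the metrics below are computed from these four counts.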
Metrics Explained (with intuition)
Accuracy
Meaning: Fraction of correctly classified pixels over all pixels.
Formula: Accuracy= (TP+TN)/(TP+TN+FP+FN)
Interpretation:
- How many pixels the model got right overall.
- Can be misleading if some classes dominate (e.g. background pixels).
Layman Example: If 95% of your image is background and the model always predicts “background,” accuracy will be 95%, even if it misses all buildings or roads.
| Measures | Best When | Sensitive To |
|---|---|---|
| Overall correctness | Balanced datasets | Dominant background |
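The pitfall from the example above, in a few lines of Python with illustrative counts (95 background pixels, 5 road pixels, a model that always predicts background):

```python
tp, tn = 0, 95   # never predicts road, but gets every background pixel right
fp, fn = 0, 5    # misses every road pixel
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.95, despite detecting no roads at all
```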
Precision
Meaning: Of all pixels predicted as this class, how many are actually correct? Precision penalizes false positives.
Formula: Precision=TP/(TP+FP)
Interpretation:
- High precision means few false alarms.
- The model predicts a class only when it is confident.
Layman Example: If you predict 100 “road” pixels and 90 are truly roads, precision = 0.9.
| Measures | Best When | Sensitive To |
|---|---|---|
| How precise predictions are | You want few false positives (false alarms) | False Positives (FP) |
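The same arithmetic as the example above (100 pixels predicted as road, 90 of them truly road):

```python
tp, fp = 90, 10
precision = tp / (tp + fp)
print(precision)  # 0.9
```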
Recall (Sensitivity)
Meaning: Of all true pixels of this class, how many did the model correctly find? Recall penalizes false negatives.
Formula: Recall=TP/(TP+FN)
Interpretation:
- High recall means most real objects are detected.
- May include more incorrect predictions.
Layman Example: If there are 100 true “road” pixels and you correctly predict 90 of them, recall = 0.9.
| Measures | Best When | Sensitive To |
|---|---|---|
| How complete predictions are | You want full detection | False Negatives (FN) |
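And the mirror-image calculation for the recall example above (100 true road pixels, 90 of them found):

```python
tp, fn = 90, 10
recall = tp / (tp + fn)
print(recall)  # 0.9
```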
F1 Score
Meaning: The harmonic mean of precision and recall; a balanced measure combining both.
Formula: F1 = 2×(Precision×Recall)/(Precision+Recall)
Interpretation:
- High only when both precision and recall are high.
- Drops sharply if either precision or recall is low.
- Useful as a single overall metric when you need both correctness and completeness.
Layman Example:
- Precision = 0.9, Recall = 0.9 → F1 = 0.9
- Precision = 0.9, Recall = 0.5 → F1 = 0.64 — big drop.
| Measures | Best When | Sensitive To |
|---|---|---|
| Balance of precision/recall | General performance | Both FP + FN |
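A short check of the two precision/recall pairs from the example above:

```python
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.9, 0.9), 2))  # 0.9
print(round(f1(0.9, 0.5), 2))  # 0.64, the sharp drop described above
```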
IoU (Intersection over Union)
Meaning: How much overlap is there between predicted and actual pixels for a class?
Formula: IoU = TP/(TP+FP+FN)
Interpretation:
- Compares overlap between predicted mask and ground truth mask.
- IoU is the overlap area ÷ combined area.
- Higher IoU → better overlap → more accurate mask.
Layman Example: If your model’s “road” mask overlaps 80% with the true “road” mask, IoU = 0.8.
| Measures | Best When | Sensitive To |
|---|---|---|
| Overlap area | General segmentation quality assessment | Missing overlap (FP and FN) |
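IoU from the same per-class counts, with illustrative numbers chosen to match the 0.8 example above:

```python
tp, fp, fn = 80, 10, 10
iou = tp / (tp + fp + fn)
print(iou)  # 0.8
```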
mIoU (Mean Intersection over Union)
Meaning: The average IoU across all classes.
Formula: mIoU = (1/N) × Σ IoU_i, where N is the number of classes and IoU_i is the IoU for class i
Interpretation:
- Compute IoU separately for each class (road, building, vegetation, etc.).
- Take the average across all classes.
- Standard benchmark metric in semantic segmentation.
Layman Example: If IoU for road = 0.8, vegetation = 0.7, building = 0.6 → mIoU = 0.7.
| Measures | Best When | Sensitive To |
|---|---|---|
| Average overlap across classes | Multi-class balance | Class imbalance |
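Averaging the per-class IoUs from the example above:

```python
ious = {"road": 0.8, "vegetation": 0.7, "building": 0.6}
miou = sum(ious.values()) / len(ious)
print(round(miou, 2))  # 0.7
```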
Dice Coefficient (a.k.a. F1 Score for segmentation)
Meaning: Measures how well the predicted mask overlaps the true mask, giving more weight to overlap than IoU.
Formula: Dice=2TP/(2TP+FP+FN)
Interpretation:
- Equivalent to the F1 score at the pixel level.
- Counts the overlap (TP) twice, so it penalizes small errors less harshly than IoU.
- Especially effective for small or imbalanced objects.
Layman Example:
- If predicted and true masks mostly overlap, Dice is close to 1.
- If they barely overlap, Dice approaches 0.
| Measures | Best When | Sensitive To |
|---|---|---|
| Weighted overlap | Small object detection | Small-class errors |
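Comparing Dice and IoU on the same counts makes the difference in weighting visible: because Dice counts the overlap (TP) twice, it is always at least as high as IoU for the same prediction.

```python
tp, fp, fn = 80, 10, 10
iou = tp / (tp + fp + fn)            # 0.80
dice = 2 * tp / (2 * tp + fp + fn)   # ~0.89
print(round(iou, 2), round(dice, 2))
```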
A Deeper Look at Metrics
| Measure / Plot | Guidance | Explanation |
|---|---|---|
| Discrimination | | |
| AUROC | ✅ | Quantifies discrimination, which is a key component of statistical model performance. |
| AUPRC, PAUROC | ❌ | These measures attempt to move beyond a statistical assessment, but violate decision-analytic principles. |
| ROC curve, PR curve | - | These plots provide limited additional information over AUROC. |
| Calibration | | |
| O:E ratio | - | An interpretable measure, but only a partial assessment of calibration; note that for internal validation, the O:E ratio is often (close to) 1. |
| Calibration intercept, calibration slope | - | These measures are harder to interpret and provide only a partial assessment of calibration; for internal validation, the calibration slope can be used as an indication of overfitting. |
| ECI, ICI, ECE | - | These measures summarize the smoothed (or grouped in case of ECE) calibration plot, concealing the nature and direction of miscalibration. |
| Calibration plot/reliability diagram | ✅ | This is by far the most insightful approach to assess calibration, in particular when smoothing rather than grouping is used; for internal validation, a plot is preferred but merely reporting the calibration slope is acceptable; for external validation a calibration plot is strongly recommended, with indications of uncertainty, e.g. by 95% confidence intervals. |
| Overall | | |
| Loglikelihood, Brier, R2 measures | - | These proper measures are fine, yet it makes sense to conduct a separate evaluation of discrimination and calibration. Such measures are more convenient when comparing models, which was not the key focus of this work. |
| Discrimination slope, MAPE | ❌ | These measures are improper, which means that incorrect models can have better values for these measures than the correct model. |
| Risk distribution plots | ✅ | Displaying the distribution of the risk estimates for each outcome category provides valuable insights into a model’s behavior. |
| Classification | | |
| Classification accuracy, balanced accuracy, Youden index, DOR, kappa, F1, MCC | ❌ | These measures are improper at clinically relevant decision thresholds; in addition, some measures are hard to interpret. |
| Sensitivity/recall and specificity | - | While improper on their own, they can be reported descriptively if reported together. However, they are largely theoretical measures, as they condition on the outcome being predicted. |
| PPV/precision and NPV | - | While improper on their own, they can be reported descriptively if reported together. PPV and NPV are more practical measures because they condition on the classification. |
| Classification plots | - | Classification plots could be presented descriptively, showing either sensitivity and specificity or PPV and NPV by threshold. |
| Clinical Utility | | |
| NB or standardized NB (with a decision curve), EC (with a cost curve) | ✅ | Important measures to quantify to what extent better decisions are made. Decision curves of NB allow one to show potential clinical utility at various clinically relevant decision thresholds relative to default decisions (and competing models). |
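A minimal sketch of net benefit (NB) at a decision threshold, the quantity traced by a decision curve. The risk scores and outcomes below are made up for illustration; in practice you would sweep the clinically relevant threshold range.

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 0, 1, 0, 0])          # hypothetical outcomes
risk   = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.6,
                   0.3, 0.8, 0.2, 0.5])                      # hypothetical predicted risks

def net_benefit(y, p, threshold):
    n = len(y)
    treat = p >= threshold                   # classify as positive at this threshold
    tp = np.sum(treat & (y == 1))
    fp = np.sum(treat & (y == 0))
    w = threshold / (1 - threshold)          # harm-to-benefit weight implied by the threshold
    return tp / n - (fp / n) * w

# Evaluate over a few thresholds to trace a (crude) decision curve.
for t in (0.1, 0.3, 0.5):
    print(t, round(net_benefit(y_true, risk, t), 3))
```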