I amÂ working through Googleâ€™s Machine Learning Crash Course. The notes in this post cover theÂ â€śClassificationâ€ťÂ module.

New metrics for evaluating classification performance:

- Accuracy
- Precision
- Recall
- ROC
- AUC

## Accuracy

“Accuracy” simply measures percentage of correct predictions.

It fails on class-imbalance, aka â€śskewed classâ€ť, problems, though. Neptune AI states is bluntly: â€śYou shouldnâ€™t use accuracy on imbalanced problems.â€ť Heuristic: is the percent accuracy > the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the â€śaccuracy paradoxâ€ť. Precision and recall are better suited to class-imbalance problems.

Tip: calculate odds independently if possible to compare with accuracy.

## Confusion matrix

A â€śconfusion matrixâ€ť, aka â€śclassification matrixâ€ť, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a â€śtype oneâ€ť error. A false negative is a â€śtype twoâ€ť error. When the cost of error is high, type two must be minimized. In other words, when the cost of error is high, maximize recall.

## Precision and recall

Andrew Ngâ€™s â€śLecture 11.4 â€” Machine Learning System Design | Trading Off Precision And Recallâ€ť provides a helpful phrasing:

- Precision = true positive / predicted positive
- Recall = true positive / actual positive

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isnâ€™t spam), it will fail recall and precision because it never has a true positive.

Wikipedia makes a point: â€śIt is trivial to achieve recall of 100% by returning all documents in response to any queryâ€ť

Precision and recall are important, and in tension. Classification depends on a â€śthresholdâ€ť. Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the â€śprecision-recall curveâ€ť can also help demonstrate the relationship, as demonstrated by Andrew Ng.

Wikipedia has a nice visualization differentiating precision and recall:

## ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuestâ€™s â€śROC and AUC, Clearly Explained!â€ť states precision is a better metric than the false positive rate for class imbalance problems because it doesnâ€™t take true negatives into account.

Keras gives us AUC for a model, but whatâ€™s the corresponding threshold? The crash course clarifies: â€śAUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.â€ť Ok, then why use anything but AUC? Neptune AI summarizes: â€ś… use it when you care equally about positive and negative classes.â€ť

## Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates somethingâ€™s amiss.

The formal definition is: average predicted occurrence – average actual occurrence. Thereâ€™s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.

The crash course gives a few causes for bias. StatQuestâ€™s â€śMachine Learning Fundamentals: Bias and Varianceâ€ť adds another: the inability of a ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are a probability range, but actual values are discrete, so we need to segment values and average them to make a comparison.