I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Classification” module.

New metrics for evaluating classification performance:

- Accuracy
- Precision
- Recall
- ROC
- AUC

## Accuracy

“Accuracy” simply measures the percentage of correct predictions.

It fails on class-imbalance, aka “skewed class”, problems, though. Neptune AI states it bluntly: “You shouldn’t use accuracy on imbalanced problems.” Heuristic: is the percent accuracy greater than the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the “accuracy paradox”. Precision and recall are better suited to class-imbalance problems.
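
The accuracy paradox is easy to demonstrate. A minimal sketch with hypothetical toy data, not from the crash course:

```python
# A model that always predicts "disease-free" on a 99%-negative
# population scores 99% accuracy while learning nothing.
labels = [0] * 99 + [1]      # 99 negatives, 1 positive
predictions = [0] * 100      # always predict negative

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.99
```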

Tip: calculate odds independently if possible to compare with accuracy.

## Confusion matrix

A “confusion matrix”, aka “classification matrix”, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a “type I” error. A false negative is a “type II” error. When the cost of a missed positive is high, type II errors must be minimized. In other words, when the cost of error is high, maximize recall.
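
The four cells of the matrix can be tallied by hand. A sketch with hypothetical labels and predictions:

```python
# Tallying a binary confusion matrix.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))  # type I error
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))  # type II error
print(tp, tn, fp, fn)  # 3 3 1 1
```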

## Precision and recall

Andrew Ng’s “Lecture 11.4 – Machine Learning System Design | Trading Off Precision And Recall” provides a helpful phrasing:

- Precision = true positive / predicted positive
- Recall = true positive / actual positive
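
Ng’s phrasing translates directly to code. A sketch using the confusion-matrix counts from above (the counts are hypothetical):

```python
# Precision and recall from confusion-matrix counts.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # true positive / predicted positive
recall = tp / (tp + fn)     # true positive / actual positive
print(precision, recall)  # 0.75 0.75
```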

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn’t spam), it will fail recall and precision because it never has a true positive.

Wikipedia makes a point: “It is trivial to achieve recall of 100% by returning all documents in response to any query”.

Precision and recall are important, and in tension. Classification depends on a “threshold”. Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the “precision-recall curve” can also help demonstrate the relationship, as demonstrated by Andrew Ng.
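
The tension is visible by sweeping the threshold over a set of model scores. A sketch with hypothetical scores and labels:

```python
# Raising the threshold tends to raise precision and lower recall.
scores = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
labels = [0,   0,   1,    0,   1,   1]

def precision_recall(threshold):
    preds = [int(s >= threshold) for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(0.4))  # looser: more recall, less precision
print(precision_recall(0.8))  # stricter: more precision, less recall
```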

Wikipedia has a nice visualization differentiating precision and recall.

## ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuest’s “ROC and AUC, Clearly Explained!” states precision is a better metric than the false positive rate for class-imbalance problems because it doesn’t take true negatives into account.

Keras gives us AUC for a model, but what’s the corresponding threshold? The crash course clarifies: “AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.” Ok, then why use anything but AUC? Neptune AI summarizes: “… use it when you care equally about positive and negative classes.”
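
One way to see why AUC is threshold-invariant: it equals the probability that a randomly chosen positive example outranks a randomly chosen negative one, and no threshold appears anywhere in that calculation. A sketch with hypothetical scores:

```python
# AUC via pairwise ranking: P(score of a positive > score of a negative),
# counting ties as half a win. No classification threshold involved.
scores = [0.1, 0.3, 0.45, 0.6, 0.7, 0.9]
labels = [0,   0,   1,    0,   1,   1]

pos = [s for s, y in zip(scores, labels) if y == 1]
neg = [s for s, y in zip(scores, labels) if y == 0]

wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(auc)
```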

## Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something’s amiss.

The formal definition is: average predicted occurrence – average actual occurrence. There’s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.
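
The definition is a one-liner. A sketch with hypothetical predictions and labels:

```python
# Prediction bias = average predicted occurrence - average actual occurrence.
predicted = [0.9, 0.8, 0.2, 0.1, 0.6]
actual    = [1,   1,   0,   0,   0]

bias = sum(predicted) / len(predicted) - sum(actual) / len(actual)
print(bias)
```

A bias near zero is necessary but not sufficient, per the note above: always predicting the base rate also scores zero.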

The crash course gives a few causes for bias. StatQuest’s “Machine Learning Fundamentals: Bias and Variance” adds another: the inability of an ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification: predicted values are continuous probabilities, but actual values are discrete labels, so we need to bucket the examples and average within each bucket to make a comparison.
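
A minimal sketch of that bucketing, with hypothetical data and just two buckets:

```python
# Group examples into score buckets, then compare the mean predicted
# probability to the actual positive rate within each bucket.
predicted = [0.05, 0.15, 0.4, 0.55, 0.8, 0.95]
actual    = [0,    0,    0,   1,    1,   1]

buckets = {}  # bucket index -> (predictions, labels)
for p, y in zip(predicted, actual):
    i = min(int(p * 2), 1)  # two buckets: [0, 0.5) and [0.5, 1.0]
    buckets.setdefault(i, ([], []))
    buckets[i][0].append(p)
    buckets[i][1].append(y)

for i, (ps, ys) in sorted(buckets.items()):
    mean_predicted = sum(ps) / len(ps)
    actual_rate = sum(ys) / len(ys)
    print(i, mean_predicted, actual_rate)
```

A well-calibrated model would show the mean predicted probability tracking the actual rate across buckets.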