# MLCC: Classification

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Classification” module.

New metrics for evaluating classification performance:

• Accuracy
• Precision
• Recall
• ROC
• AUC

## Accuracy

“Accuracy” simply measures percentage of correct predictions.

It fails on class-imbalance, aka “skewed class”, problems, though. Neptune AI states is bluntly: “You shouldn’t use accuracy on imbalanced problems.” Heuristic: is the percent accuracy > the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the “accuracy paradox”. Precision and recall are better suited to class-imbalance problems.

Tip: calculate odds independently if possible to compare with accuracy.

## Confusion matrix

A “confusion matrix”, aka “classification matrix”, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a “type one” error. A false negative is a “type two” error. When the cost of error is high, type two must be minimized. In other words, when the cost of error is high, maximize recall.

## Precision and recall

Andrew Ng’s “Lecture 11.4 — Machine Learning System Design | Trading Off Precision And Recall” provides a helpful phrasing:

• Precision = true positive / predicted positive
• Recall = true positive / actual positive

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn’t spam), it will fail recall and precision because it never has a true positive.

Wikipedia makes a point: “It is trivial to achieve recall of 100% by returning all documents in response to any query”

Precision and recall are important, and in tension. Classification depends on a “threshold”. Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the “precision-recall curve” can also help demonstrate the relationship, as demonstrated by Andrew Ng.

## ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuest’s “ROC and AUC, Clearly Explained!” states precision is a better metric than the false positive rate for class imbalance problems because it doesn’t take true negatives into account.

Keras gives us AUC for a model, but what’s the corresponding threshold? The crash course clarifies: “AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.” Ok, then why use anything but AUC? Neptune AI summarizes: “… use it when you care equally about positive and negative classes.”

## Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something’s amiss.

The formal definition is: average predicted occurrence – average actual occurrence. There’s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.

The crash course gives a few causes for bias. StatQuest’s “Machine Learning Fundamentals: Bias and Variance” adds another: the inability of a ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are a probability range, but actual values are discrete, so we need to segment values and average them to make a comparison.