Peter Norvig summarized the value of ML from a software engineering perspective in his “Introduction to Machine Learning” for Google’s Machine Learning Crash Course:

First, it gives you a tool to reduce the time you spend programming … Second, it will allow you to customize your products, making them better for specific groups of people … And third, machine learning lets you solve problems that you, as a programmer, have no idea how to do by hand.

From my perspective, the first two can be rephrased as:

Models add a new dimension to code reuse

For a class of problems, training models scales better than hand-writing code

There’s also a fourth point linked from the bottom of the intro:

Rule #1: Don't be afraid to launch a product without machine learning

That fourth point reminds me of the "build" vs "grow" domains – until we've built a product that lots of people find useful, statistics-based growth tools, like large-scale A/B testing, can be relatively high-cost, low-value. We might even say such optimizations only make sense once we have more users than can be efficiently contacted directly. Put another way, if we only have one user, and she says she only wants to see articles about sports, we don't need ML to predict her interests.

I think about these four points a lot, almost like a koan. They provide a helpful anchor as I try to distill a large amount of theory into tools I can apply to the problems I’m familiar with.

Best-practice: if you're overfitting, you want to regularize.

“Convex Optimization” by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

"A problem is sparse if each constraint function depends on only a small number of the variables"

"Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems", which would explain why gradient descent is a tool we use

Regularization is when "extra terms are added to the cost function"

"If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints", so it would seem performance is another motivation for regularization

Rorasa's blog states "Norm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc … Because the lack of l_{0}-norm's mathematical representation, l_{0}-minimisation is regarded by computer scientist as an NP-hard problem, simply says that it's too complex and almost impossible to solve. In many case, l_{0}-minimisation problem is relaxed to be higher-order norm problem such as l_{1}-minimisation and l_{2}-minimisation."

L_{1} regularization "penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L_{1} regularization helps drive the weights of irrelevant or barely relevant features to exactly 0"

L_{2} regularization "penalizes weights in proportion to the sum of the squares of the weights. L_{2} regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0"
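The two penalties above can be sketched in a few lines. This is a minimal illustration of the penalty terms themselves, not a full training loop; the function names are my own.

```python
# Sketch: the L1 and L2 regularization terms over a weight vector.

def l1_penalty(weights):
    # Sum of absolute values: tends to drive small weights to exactly 0.
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    # Sum of squares: shrinks large (outlier) weights toward 0,
    # but rarely all the way to 0.
    return sum(w * w for w in weights)

weights = [0.5, -0.25, 3.0, 0.0]
print(l1_penalty(weights))  # 3.75
print(l2_penalty(weights))  # 9.3125
```

Note how the L2 term is dominated by the outlier weight 3.0 (contributing 9.0 of the 9.3125), which is why L2 regularization pressures outlier weights hardest.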

"Logistic regression" generates a probability (a value between 0 and 1). It's also very efficient.

Note the glossary defines logistic regression as a classification model, which is weird since it has "regression" in the name. I suspect this is explained by "You can interpret the value between 0 and 1 in either of the following two ways: … a binary classification problem … As a value to be compared against a classification threshold …"

The "sigmoid" function, aka "logistic" function/transform, produces a bounded value between 0 and 1.

Note the sigmoid function is just y = 1 / (1 + e^(-z)), where z is our usual linear equation. I suppose we're transforming the linear output into a logistic form.
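The transform is short enough to write out directly. A minimal sketch, assuming z is the output of the linear part of the model:

```python
import math

def sigmoid(z):
    # Maps any real z to a value strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

# z is the usual linear output, e.g. z = b + w1*x1 + w2*x2.
print(sigmoid(0))   # 0.5
print(sigmoid(4))   # ~0.982
print(sigmoid(-4))  # ~0.018
```

A linear output of 0 lands exactly on 0.5, and large positive or negative z values saturate toward 1 and 0, which is the "asymptotic nature" mentioned in the regularization note below.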

Regularization (notes) is important in logistic regression. "Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions", esp L_{2} regularization and stopping early.

New metrics for evaluating classification performance:

Accuracy

Precision

Recall

ROC

AUC

Accuracy

"Accuracy" simply measures the percentage of correct predictions.

It fails on class-imbalance, aka "skewed class", problems, though. Neptune AI states it bluntly: "You shouldn't use accuracy on imbalanced problems." Heuristic: is the percent accuracy > the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the "accuracy paradox". Precision and recall are better suited to class-imbalance problems.
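The disease example above is easy to demonstrate with made-up data: a "model" that always predicts negative scores 99% accuracy while catching zero positives.

```python
# Accuracy paradox sketch: 99% disease-free population, and a model that
# always predicts "negative" (0). Data is hypothetical.
actual = [1] + [0] * 99   # 1 positive case among 100
predicted = [0] * 100     # the model never predicts positive

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(accuracy)  # 0.99 -- looks great, yet the model found no positives
```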

Tip: calculate odds independently if possible to compare with accuracy.

Confusion matrix

A "confusion matrix", aka "classification matrix", quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn't spam), it will fail recall and precision because it never has a true positive.

Wikipedia makes a point: "It is trivial to achieve recall of 100% by returning all documents in response to any query"

Precision and recall are important, and in tension. Classification depends on a "threshold". Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the "precision-recall curve" can also help demonstrate the relationship, as demonstrated by Andrew Ng.
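The tension shows up with even a handful of scored examples. A sketch with made-up scores and labels, sweeping the threshold:

```python
# Precision/recall trade-off as the classification threshold moves.
# Scores and labels are fabricated for illustration.
scores = [0.1, 0.3, 0.45, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1]

def precision_recall(threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(0.4))  # (0.75, 1.0): more positives, lower precision
print(precision_recall(0.7))  # (1.0, ~0.667): fewer positives, lower recall
```

Raising the threshold from 0.4 to 0.7 trades a false positive away for a false negative, which is exactly the conservative-vs-aggressive surgery trade-off.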

Keras gives us AUC for a model, but what's the corresponding threshold? The crash course clarifies: "AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen." Ok, then why use anything but AUC? Neptune AI summarizes: "… use it when you care equally about positive and negative classes."

Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something's amiss.

The formal definition is: average predicted occurrence – average actual occurrence. There's a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.
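The definition is a one-liner in practice. A sketch over fabricated predictions and outcomes:

```python
# Prediction bias: average predicted value minus average actual value.
# Both lists are made up for illustration.
predictions = [0.9, 0.8, 0.2, 0.1]  # model-output probabilities
actuals = [1, 1, 0, 0]              # observed binary outcomes

bias = sum(predictions) / len(predictions) - sum(actuals) / len(actuals)
print(bias)  # ~0: averages match, but a large |bias| would signal trouble
```

This also shows the caveat from the note: a model that always predicted 0.5 here would have zero bias too, while being useless for ranking individual examples.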

The crash course gives a few causes for bias. StatQuest's "Machine Learning Fundamentals: Bias and Variance" adds another: the inability of an ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are a probability range, but actual values are discrete, so we need to segment values and average them to make a comparison.
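That segment-and-average idea can be sketched directly: bucket the continuous predictions, then compare each bucket's mean prediction to its mean actual outcome. The data and the single 0.5 bucket edge are made up.

```python
# Bucketing predictions to compare against discrete actual outcomes.
pairs = [(0.1, 0), (0.2, 0), (0.3, 1), (0.7, 1), (0.8, 0), (0.9, 1)]

buckets = {"low": [], "high": []}
for pred, actual in pairs:
    key = "low" if pred < 0.5 else "high"
    buckets[key].append((pred, actual))

for name, items in buckets.items():
    mean_pred = sum(p for p, _ in items) / len(items)
    mean_actual = sum(a for _, a in items) / len(items)
    print(name, round(mean_pred, 2), round(mean_actual, 2))
```

With more buckets this becomes a calibration plot: each bucket contributes one (mean predicted, mean actual) point.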

An earlier module focused on generalization (notes). A â€śgeneralization curveâ€ť visualizes generalization by showing loss for training data vs loss for validation data.

When training loss is less than validation loss, we're "overfitting" to the training data, reducing generalization.

The "L2 regularization" formula quantifies complexity as the sum of the squares of the feature weights.

"Lambda" aka "regularization rate" governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting in a steeper histogram of weights, for example. A tool called Vizier can help optimize lambda.
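Lambda's role is just a scalar on the complexity term in the quantity being minimized, loss(data|model) + lambda * complexity(model). A sketch with made-up numbers:

```python
# How lambda scales the L2 complexity term in the total loss.
def total_loss(data_loss, weights, lam):
    l2 = sum(w * w for w in weights)  # complexity(model), L2-style
    return data_loss + lam * l2

weights = [1.0, -2.0]                  # l2 = 1 + 4 = 5
print(total_loss(0.5, weights, 0.0))   # 0.5 -- lambda 0 means no regularization
print(total_loss(0.5, weights, 0.1))   # 1.0 -- lambda 0.1 adds 0.1 * 5
```

Since the optimizer minimizes this sum, a larger lambda makes large weights more expensive, pulling the whole weight distribution toward zero.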

The answer for task 1 in the first exercise notes the "relative weight" of lines from FEATURE to OUTPUT in the playground. What is "relative weight"? 🤔 Later, the second exercise mentions "The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line." So, "relative weight" in this context is just referring to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: "The lines emanating from X_{1} and X_{2} are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features." Task 2 states "If we use a model that is too complicated, such as one with too many crosses …" Later, we learn "If model complexity is a function of weights …" Is complexity a function of crosses or weights? 🤔 I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this, eg weights of ~1.5 with losses 0.131 and 0.033, vs ~0.9 with losses 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple of orthogonal components to "complexity".

Exercise 2

An answer in the playground mentions: "While test loss decreases, training loss actually increases. This is expected, because you've added another term to the loss function to penalize complexity." 🤔 I think this is referring to the literal addition of the complexity term in the calculation to find a weight: minimize(loss(data|model) + complexity(model)).

"Feature cross", "feature cross product" and "synthetic feature" are synonymous. A feature cross is the cross product of two features. The nonlinearity sub-section states "The term cross comes from cross product." Thinking of it as a Cartesian product, which the glossary mentions, helps me grok what's going on, and why it's helpful for the example problem where examples are clustered by quarter (to consider x-y pairs), and esp the exercise involving latitude and longitude pairs.
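The Cartesian-product view is easy to make concrete: crossing two categorical features yields one synthetic feature per pair of values. The feature names below are invented for illustration.

```python
# Feature cross as a Cartesian product: every (quarter, region) pair
# becomes its own synthetic feature that can learn its own weight.
from itertools import product

quarters = ["Q1", "Q2", "Q3", "Q4"]
regions = ["north", "south"]

crossed = [f"{q}_x_{r}" for q, r in product(quarters, regions)]
print(len(crossed))  # 8 = 4 * 2 crossed features
print(crossed[0])    # Q1_x_north
```

This is also why crosses add complexity: binned latitude crossed with binned longitude multiplies out into one feature per grid cell.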

The video states "Linear learners use linear models". What is a "linear model"? Given "model" is synonymous with "equation" or "function", a "linear model" is a linear equation. For example, Brilliant's wiki states: "A linear model is an equation …" What is a "linear learner"? The video might just be stating a fact: something that learns using a linear model is a "linear learner". For example, Amazon SageMaker's Linear Learner docs state "The algorithm learns a linear function".

A "linear problem" describes a relationship that can be expressed using a straight line (to divide the input data). "Nonlinear problems" cannot be expressed this way.

While trying to figure out why the exercise used an indicator_column, I found some nice TensorFlow tutorials, eg for feature crosses. In retrospect, I see the indicator_column docs state simply "Represents multi-hot representation of given categorical column."

feature engineering is another topic which doesn't seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

I've heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that's appropriate and efficient for inputting into a model. The "scrubbing" sub-section actually seems like the only thing that fits what I previously thought of as "cleaning", eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

Good feature values should appear more than 5 or so times in a data set … avoid unique IDs

Keep data pure by not encoding exceptional states into a feature's value type, eg an integer feature where -1 means undefined, aka "magic" values. Instead, use boolean flags for exceptional states.
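A minimal sketch of that recommendation, assuming -1 is the magic "undefined" value and 0 is a reasonable neutral replacement; the helper name is my own.

```python
# Split a feature with a magic value into (clean value, is_defined flag),
# so the model sees a pure numeric feature plus a boolean feature.
def split_magic(value, magic=-1, default=0):
    is_defined = value != magic
    return (value if is_defined else default, is_defined)

print(split_magic(42))  # (42, True)
print(split_magic(-1))  # (0, False)
```

The model can then learn a weight for "this value was missing" separately from the weight on the numeric value itself.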

"Binning" seems similar to *-hot encoding in that we're enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.
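A sketch of the continuous side of that comparison: map a continuous value (say, latitude) to a bin index, so each bin can get its own weight just like each one-hot value does. The bin edges below are made up.

```python
# Binning a continuous value into an index over hypothetical latitude edges.
edges = [32.0, 34.0, 36.0, 38.0, 40.0]

def bin_index(value):
    # Returns which of the len(edges) + 1 bins the value falls into.
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

print(bin_index(33.7))  # 1
print(bin_index(41.2))  # 5
```

One-hot encoding the resulting index completes the parallel: the continuous feature becomes a small vector of enable/disable slots, one weight per bin.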

Histograms and stats, like those output by describe, can help detect bad data.