Regularization for sparsity

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization for Sparsity” module.

Best-practice: if you’re overfitting, you want to regularize.

“Convex Optimization” by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

  • “A problem is sparse if each constraint function depends on only a small number of the variables”
  • “Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems”, which would explain why gradient descent is a tool we use
  • Regularization is when “extra terms are added to the cost function”
  • “If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints”, so it would seem performance is another motivation for regularization

Ideally, we could perform L0 regularization, but that’s non-convex, and so, NP-hard (slide 7). (I like Math is Fun’s NP-complete page 🙂) As noted wrt gradient descent, we need a convex loss curve to optimize. L1 approximates L0 and is easy to compute.

Quora provides a couple intuitive explanations for L1 and L2 norms: “L2 norm there yields Euclidean distance … The L1 norm gives rise to what can be referred to as the “taxi-cab” distance”

Rorasa’s blog states “Norm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc … Because the lack of l0-norm’s mathematical representation, l0-minimisation is regarded by computer scientist as an NP-hard problem, simply says that it’s too complex and almost impossible to solve. In many case, l0-minimisation problem is relaxed to be higher-order norm problem such as l1-minimisation and l2-minimisation.”

The glossary summarizes:

  • L1 regularization “penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0”
  • L2 regularization “penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0”
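
To make the contrast concrete, here is a minimal sketch (the weight values are made up) computing both penalties for a single weight vector:

```python
import numpy as np

# Hypothetical weight vector; the values are made up for illustration.
weights = np.array([0.0, -0.5, 3.0, 0.001])

# L1: sum of absolute values. Its penalty pushes small weights to exactly 0.
l1_penalty = np.sum(np.abs(weights))

# L2: sum of squares. Its penalty shrinks large weights toward 0, but rarely to exactly 0.
l2_penalty = np.sum(weights ** 2)

print(l1_penalty)  # 3.501
print(l2_penalty)  # 9.250001
```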

Logistic regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Logistic Regression” module.

“Logistic regression” generates a probability (a value between 0 and 1). It’s also very efficient.

Note the glossary defines logistic regression as a classification model, which is weird since it has “regression” in the name. I suspect this is explained by “You can interpret the value between 0 and 1 in either of the following two ways: … a binary classification problem … As a value to be compared against a classification threshold …”

The “sigmoid” function, aka “logistic” function/transform, produces a bounded value between 0 and 1.

Note the sigmoid function is just y = 1 / (1 + e^(-𝞼)), where 𝞼 is our usual linear equation. I suppose we’re transforming the linear output into a logistic form.
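
A minimal sketch, writing the linear output as z:

```python
import numpy as np

def sigmoid(z):
    """Map the linear output z (w·x + b) to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))   # 0.5
print(sigmoid(4.0))   # ~0.982
print(sigmoid(-4.0))  # ~0.018
```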

Regularization (notes) is important in logistic regression: “Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions.” L2 regularization and early stopping are the strategies called out for counteracting this.

The “logit”, aka “log-odds”, function is the inverse of the logistic function.

The loss function for logistic regression is “log loss”.
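
A small sketch of both, on made-up values, showing that the logit recovers the linear output and how log loss scores predicted probabilities:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Log-odds: the inverse of the sigmoid/logistic function."""
    return np.log(p / (1.0 - p))

def log_loss(y_true, y_pred):
    """Average log loss, where y_pred are predicted probabilities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(logit(sigmoid(2.5)))                    # ~2.5: the logit recovers the linear output
print(log_loss([1, 0, 1], [0.9, 0.2, 0.6]))   # ~0.28 on made-up labels/predictions
```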

Classification

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Classification” module.

New metrics for evaluating classification performance:

  • Accuracy
  • Precision
  • Recall
  • ROC
  • AUC

Accuracy

“Accuracy” simply measures percentage of correct predictions.

It fails on class-imbalance, aka “skewed class”, problems, though. Neptune AI states it bluntly: “You shouldn’t use accuracy on imbalanced problems.” Heuristic: is the accuracy actually better than the class imbalance itself? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the “accuracy paradox”. Precision and recall are better suited to class-imbalance problems.

Tip: calculate odds independently if possible to compare with accuracy.
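
A toy sketch of the paradox, with a made-up 99% disease-free population and a model that always predicts negative:

```python
import numpy as np

# Made-up population: 1% has the disease, 99% is disease-free.
labels = np.array([1] * 10 + [0] * 990)

# A "model" with no intelligence: always predict disease-free.
predictions = np.zeros_like(labels)

accuracy = np.mean(predictions == labels)
print(accuracy)  # 0.99, which looks impressive, yet the model never catches a single case
```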

Confusion matrix

A “confusion matrix”, aka “classification matrix”, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a “type one” error. A false negative is a “type two” error. When the cost of a false negative is high, type two errors must be minimized. In other words, maximize recall.

Precision and recall

Andrew Ng’s “Lecture 11.4 — Machine Learning System Design | Trading Off Precision And Recall” provides a helpful phrasing:

  • Precision = true positive / predicted positive
  • Recall = true positive / actual positive
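
A minimal sketch, with made-up labels and predictions, tallying the confusion-matrix counts and plugging them into those formulas:

```python
import numpy as np

# Made-up labels and predictions for illustration.
y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives (type one errors)
fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives (type two errors)

precision = tp / (tp + fp)  # true positive / predicted positive
recall = tp / (tp + fn)     # true positive / actual positive

print(precision, recall)  # 0.666..., 0.5
```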

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn’t spam), it will score poorly on recall and precision because it never produces a true positive.

Wikipedia makes a point: “It is trivial to achieve recall of 100% by returning all documents in response to any query”

Precision and recall are important, and in tension. Classification depends on a “threshold”. Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases the risk of false positive. Plotting the “precision-recall curve” can also help illustrate the relationship, as Andrew Ng demonstrates.
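
A small sketch of that tension on made-up scores; raising the threshold here raises precision and lowers recall:

```python
import numpy as np

# Made-up predicted probabilities and true labels.
scores = np.array([0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10])
y_true = np.array([1,    1,    0,    1,    1,    0,    0,    0])

def precision_recall(threshold):
    y_pred = (scores >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))  # (0.75, 0.75)
print(precision_recall(0.8))  # (1.0, 0.5): higher threshold, more precision, less recall
```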

Wikipedia has a nice visualization differentiating precision and recall.

ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuest’s “ROC and AUC, Clearly Explained!” states precision is a better metric than the false positive rate for class imbalance problems because it doesn’t take true negatives into account.

Keras gives us AUC for a model, but what’s the corresponding threshold? The crash course clarifies: “AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.” Ok, then why use anything but AUC? Neptune AI summarizes: “… use it when you care equally about positive and negative classes.”
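
A sketch of reading a threshold back off an ROC curve; it assumes scikit-learn (which the crash course doesn’t use here) and made-up data, and picks the threshold by maximizing TPR - FPR (Youden’s J statistic), which is just one common heuristic:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve  # assumes scikit-learn is installed

# Made-up labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # the points of the ROC curve
print(roc_auc_score(y_true, y_score))              # the area under that curve

# One common way to read a threshold back off the curve: maximize TPR - FPR.
print(thresholds[np.argmax(tpr - fpr)])
```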

Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something’s amiss.

The formal definition: prediction bias = average of predictions - average of actual occurrences. There’s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.

The crash course gives a few causes for bias. StatQuest’s “Machine Learning Fundamentals: Bias and Variance” adds another: the inability of a ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are probabilities over a range, but actual values are discrete (0 or 1), so we need to bucket the predictions and compare averages within each bucket.
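
A sketch of both ideas, overall prediction bias plus the bucketed comparison, on made-up data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up predicted probabilities and 0/1 labels (drawn so the model is roughly calibrated).
predictions = rng.uniform(0, 1, size=1000)
labels = (rng.uniform(0, 1, size=1000) < predictions).astype(int)

# Prediction bias: average prediction minus average actual occurrence (near 0 here).
print(predictions.mean() - labels.mean())

# Labels are discrete but predictions are probabilities, so bucket the predictions
# and compare each bucket's average prediction to its average label.
df = pd.DataFrame({"prediction": predictions, "label": labels})
df["bucket"] = pd.cut(df["prediction"], bins=10)
print(df.groupby("bucket", observed=True)[["prediction", "label"]].mean())
```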

“The Rise and Fall of Getting Things Done” by Cal Newport

Cal Newport, who wrote a book I like called Deep Work, recently wrote an article “The Rise and Fall of Getting Things Done” on the decline in popularity of productivity tools, largely due to their inefficacy dealing with a steadily increasing onslaught of said things.

I’ve tried GTD and various todo apps and can relate. Inbox zero still has value for me, though I’ve had more success with regular, aggressive purging than meticulous categorization. The former has also been an effective strategy in general, eg binary prioritization.

I’ve found team autonomy to be an effective goal, and was interested to learn about a history of the term in the workplace:

[Peter] Drucker argued that autonomy would be the central feature of the new corporate world

Newport makes a distinction that autonomy doesn’t mean isolation:

Productivity, we must recognize, can never be entirely personal

I agree. Prolonged isolation is an anti-pattern, but I still think there’s value in reserving time for focus work, focusing on the things we can change, and ensuring folks have what they need to make the changes they’re tasked with.

Newport has a great insight regarding overload being caused by a lack of awareness into other’s time:

Because so much of our effort in the office now unfolds in rapid exchanges of digital messages, it’s convenient to allow our in-boxes to become an informal repository for everything we need to get done. This strategy, however, obscures many of the worst aspects of overload culture. When I don’t know how much is currently on your plate, it’s easy for me to add one more thing […] Consider instead a system that externalizes work [emphasis added]. Following the lead of software developers, we might use virtual task boards, where every task is represented by a card that specifies who is doing the work, and is pinned under a column indicating its status. With a quick glance, you can now ascertain everything going on within your team and ask meaningful questions about how much work any one person should tackle at a time

The article repeatedly reminded me of agile, which is eventually alluded to: “Following the lead of software developers … What if you began each morning with a status meeting in which your team confronts its task board? …”

I like the idea of constraining input to status meetings. It’s a challenge in practice, but worth exploring in principle.

Regularization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization” module.

An earlier module focused on generalization (notes). A “generalization curve” visualizes generalization by showing loss for training data vs loss for validation data.

When training loss keeps falling but validation loss starts rising, we’re “overfitting” to the training data, reducing generalization.

“Regularization” is the process of preventing overfitting. The TensorFlow docs also discuss regularization.

“Empirical risk minimization” refers to loss reduction using tools like gradient descent (notes).

“Structural risk minimization” refers to regularization by minimizing the complexity of the model.

The “L2 regularization” formula quantifies complexity as the sum of the squares of the feature weights.

“Lambda” aka “regularization rate” governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting in a steeper histogram of weights, for example. A tool called Vizier can help optimize lambda.
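
A minimal sketch of how lambda scales the L2 complexity term before it’s added to the data loss; the weights, loss value and lambda are all made up:

```python
import numpy as np

# Made-up weights, data loss, and lambda, purely for illustration.
weights = np.array([0.2, -1.3, 0.8])
data_loss = 0.45   # eg a mean squared error computed on the training data
lambda_ = 0.01     # the regularization rate

l2_complexity = np.sum(weights ** 2)             # sum of the squares of the weights
total_loss = data_loss + lambda_ * l2_complexity

print(l2_complexity)  # 2.37
print(total_loss)     # 0.4737
```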

Helpful phrasing from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: regularization is one technique for finding a balance between a simple model (that may have high bias) and a complex model (that may have high variance).

Exercise 1

The answer for task 1 in the first exercise notes the “relative weight” of lines from FEATURES to OUTPUT in the playground. What is “relative weight”? 🤔 Later, the second exercise mentions “The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line.” So, “relative weight” in this context is just referring to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: “The lines emanating from X1 and X2 are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.” Task 2 states “If we use a model that is too complicated, such as one with too many crosses …” Later, we learn “If model complexity is a function of weights …” Is complexity a function of crosses or weights? 🤔 I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this, eg ~1.5, 0.131 and 0.033, respectively, vs ~0.9 with losses 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple of orthogonal components to “complexity”.

Exercise 2

An answer in the playground mentions: “While test loss decreases, training loss actually increases. This is expected, because you’ve added another term to the loss function to penalize complexity.” 🤔 I think this is referring to the literal addition of the complexity term in the objective being minimized: minimize(loss(data|model) + complexity(model)).

Feature crosses

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Feature Crosses” section.

“Feature cross”, “feature cross product” and “synthetic feature” are synonymous. A feature cross is the cross product of two features. The nonlinearity sub-section states “The term cross comes from cross product.” Thinking of it as a Cartesian product, which the glossary mentions, helps me grok what’s going on, and why it’s helpful for the example problem where examples are clustered by quarter (to consider x-y pairs), and esp the exercise involving latitude and longitude pairs.

The video states “Linear learners use linear models”. What is a “linear model”? Given “model” is synonymous with “equation” or “function”, a “linear model” is a linear equation. For example, Brilliant’s wiki states: “A linear model is an equation …” What is a “linear learner”? The video might just be stating a fact: something that learns using a linear model is a “linear learner”. For example, Amazon SageMaker’s Linear Learner docs state “The algorithm learns a linear function”.

A “linear problem” describes a relationship that can be expressed using a straight line (to divide the input data). “Nonlinear problems” cannot be expressed this way.

While trying to figure out why the exercise used an indicator_column, I found some nice TensorFlow tutorials, eg for feature crosses. In retrospect, I see the indicator_column docs state simply “Represents multi-hot representation of given categorical column.”
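
For reference, a sketch along the lines of that exercise, crossing bucketized latitude and longitude with the (now legacy) tf.feature_column API and wrapping the cross in an indicator_column; the bucket boundaries and sizes are made up:

```python
import tensorflow as tf

# Bucketize latitude and longitude; the boundaries here are made up.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")
lat_buckets = tf.feature_column.bucketized_column(latitude, boundaries=[33.0, 34.0, 35.0])
lon_buckets = tf.feature_column.bucketized_column(longitude, boundaries=[-121.0, -119.0, -117.0])

# The cross is (conceptually) the Cartesian product of the two bucketized columns,
# hashed into a fixed number of buckets.
lat_x_lon = tf.feature_column.crossed_column([lat_buckets, lon_buckets], hash_bucket_size=100)

# indicator_column provides the multi-hot representation the docs describe.
crossed_feature = tf.feature_column.indicator_column(lat_x_lon)
```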

Representation

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Representation” section.

feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

Scott Locklin, in “Neglected machine learning ideas”, via Machine Learning Mastery’s feature engineering overview

I’ve heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that’s appropriate and efficient for inputting into a model. The “scrubbing” sub-section actually seems like the only thing that fits what I previously thought of as “cleaning”, eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

  • Good feature values should appear more than 5 or so times in a data set … avoid unique IDs
  • Keep data pure by not encoding exceptional states into a feature’s value type, eg an integer feature where -1 means undefined, aka “magic” values. Instead, use boolean flags for exceptional states.

The “Z score” scales values as follows: scaled = (value - mean) / stdev. Math is Fun has a good explanation for how to derive the standard deviation, but Pandas also provides it trivially in the output from describe.
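
A quick sketch on made-up values, computing the Z score by hand alongside describe:

```python
import pandas as pd

# Made-up feature values.
values = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0])

mean = values.mean()
stdev = values.std()               # also reported by describe()
z_scores = (values - mean) / stdev

print(values.describe())           # mean and std appear here
print(z_scores)
```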

“Binning” seems similar to *-hot encoding in that we’re enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.
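
A small sketch contrasting the two on made-up values: binning a continuous feature and one-hot encoding the bins, vs one-hot encoding a discrete feature directly:

```python
import pandas as pd

# Continuous feature: bin it, then one-hot encode the bins. Bin edges are made up.
ages = pd.Series([3, 17, 25, 48, 63, 80])
age_bins = pd.cut(ages, bins=[0, 18, 35, 65, 100],
                  labels=["child", "young adult", "middle age", "senior"])
print(pd.get_dummies(age_bins))

# Discrete feature: one-hot encode the values directly.
colors = pd.Series(["red", "green", "blue", "green"])
print(pd.get_dummies(colors))
```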

Histograms and stats, like those output by describe, can help detect bad data.