I am working through Google's Machine Learning Crash Course. The notes in this post cover the "Regularization" module.

An earlier module focused on generalization (notes). A "generalization curve" visualizes generalization by showing loss for training data vs loss for validation data.

When training loss falls below validation loss and the gap keeps widening, we're "overfitting" to the training data, reducing generalization.

"Regularization" is the process of preventing overfitting. The TensorFlow docs also discuss regularization.

"Empirical risk minimization" refers to loss reduction using tools like gradient descent (notes).
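To make that concrete, here's a minimal sketch (mine, not from the course) of empirical risk minimization: gradient descent reducing mean squared error for a one-feature linear model.

```python
# Sketch: empirical risk minimization via gradient descent on a
# one-feature linear model (prediction = w * x) with squared loss.
def fit(xs, ys, learning_rate=0.01, steps=1000):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w

w = fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # converges toward the true weight, 2.0
```

Note there is no notion of model complexity here; the objective is purely "minimize loss on the data."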

"Structural risk minimization" refers to regularization by minimizing the complexity of the model.

The "L2 regularization" formula quantifies complexity as the sum of the squares of the feature weights.
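A quick sketch of that definition (the weight values are just made-up examples):

```python
# Sketch: the L2 regularization term is the sum of squared weights.
# Larger weights are penalized much more than small ones.
def l2_penalty(weights):
    return sum(w ** 2 for w in weights)

l2_penalty([0.2, -0.5, 5.0, 1.0, 0.25, 0.3])  # ≈ 26.44; the 5.0 dominates
```

Squaring means a single large weight contributes far more "complexity" than several small ones, which is why L2 regularization pushes weights toward zero.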

"Lambda" aka "regularization rate" governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting, for example, in a histogram of weights more tightly peaked around zero. A tool called Vizier can help optimize lambda.
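Putting the pieces together, a minimal sketch (my own, with made-up numbers) of how lambda scales the L2 penalty within the overall training objective:

```python
# Sketch: the regularized objective is data loss plus lambda times
# the L2 penalty. Lambda controls how strongly complexity is punished.
def regularized_loss(data_loss, weights, lam):
    return data_loss + lam * sum(w ** 2 for w in weights)

regularized_loss(0.1, [0.5, -0.5], lam=0.0)  # 0.1: lambda of 0 disables regularization
regularized_loss(0.1, [0.5, -0.5], lam=0.3)  # ≈ 0.25: penalty now contributes
```

With lambda at zero this reduces to plain empirical risk minimization; raising lambda shifts the optimizer's preference toward smaller weights.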

Helpful phrasing from StatQuest's "Machine Learning Fundamentals: Bias and Variance": regularization is one technique for finding a balance between a simple model (that may have high bias) and a complex model (that may have high variability).

## Exercise 1

The answer for task 1 in the first exercise notes the "relative weight" of lines from FEATURES to OUTPUT in the playground. What is "relative weight"? 🤔 Later, the second exercise mentions "The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line." So, "relative weight" in this context just refers to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: "The lines emanating from X_{1} and X_{2} are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features." Task 2 states "If we use a model that is too complicated, such as one with too many crosses …" Later, we learn "If model complexity is a function of weights …" Is complexity a function of crosses or weights? 🤔 I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this: e.g. a weight of ~1.5 with losses of 0.131 and 0.033, vs a weight of ~0.9 with losses of 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple of orthogonal components to "complexity".

## Exercise 2

An answer in the playground mentions: "While test loss decreases, training loss actually increases. This is expected, because you've added another term to the loss function to penalize complexity." 🤔 I think this is referring to the literal addition of the complexity term in the calculation to find a weight: `minimize(loss(data|model) + complexity(model))`.