# Regularization

I amÂ working through Googleâ€™s Machine Learning Crash Course. The notes in this post cover theÂ â€śRegularizationâ€ť module.

An earlier module focused on generalization (notes). A â€śgeneralization curveâ€ť visualizes generalization by showing loss for training data vs loss for validation data.

When training loss is less than validation loss, weâ€™re â€śoverfittingâ€ť to the training data, reducing generalization.

â€śRegularizationâ€ť is the process of preventing overfitting. The TensorFlow docs also discuss regularization.

â€śEmpirical risk minimizationâ€ť refers to loss reduction using tools like gradient descent (notes).

â€śStructural risk minimizationâ€ť refers to regularization by minimizing the complexity of the model.

The â€śL2 regularizationâ€ť formula quantifies complexity as the sum of the squares of the feature weights.

â€śLambdaâ€ť aka â€śregularization rateâ€ť governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting in a steeper histogram of weights, for example. A tool called Vizier can help optimize lambda.

Helpful phrasing from StatQuestâ€™s “Machine Learning Fundamentals: Bias and Variance”: regularization is one technique for finding a balance between a simple model (that may have high bias) and a complex model (that may have high variability).

## Exercise 1

The answer for task 1 in the first exercise, notes the â€śrelative weightâ€ť of lines from FEATURE to OUTPUT in the playground. What is “relative weight”? đź¤” Later, the second exercise mentions â€śThe relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line.â€ť So, â€śrelative weightâ€ť in this context is just referring to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: â€śThe lines emanating from X1 and X2 are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.â€ť Task 2 states â€śIf we use a model that is too complicated, such as one with too many crosses …â€ť Later, we learn â€śIf model complexity is a function of weights …â€ť Is complexity a function of crosses or weights? đź¤”  I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this, eg ~1.5, 0.131 and 0.033, respectively, vs ~0.9 with losses 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple, orthogonal components to â€ścomplexityâ€ť.

## Exercise 2

An answer in the playground mentions: â€śWhile test loss decreases, training loss actually increases. This is expected, because you’ve added another term to the loss function to penalize complexity.â€ť đź¤”  I think this is referring to the literal addition of the complexity term in the calculation to find a weight ( `minimize(loss(data|model)) + complexity(model)`).