MLCC: Regularization for sparsity

I am working through Googleâ€™s Machine Learning Crash Course. The notes in this post cover the â€śRegularization for Sparsityâ€ť module.

Best-practice: if youâ€™re overfitting, you want to regularize.

“Convex Optimization” by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

• â€śA problem is sparse if each constraint function depends on only a small number of the variablesâ€ť
• â€śLike least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problemsâ€ť, which would explain why gradient descent is a tool we use
• Regularization is when â€śextra terms are added to the cost functionâ€ť
• “If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraint”, so it would seem performance is another motivation for regularization

Ideally, we could perform L0 normalization, but thatâ€™s non-convex, and so, NP-hard (slide 7). (I like Math is Fun’s NP-complete pageđź™‚ As noted wrt gradient descent, we need a convex loss curve to optimize. L1 approximates L0 and is easy to compute.

Quora provides a couple intuitive explanations for L1 and L2 norms: â€śL2 norm there yields Euclidean distance â€¦ The L1 norm gives rise to what can be referred to as the “taxi-cab” distanceâ€ť

Rorasa’s blog states â€śNorm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc â€¦ Because the lack of l0-normâ€™s mathematical representation, l0-minimisation is regarded by computer scientist as an NP-hard problem, simply says that itâ€™s too complex and almost impossible to solve. In many case, l0-minimisation problem is relaxed to be higher-order norm problem such as l1-minimisation and l2-minimisation.â€ť

The glossary summarizes:

• L1 regularization â€śpenalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0â€ť
• L2 regularization â€śpenalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0â€ť