I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].
Earlier, I explored simple linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.
[3] clarifies that linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS covers one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD handles “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale; GD scales by finding a “numerical approximation … by iterative method”.
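To make the contrast concrete, here is a minimal OLS sketch (my own toy example in NumPy, not code from [3] or [4]): the closed-form solution needs the whole design matrix in memory at once, which is exactly the constraint that GD sidesteps by iterating.

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(0, 1, size=100)

# OLS solves the normal equations (X^T X) w = X^T y in one shot,
# which requires the entire design matrix X to sit in memory at once.
X = np.column_stack([x, np.ones_like(x)])         # one column per weight: slope and intercept
slope, intercept = np.linalg.solve(X.T @ X, X.T @ y)
print(slope, intercept)                           # roughly 2 and 1
```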
[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that we’re now focusing on loss functions. (To be fair, in retrospect, the title is “Reducing loss” 🤦) [2] asserts “For the kind of regression problems we've been examining, the resulting plot of loss vs. w1 will always be convex”, i.e. a parabola. OLS takes all the data and computes an optimal line in one shot, whereas GD iteratively generates lines and judges each one by comparing its loss to the previous iteration’s.
[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error (MSE) as a common loss function. [5] walks through the math of MSE, which is a quadratic function of the weights (hence the parabola). This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.
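To see why, here is the loss for a one-feature linear model written out (my own derivation, not a formula quoted from [2] or [5]): MSE is quadratic in the weight w, which is why the loss-vs-w1 plot is a parabola, and its partial derivatives are the gradient that GD follows.

$$
\mathrm{MSE}(w, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - (w x_i + b)\bigr)^2
$$

$$
\frac{\partial\,\mathrm{MSE}}{\partial w} = -\frac{2}{N}\sum_{i=1}^{N} x_i \bigl(y_i - (w x_i + b)\bigr),
\qquad
\frac{\partial\,\mathrm{MSE}}{\partial b} = -\frac{2}{N}\sum_{i=1}^{N} \bigl(y_i - (w x_i + b)\bigr)
$$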
I like the summary statement from [5]:
The goal of any Machine Learning Algorithm is to minimize the Cost Function
[5] uses the interactive exercise from [2]. It’s reassuring to see convergence 🙂
[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.
This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.
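That sign argument is easy to check in a tiny 1D sketch (my own, with an arbitrary loss of (w - 3)² and an arbitrary learning rate): because the update subtracts the gradient, a negative slope pushes the weight to the right, toward the valley bottom.

```python
# Minimal 1D gradient descent on the loss L(w) = (w - 3)^2,
# whose minimum ("valley bottom") is at w = 3.
def loss(w):
    return (w - 3) ** 2

def gradient(w):
    return 2 * (w - 3)           # dL/dw

w = 0.0                          # start left of the minimum, where the slope is negative
learning_rate = 0.1
for step in range(25):
    w -= learning_rate * gradient(w)   # subtracting a negative gradient moves w to the right

print(round(w, 3), round(loss(w), 5))  # w is ~3, loss is ~0: we descended into the valley
```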
[4] (like [2]) describes three forms of GD:
- Batch
- Stochastic
- Mini Batch
[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we've assumed that the batch has been the entire data set.”
[2] describes Stochastic GD as picking one example at random for each iteration, which would take forever and may repeatedly operate on the redundant data that is common in large data sets.
[2] states Mini Batch GD “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.
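Here is one way to picture the difference in code (a sketch of my own, assuming a single-feature linear model with MSE; only the 10-1000 batch-size range comes from [2]): all three variants compute the same kind of gradient, they just average it over different numbers of examples per iteration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_gradient(w, b, x, y):
    """Gradient of MSE for y_hat = w*x + b over the given examples."""
    error = (w * x + b) - y
    return 2 * np.mean(error * x), 2 * np.mean(error)

def full_batch_gradient(w, b, x, y):
    # Batch GD: the batch is the entire data set
    return mse_gradient(w, b, x, y)

def stochastic_gradient(w, b, x, y):
    # Stochastic GD: a single example chosen at random per iteration (noisy)
    i = rng.integers(len(x))
    return mse_gradient(w, b, x[i:i+1], y[i:i+1])

def mini_batch_gradient(w, b, x, y, batch_size=32):
    # Mini Batch GD: a small random batch (e.g. 10-1000 examples) per iteration
    idx = rng.choice(len(x), size=batch_size, replace=False)
    return mse_gradient(w, b, x[idx], y[idx])
```

All three plug into the same weight-update loop; only the sampling strategy, and therefore the noise and cost per iteration, changes.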
When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”
To summarize (sketched in code below):
1. Initialize with arbitrary weights
2. Generate a model
3. Sample (labeled) examples
4. Input the sample into the model
5. Calculate the loss
6. Compare the new loss with the previous loss
7. If the loss is still decreasing:
   - Add the step value to the weight
   - Repeat from step 2
8. Otherwise, the loss has stopped changing: the model has converged and we can stop
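Putting those steps into code, here is a rough sketch of the whole loop (my own toy example; the data, learning rate, and convergence tolerance are made up, and I use the full batch each iteration for simplicity):

```python
import numpy as np

# Toy labeled examples: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=500)
y = 2 * x + 1 + rng.normal(0, 1, size=500)

w, b = 0.0, 0.0                        # step 1: arbitrary initial weights
learning_rate = 0.01
previous_loss = float("inf")

for iteration in range(10_000):
    # steps 2-4: the current (w, b) is the model; the "sample" here is the
    # whole data set for simplicity (a mini-batch would also work)
    error = (w * x + b) - y

    loss = np.mean(error ** 2)         # step 5: calculate the loss (MSE)

    # steps 6-8: compare with the previous loss; once it has (nearly) stopped
    # changing, the model has converged
    if abs(previous_loss - loss) < 1e-6:
        print(f"converged after {iteration} iterations: w={w:.2f}, b={b:.2f}")
        break
    previous_loss = loss

    # step 7: step the weights along the negative gradient of the loss
    w -= learning_rate * 2 * np.mean(error * x)
    b -= learning_rate * 2 * np.mean(error)
```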
References
- [1] Google Machine Learning Crash Course: “Descending into ML”
- [2] Google Machine Learning Crash Course: “Reducing loss”
- [3] Machine Learning Mastery: “Linear Regression for Machine Learning”
- [4] Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent - from scratch”
- [5] Towards Data Science: “Understanding the Mathematics behind Gradient Descent”