I am working through Google’s Machine Learning Crash Course. The notes in this post cover .

Earlier, I explored simplistic linear regression, largely based on . The next section of the crash course () dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.

 clarifies that linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by , OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.
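To make the "all of the data must be available" constraint concrete, here is a minimal sketch of closed-form least squares for a single input. The function name `ols_fit` and the toy data are my own, not from the course; the formula (covariance over variance) is the standard one for one feature.

```python
def ols_fit(xs, ys):
    """Closed-form least squares for y = w * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Both sums below need a full pass over every example, which is why
    # OLS assumes the whole data set is available at once.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


# Toy data (an assumption for illustration): points on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = ols_fit(xs, ys)
print(w, b)
```

There is no iteration here: one pass gives the answer, but only if everything fits in memory.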

 introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in  to parabolas. The distinction is that we’re now plotting the loss function rather than the model. (To be fair, in retrospect, the title is “Reducing loss”🤦‍♂️)  asserts “For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex”, i.e., a parabola. OLS takes all the data and computes an optimal line in one shot, whereas GD iteratively generates lines and judges whether it has reached the optimal one by comparing each iteration’s loss to the previous iteration’s.
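A quick way to see the parabola is to hold the data fixed and vary only the weight. This is a sketch under my own assumptions: a model `y' = w * x` with no bias, three toy points, and a helper named `mse`. For evenly spaced weights, a parabola has a constant second difference, and a positive one means the curve opens upward.

```python
def mse(w, xs, ys):
    """Mean squared error of the model y' = w * x (bias fixed at 0)."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy data (an assumption): generated by y = 2x, so the best weight is 2.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

weights = [0.0, 1.0, 2.0, 3.0, 4.0]
losses = [mse(w, xs, ys) for w in weights]

# Second differences of the loss sampled at evenly spaced weights.
second_diffs = [
    losses[i - 1] - 2 * losses[i] + losses[i + 1]
    for i in range(1, len(losses) - 1)
]
print(losses)
print(second_diffs)
```

The straight line lives in (x, y) space; the parabola lives in (weight, loss) space.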

 introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function.  clarifies that Mean Squared Error is a quadratic function: it squares each error, which is where the parabola comes from. This provides helpful context for ’s definition of “gradient” as the derivative of the loss function.
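Since the gradient is just the derivative of the loss, it can be written out directly for Mean Squared Error. The sketch below assumes a one-feature model `y' = w * x + b`; the function names and the toy data are mine. It also compares the analytic derivative with a finite-difference estimate as a sanity check.

```python
def mse(w, b, xs, ys):
    """Mean squared error of the model y' = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(w, b, xs, ys):
    """Partial derivatives of MSE with respect to w and b."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    dw = 2 * sum(e * x for e, x in zip(errors, xs)) / n
    db = 2 * sum(errors) / n
    return dw, db


# Toy data (an assumption): points on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.5, 0.0
dw, db = mse_gradient(w, b, xs, ys)

# Finite-difference estimate of d(loss)/dw for comparison.
eps = 1e-6
numeric_dw = (mse(w + eps, b, xs, ys) - mse(w - eps, b, xs, ys)) / (2 * eps)
print(dw, numeric_dw)
```

At the best-fit weights (`w = 2`, `b = 1` for this data) both partial derivatives are zero, which is the bottom of the bowl.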

I like the summary statement from 

The goal of any Machine Learning Algorithm is to minimize the Cost Function

 uses the interactive exercise from . It’s reassuring to see convergence 😉

 presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.
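My reading of the "descent" part is that the weight always moves against the sign of the slope: a negative slope pushes the weight up, a positive slope pushes it down. Here is a tiny sketch, assuming a made-up loss `(w - 3)^2` and a hypothetical `learning_rate` of 0.1.

```python
def step(w, gradient, learning_rate):
    """Move the weight one step against the gradient."""
    return w - learning_rate * gradient


def gradient_at(w):
    """Derivative of the toy loss (w - 3)^2, whose minimum is at w = 3."""
    return 2 * (w - 3)


# Left of the valley bottom the slope is negative, so the weight increases.
left = step(1.0, gradient_at(1.0), 0.1)
# Right of the valley bottom the slope is positive, so the weight decreases.
right = step(5.0, gradient_at(5.0), 0.1)
print(left, right)
```

Either way the step lands closer to the valley bottom at `w = 3`.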

 (like ) describes three forms of GD:

1. Batch
2. Stochastic
3. Mini Batch

 defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we’ve assumed that the batch has been the entire data set.”

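 describes Stochastic GD (SGD) as picking one example at random for each iteration. Each iteration is cheap, but the updates are very noisy, so it can take many iterations to converge. By contrast, a full batch over a large data set makes every iteration slow and probably includes redundant examples.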

 states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.
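As far as I can tell, the three forms differ only in how many examples feed each gradient calculation. A sketch of that idea, with a helper name (`sample_batch`), a data set, and a batch size of 32 that are all my own choices:

```python
import random


def sample_batch(examples, batch_size, rng):
    """Pick batch_size distinct examples at random."""
    return rng.sample(examples, batch_size)


rng = random.Random(0)  # seeded so the sketch is repeatable

# Toy labeled examples (an assumption): (x, y) pairs on the line y = 2x + 1.
examples = [(float(x), 2.0 * x + 1.0) for x in range(1000)]

full_batch = examples                         # Batch GD: every example
single = sample_batch(examples, 1, rng)       # Stochastic GD: one example
mini_batch = sample_batch(examples, 32, rng)  # Mini Batch GD: a small sample
print(len(full_batch), len(single), len(mini_batch))
```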

When do we stop iterating?  states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”
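One simple way to express "stops changing or changes extremely slowly" is a threshold on the difference between consecutive losses. The `tolerance` value and the loss history below are invented for illustration; in practice the threshold is a judgment call, and noisy mini-batch losses may need smoothing first.

```python
def has_converged(previous_loss, current_loss, tolerance=1e-6):
    """True when the loss changed by less than the tolerance."""
    return abs(previous_loss - current_loss) < tolerance


# A made-up loss history: big improvements first, then a plateau.
losses = [10.0, 4.0, 1.5, 1.2, 1.2000001]

# Index of the first iteration whose loss barely differs from the previous one.
stopped_at = next(
    i for i in range(1, len(losses)) if has_converged(losses[i - 1], losses[i])
)
print(stopped_at)
```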

To summarize:

1. Initialize with arbitrary weights
2. Generate a model
3. Sample (labeled) examples
4. Input sample into the model
5. Calculate the loss
6. Compare the new loss with the previous loss
7. If the loss is still decreasing
   1. Adjust each weight by the step value, moving against the gradient
   2. Repeat from step 2
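The steps above can be sketched as a small mini-batch loop. This is my own illustration rather than the course's code: the model is `y' = w * x + b`, the loss is Mean Squared Error, and the names and defaults (`train`, `learning_rate`, `batch_size`, `tolerance`, `max_steps`) are assumptions. For simplicity it measures the loss on the full data set after every step, which would be too expensive for a data set that doesn't fit in memory.

```python
import random


def predict(w, b, x):
    return w * x + b


def mse(w, b, batch):
    return sum((predict(w, b, x) - y) ** 2 for x, y in batch) / len(batch)


def gradient(w, b, batch):
    """Partial derivatives of MSE with respect to w and b over one batch."""
    n = len(batch)
    dw = 2 * sum((predict(w, b, x) - y) * x for x, y in batch) / n
    db = 2 * sum(predict(w, b, x) - y for x, y in batch) / n
    return dw, db


def train(examples, learning_rate=0.05, batch_size=8,
          max_steps=5000, tolerance=1e-12, seed=0):
    rng = random.Random(seed)
    w, b = 0.0, 0.0                               # arbitrary starting weights
    previous_loss = mse(w, b, examples)
    loss = previous_loss
    for _ in range(max_steps):
        batch = rng.sample(examples, batch_size)  # sample labeled examples
        dw, db = gradient(w, b, batch)            # slope of the loss on the batch
        w -= learning_rate * dw                   # step against the gradient
        b -= learning_rate * db
        loss = mse(w, b, examples)                # loss after the update
        if abs(previous_loss - loss) < tolerance:
            break                                 # loss stopped changing
        previous_loss = loss
    return w, b, loss


# Toy data (an assumption): 20 noise-free points on the line y = 2x + 1.
examples = [(i / 10, 2 * (i / 10) + 1) for i in range(20)]
w, b, loss = train(examples)
print(w, b, loss)
```

With noise-free data the weights should settle close to `w = 2` and `b = 1`; with real data the loss would level off at some value above zero.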