I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simplistic linear regression, largely based on [1]. The next section of the crash course ([2]) dives into âgradient descentâ (GD), which raises the question âWhatâs wrong with the linear regression we just learned?â In short, the technique we just learned, Ordinary Least Squares (OLS), **does not scale**.

[3] clarifies linear regression can take a few forms depending input and processing constraints. Among these forms, OLS concerns one or more inputs where âall of the data must be available and you must have enough memory to fit the data and perform matrix operationsâ and uses least squares to find the best line. GD concerns âa very large dataset either in the number of rows or the number of columns that may not fit into memory.â As described by [4], OLS doesnât scale. GD scales by finding a ânumerical approximation âŚ by iterative methodâ.

[2] introduces GD by descending a parabola, but itâs unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that **weâre now focusing on loss functions**. (To be fair, in retrospect, the title is “Reducing loss”đ¤Śââď¸) [2] asserts âFor the kind of regression problems we’ve been examining, the resulting plot of loss vs. w_{1} will always be convexâ, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean *Squared* Error is an exponential function. This provides helpful context for [2]âs definition of âgradientâ as the derivative of the loss function.

I like the summary statement from [5]:

The goal of any Machine Learning Algorithm is to minimize the Cost Function

[5] uses the interactive exercise from [2]. Itâs reassuring to see convergence đ

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. Itâd be better phrased as trying to find the deepest valley. Anyway, it states âOur aim is to reach the minima which is the valley bottom. So our gradient should be negative always âŚ So if at our initial weights, the slope is negative, we are in the right directionâ, which explains the âdescentâ in âgradient descentâ.

[4] (like [2]) describes three forms of GD:

- Batch
- Stochastic
- Mini Batch

[2] defines âa batchâ as âthe total number of examples you use to calculate the gradient in a single iteration.â Presumably, itâs referring to Batch GD when it says âSo far, we’ve assumed that the batch has been the entire data set.â

[2] describes Stochastic as picking one example at random for each iteration, which would take forever and may operate on redundant data, which is common in large data sets.

[2] states Mini Batch âreduces the amount of noise in SGD but is still more efficient than full-batchâ because it uses batches of 10-1000 random examples, and that Mini Batch is whatâs used in practice.

When do we stop iterating? [2] states âyou iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.â

To summarize:

- Initialize with arbitrary weights
- Generate a model
- Sample (labeled) examples
- Input sample into the model
- Calculate the loss
- Compare the new loss with the previous loss
- If loss is decreasing
- Add the step value to the weight
- Repeat from step 2

## References

- Google Machine Learning Crash Course: “Descending into ML”
- Google Machine Learning Crash Course: “Reducing loss”
- Machine Learning Mastery: âLinear Regression for Machine Learningâ
- Towards Data Science: âOptimization: Ordinary Least Squares Vs. Gradient Descent â from scratchâ
- Towards Data Science: âUnderstanding the Mathematics behind Gradient Descentâ

## 3 thoughts on “Gradient descent”