MLCC: Gradient descent

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simplistic linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.

[3] clarifies linear regression can take a few forms depending input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.

[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that we’re now focusing on loss functions. (To be fair, in retrospect, the title is “Reducing loss”🤦‍♂️) [2] asserts “For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex”, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean Squared Error is an exponential function. This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.

I like the summary statement from [5]

The goal of any Machine Learning Algorithm is to minimize the Cost Function

[5] uses the interactive exercise from [2]. It’s reassuring to see convergence 😉

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.

[4] (like [2]) describes three forms of GD:

  1. Batch
  2. Stochastic
  3. Mini Batch

[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we’ve assumed that the batch has been the entire data set.”

[2] describes Stochastic as picking one example at random for each iteration, which would take forever and may operate on redundant data, which is common in large data sets.

[2] states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.

When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”

To summarize:

  1. Initialize with arbitrary weights
  2. Generate a model
  3. Sample (labeled) examples
  4. Input sample into the model
  5. Calculate the loss
  6. Compare the new loss with the previous loss
  7. If loss is decreasing
    1. Add the step value to the weight
    2. Repeat from step 2


  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “Reducing loss”
  3. Machine Learning Mastery: “Linear Regression for Machine Learning”
  4. Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent — from scratch”
  5. Towards Data Science: “Understanding the Mathematics behind Gradient Descent”

MLCC: Linear regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] and [2].

A lot of ML quickstarts dive right into jargon like model, feature, y’, L2, etc, which makes it hard for me to learn the basics – “what are we doing and why?”

The crash course also presents some jargon, but at least explains each concept and links to a glossary, which makes it easier to learn.

After a few days of poking around, one piece of jargon seems irreducible: linear regression. In other words, this is the kind of basic ML concept I’ve been looking for. This is where I’d start if I was helping someone learn ML.

I probably learned about linear regression in the one statistics class I took in college, but have forgotten about it after years of string parsing 🙂

The glossary entry for linear regression describes it as “Using the raw output (y’) of a linear model as the actual prediction in a regression model”, which is still too dense for me.

The linear regression module of the crash course is closer to my level:

Linear regression is a method for finding the straight line … that best fits a set of points.

The crash course provides a good example of a line fitting points describing cricket chirps per minute per temperature:

Google's example of a line fitting cricket chirps by temperature

The “linear” in “linear regression” refers to this straight line, as in linear equation. The “regression” refers to “regression to the mean”, which is a statistical observation unfortunately unrelated to statistical methods like the least squares technique described below, as explained humorously by John Seymour.

Math is Fun describes a technique called “least squares regression” for finding such a line. Google’s glossary also has an entry for least squares regression, which gives me confidence that I’m bridging my level (Math is Fun) with the novel concept of ML.

Helpful tip from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: differences are squared so that negative distances don’t cancel out positive distances.

Math is Fun’s article on linear equations and the crash course’s video on linear regression reminded me of the slope-intercept form of a linear equation I learned about way back when: y = mx + b.

The crash course even describes this equation as a “model”: “By convention in machine learning, you’ll write the equation for a model slightly differently …”

All this helps me understand in the most basic sense:

  • A “model” is just an equation
  • “Training” and “learning” are just performing a regression calculation to generate an equation
  • Performing these calculations regularly and on large data sets is tedious and error prone, so we use a computer, hence “machine learning”
  • “Prediction” and “inference” are just plugging x values into the equation


  1. Google Machine Learning Crash Course: “Framing”
  2. Google Machine Learning Crash Course: “Descending into ML”