MLCC: Feature crosses

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Feature Crosses” section.

“Feature cross”, “feature cross product” and “synthetic feature” are synonymous. A feature cross is the cross product of two features. The nonlinearity sub-section states “The term cross comes from cross product.” Thinking of it as a Cartesian product, which the glossary mentions, helps me grok what’s going on, and why it’s helpful for the example problem where examples are clustered by quarter (to consider x-y pairs), and esp the exercise involving latitude and longitude pairs.
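
To make the Cartesian product idea concrete, here is a minimal sketch in plain Python. The bin names are hypothetical and this is not the course's own code; it only shows that crossing two binned features yields one synthetic category per pair, which can then be one-hot encoded.

```python
from itertools import product

# Hypothetical bins for two features (names and values are illustrative only).
lat_bins = ["lat_low", "lat_high"]
lon_bins = ["lon_west", "lon_east"]

# The feature cross is the Cartesian product: one synthetic category per pair.
crossed_vocab = [f"{lat}_x_{lon}" for lat, lon in product(lat_bins, lon_bins)]

def cross_one_hot(lat, lon):
    """Returns a one-hot vector over the crossed categories for one example."""
    key = f"{lat}_x_{lon}"
    return [1 if key == category else 0 for category in crossed_vocab]

print(crossed_vocab)
print(cross_one_hot("lat_high", "lon_west"))
```

Each crossed category gets its own weight, which is what lets a linear model treat each latitude-longitude cell separately.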

The video states “Linear learners use linear models”. What is a “linear model”? Given “model” is synonymous with “equation” or “function”, a “linear model” is a linear equation. For example, Brilliant’s wiki states: “A linear model is an equation …” What is a “linear learner”? The video might just be stating a fact: something that learns using a linear model is a “linear learner”. For example, Amazon SageMaker’s Linear Learner docs state “The algorithm learns a linear function”.

A “linear problem” describes a relationship that can be expressed using a straight line (to divide the input data). “Nonlinear problems” cannot be expressed this way.

While trying to figure out why the exercise used an indicator_column, I found some nice TensorFlow tutorials, eg for feature crosses. In retrospect, I see the indicator_column docs state simply “Represents multi-hot representation of given categorical column.”

MLCC: Representation

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Representation” section.

feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

Scott Locklin, in “Neglected machine learning ideas”, as quoted in Machine Learning Mastery’s feature engineering overview

I’ve heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that’s appropriate and efficient for inputting into a model. The “scrubbing” sub-section actually seems like the only thing that fits what I previously thought of as “cleaning”, eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

  • Good feature values should appear more than 5 or so times in a data set … avoid unique IDs
  • Keep data pure by not encoding exceptional states into a feature’s value type, eg an integer feature where -1 means undefined, aka “magic” values. Instead, use boolean flags for exceptional states.
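
A minimal sketch of the second recommendation, assuming a hypothetical “rooms” feature where -1 means “undefined”. Filling the gap with the mean of the defined values is my assumption for illustration; other fill strategies are possible.

```python
from statistics import mean

# Hypothetical raw values; -1 is a "magic" value meaning "undefined".
raw_rooms = [3, -1, 5, 2]

# Boolean flag captures the exceptional state explicitly.
rooms_defined = [value != -1 for value in raw_rooms]

# Assumption: fill undefined entries with the mean of the defined ones,
# so the numeric feature no longer carries a sentinel.
fill = mean(value for value in raw_rooms if value != -1)
rooms = [value if value != -1 else fill for value in raw_rooms]

print(rooms_defined)
print(rooms)
```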

The “Z score” scales values as follows: scaled = (value - mean) / stdev. Math is Fun has a good explanation for how to derive the standard deviation, but Pandas also provides it trivially in the output from describe.
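
As a quick sanity check of the Z score formula, here is a small sketch using only the standard library. The values are made up; the sample standard deviation is used because that is what the “std” row of describe reports.

```python
from statistics import mean, stdev

# Hypothetical feature values.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mu = mean(values)
# Sample standard deviation (n - 1 denominator), matching the "std" row
# that Pandas' describe() reports.
sigma = stdev(values)

scaled = [(value - mu) / sigma for value in values]
print([round(z, 3) for z in scaled])
```

After scaling, the values have a mean of 0 and a standard deviation of 1.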

“Binning” seems similar to *-hot encoding in that we’re enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.
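
A sketch of how the two fit together, assuming hypothetical bin edges for a continuous feature like latitude: binning turns the continuous value into a discrete index, and one-hot encoding turns that index into a vector with one weight per bin.

```python
from bisect import bisect_right

# Hypothetical bin edges for a continuous feature such as latitude.
edges = [33.0, 35.0, 37.0]  # yields 4 bins: <33, [33,35), [35,37), >=37

def bin_index(value):
    """Maps a continuous value to the index of the bin it falls in."""
    return bisect_right(edges, value)

def one_hot(index, size):
    """Encodes a discrete index as a vector with a single 1."""
    return [1 if i == index else 0 for i in range(size)]

latitude = 34.2
encoded = one_hot(bin_index(latitude), len(edges) + 1)
print(encoded)
```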

Histograms and stats, like those output by describe, can help detect bad data.

MLCC: Generalization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] through [3].

The fundamental tension of machine learning is between fitting our data well, but also fitting the data as simply as possible.

A reasonable guideline: “The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.”

[2] recommends a best-practice: divide labeled examples into “training” and “test” sets.

Never train on test data! 100% accuracy can be a symptom of that.

[3] goes further: divide labeled examples into three sets: “training”, “validation” and “test”. Simply testing against a “test” set risks overfitting to that set. Instead, iterate against the validation set, and then double-check using the test set.

A continuing impression is that TensorFlow builds in a lot of the best-practices described in this crash course. For example, splitting out a validation set and testing against it is a first-class argument to the method.

The exercise associated with [3] is interesting. First, testing against a validation set caught a bug! Second, the bug was a default sort on the latitude column; the validation set was not a random sample.
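
A minimal sketch of the fix, using only the standard library rather than the exercise’s own code: shuffle before splitting, so each set is a random sample. The 70/15/15 proportions and the example data are assumptions for illustration.

```python
import random

# Hypothetical examples sorted by latitude, as in the buggy exercise.
examples = [{"latitude": 32.0 + 0.1 * i} for i in range(100)]

# Shuffle before splitting so each set is a random sample.
# The seed is fixed only to make the sketch reproducible.
rng = random.Random(42)
shuffled = examples[:]
rng.shuffle(shuffled)

# Assumed 70/15/15 split; the proportions are illustrative.
train = shuffled[:70]
validation = shuffled[70:85]
test = shuffled[85:]

print(len(train), len(validation), len(test))
```

Without the shuffle, the validation slice would cover a narrow band of latitudes that the training slice never saw.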


  1. Google Machine Learning Crash Course: “Generalization”
  2. Google Machine Learning Crash Course: “Training and Test Sets”
  3. Google Machine Learning Crash Course: “Validation Set”

MLCC: First Steps with TensorFlow

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

[2] introduces Colab, NumPy, Pandas and TensorFlow.

Colab is like a hosted Jupyter notebook and provides an easy way to play with Python ML libraries, among other things.

NumPy provides performant and user-friendly collections and operations for linear algebra.

Pandas provides tools for working with “dataframes”, which are like spreadsheets in memory.

Digression into Google Sheets

I like building on my understanding. In this context, I want to learn Colab and NumPy by using them to work with the cricket chirp data introduced in [1].

[1] used cricket chirps per minute per temperature as an example, but didn’t provide raw data. Dolbear’s Law provides an equation we can use to generate data, where T_C is the temperature in °C and N_60 is the number of chirps per minute: T_C = 10 + (N_60 – 40) / 7 → N_60 = 7 * T_C – 30

Colab and NumPy provide an easy way to use this equation:

import numpy as np

# Starts by generating temp, since chirps are dependent on temp.
# Starts at 5 because Dolbear’s formula results in a negative value below 5 degrees
temps = np.arange(5,36)

# Adds noise to avoid an obviously linear relationship.
# Copies the approach from “NumPy UltraQuick Tutorial“ linked from [2].
# Sets low of -5, which limits the minimum chirps to zero.
# Sizes the noise to match temps (31 values), so the arrays can be added.
noise = np.random.randint(low=-5, high=5, size=temps.size)
chirps = 7 * temps - 30 + noise

# Prints CSVs, since Google Sheets knows how to split CSVs on paste.
print(','.join([str(i) for i in temps]))
print(','.join([str(i) for i in chirps]))

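Example chirps per minute:

7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215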


Note this generates synthetic data for chirps per minute, but then I’ll use them to predict temperature, ie chirps is the feature and temperature is the label.

Copy the temps and chirps CSVs. In Sheets, Edit > paste special > paste comma-separated text (CSV) as columns.

To improve readability, cut the pasted content and Edit > paste special > paste transposed to convert row data to column data.

Add column headers, select everything and then Insert > Chart.

Select “Scatter chart” for the chart type. Under Customize > Series, check the trendline box. Select “Equation” for the label to get the regression equation. Check the R2 box.

We can also use the SLOPE and INTERCEPT functions to calculate the equation.

Slope, intercept and R2, respectively, given the example chirps per minute from above:

  • 0.144
  • 4.323
  • 0.999

Unfortunately, Sheets doesn’t have MSE, which I learned about in [1], which leads me to wonder, “What’s the relationship between R2 and MSE?” Per [3], we’re better off with MSE.

Digression into SciKit

[2] introduces Pandas after NumPy, but continuing the theme of building on understanding, I’d like to perform a linear regression in Colab, rather than copy-pasting into Sheets. I’ll follow [4] and [5] and defer Pandas until I need it for TensorFlow.

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

actual_temps = np.arange(5,36)
chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215])

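# Fits temperature (label) against chirps (feature).
# scikit-learn expects a 2D feature array, hence np.newaxis.
model = LinearRegression().fit(chirps[:, np.newaxis], actual_temps)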

predicted_temps = model.predict(chirps[:, np.newaxis])

plt.scatter(chirps, actual_temps)
plt.plot(chirps, predicted_temps)

# Starts the y-axis at zero, even though the data starts at 5
plt.ylim(bottom=0)

# coef_ is an array with one entry per feature, so take the first.
print('Slope: %.3f' % model.coef_[0])
print('Intercept: %.3f' % model.intercept_)
print('MSE: %.3f' % mean_squared_error(actual_temps, predicted_temps))
print('R2: %.3f' % r2_score(actual_temps, predicted_temps))

Slope, intercept, MSE and R2, respectively:

  • 0.144
  • 4.323
  • 0.085
  • 0.999

Note SciKit can calculate MSE and R2. Perhaps in line with [3], note MSE is non-zero, but R2 close to 100% 🤔 
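
On the earlier question about how R2 and MSE relate: as far as I can tell, R2 = 1 - MSE / Var(y), where Var(y) is the population variance of the actual labels. A quick check, plugging in the rounded MSE reported above (so the result is approximate):

```python
from statistics import pvariance

actual_temps = list(range(5, 36))
mse = 0.085  # rounded value reported above

# R2 compares the model's squared error with that of always predicting the mean.
variance = pvariance(actual_temps)
r2 = 1 - mse / variance

print('Variance: %.1f' % variance)
print('R2: %.3f' % r2)
```

So a non-zero MSE and an R2 near 100% are consistent: the error is small relative to the spread of the temperatures.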

As expected, Sheets is great for common stuff, but Colab/Jupyter shines for arbitrary calculation.


Coincidentally, TensorFlow’s fifth birthday was just a couple days ago 🥳

Continuing the theme of building on experience, I’m using the cricket chirp data for the synthetic exercise:

my_feature = ([float(i) for i in [7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215]])
my_label   = ([float(i) for i in range(5,36)])

The following settings enabled the cricket chirp data to converge with an RMSE ~ 0.8, which seems like a sweet spot of accuracy vs training time:

  • Learning rate: 0.01
  • Epochs: 50
  • Batch size: 1

Decreasing the learning rate (eg 0.001) and increasing the epochs (eg 500) converges with an RMSE ~0.5, but takes forever. Increasing the batch size increases choppiness of the error tail.

The summary at the bottom of the synthetic data exercise seems generally useful:

  • “Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.
  • If the training loss does not converge, train for more epochs.
  • If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
  • If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
  • Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
  • Setting the batch size to a very small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
  • For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you’ll need to reduce the batch size to enable a batch to fit into memory.”

For the real data, there’s a note about the “max” being anomalous relative to the different percentiles, which makes sense, but is a little abstract. The plot does a good job showing outliers.

Interesting that the RMSE for the real data is ~100, rather than the zero I was going for with the synthetic data. I guess the point is that we’re trying to minimize loss, rather than eliminate it.

[2] uses California housing data, but we can browse other datasets at

Great tip to use corr to see which features correlate with a label, as an alternative to trial and error hyperparameter tuning.
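
A small sketch of the corr tip, assuming Pandas is available. The miniature dataset and its column names are made up for illustration; it is not the California housing data.

```python
import pandas as pd

# Hypothetical miniature dataset; column names and values are illustrative only.
df = pd.DataFrame({
    "rooms": [2, 3, 4, 5, 6],
    "age": [40, 10, 35, 5, 20],
    "price": [100, 155, 195, 260, 300],
})

# Pearson correlation of every column against the label column.
correlations = df.corr()["price"].drop("price")
print(correlations.sort_values(ascending=False))
```

Features whose correlation with the label is close to 1 or -1 are the more promising candidates to try first.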


  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “First Steps with TensorFlow”
  3. University of Virginia Library: “Is R-squared Useless?”
  4. Python Data Science Handbook: “In Depth: Linear Regression” excerpt
  5. SciKit: “Linear Regression Example”

MLCC: Gradient descent

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simplistic linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.

[3] clarifies linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.

[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that we’re now focusing on loss functions. (To be fair, in retrospect, the title is “Reducing loss”🤦‍♂️) [2] asserts “For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex”, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean Squared Error is a quadratic function of the weights (the error is squared), which is where the parabola comes from. This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.

I like the summary statement from [5]:

The goal of any Machine Learning Algorithm is to minimize the Cost Function

[5] uses the interactive exercise from [2]. It’s reassuring to see convergence 😉

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.

[4] (like [2]) describes three forms of GD:

  1. Batch
  2. Stochastic
  3. Mini Batch

[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we’ve assumed that the batch has been the entire data set.”

[2] describes Stochastic as picking one example at random for each iteration, which would take forever and may operate on redundant data, which is common in large data sets.

[2] states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.

When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”

To summarize:

  1. Initialize with arbitrary weights
  2. Generate a model
  3. Sample (labeled) examples
  4. Input sample into the model
  5. Calculate the loss
  6. Compare the new loss with the previous loss
  7. If loss is decreasing
    1. Add the step value to the weight
    2. Repeat from step 2
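
The summary above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the course’s implementation: the toy data comes from y = 2x + 1, the whole data set is used on every iteration (Batch GD) instead of sampling, and each update subtracts the learning rate times the gradient, stopping when the loss stops changing.

```python
# Toy labeled examples drawn from y = 2x + 1 (an assumption for illustration).
features = [0.0, 1.0, 2.0, 3.0, 4.0]
labels = [2.0 * x + 1.0 for x in features]

def mse(weight, bias):
    """Mean squared error of the line y' = weight * x + bias."""
    return sum((weight * x + bias - y) ** 2
               for x, y in zip(features, labels)) / len(features)

def gradient(weight, bias):
    """Partial derivatives of the MSE with respect to weight and bias."""
    n = len(features)
    d_weight = sum(2 * (weight * x + bias - y) * x
                   for x, y in zip(features, labels)) / n
    d_bias = sum(2 * (weight * x + bias - y)
                 for x, y in zip(features, labels)) / n
    return d_weight, d_bias

weight, bias = 0.0, 0.0   # arbitrary starting weights
learning_rate = 0.05
previous_loss = mse(weight, bias)

for step in range(10_000):
    d_weight, d_bias = gradient(weight, bias)
    # Step against the gradient, since we want the loss to go down.
    weight -= learning_rate * d_weight
    bias -= learning_rate * d_bias
    loss = mse(weight, bias)
    # Converged: the loss has (almost) stopped changing.
    if abs(previous_loss - loss) < 1e-12:
        break
    previous_loss = loss

print('weight=%.3f bias=%.3f loss=%.6f steps=%d' % (weight, bias, loss, step + 1))
```

Swapping the full data set for a random sample on each iteration would turn this into Stochastic or Mini Batch GD.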


  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “Reducing loss”
  3. Machine Learning Mastery: “Linear Regression for Machine Learning”
  4. Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent — from scratch”
  5. Towards Data Science: “Understanding the Mathematics behind Gradient Descent”

MLCC: Linear regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] and [2].

A lot of ML quickstarts dive right into jargon like model, feature, y’, L2, etc, which makes it hard for me to learn the basics – “what are we doing and why?”

The crash course also presents some jargon, but at least explains each concept and links to a glossary, which makes it easier to learn.

After a few days of poking around, one piece of jargon seems irreducible: linear regression. In other words, this is the kind of basic ML concept I’ve been looking for. This is where I’d start if I was helping someone learn ML.

I probably learned about linear regression in the one statistics class I took in college, but have forgotten about it after years of string parsing 🙂

The glossary entry for linear regression describes it as “Using the raw output (y’) of a linear model as the actual prediction in a regression model”, which is still too dense for me.

The linear regression module of the crash course is closer to my level:

Linear regression is a method for finding the straight line … that best fits a set of points.

The crash course provides a good example of a line fitting points describing cricket chirps per minute per temperature:

Google's example of a line fitting cricket chirps by temperature

The “linear” in “linear regression” refers to this straight line, as in linear equation. The “regression” refers to “regression to the mean”, which is a statistical observation unfortunately unrelated to statistical methods like the least squares technique described below, as explained humorously by John Seymour.

Math is Fun describes a technique called “least squares regression” for finding such a line. Google’s glossary also has an entry for least squares regression, which gives me confidence that I’m bridging my level (Math is Fun) with the novel concept of ML.

Helpful tip from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: differences are squared so that negative distances don’t cancel out positive distances.
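
A tiny illustration of that tip, with made-up differences between actual and predicted values:

```python
# Hypothetical residuals (actual minus predicted) for a poorly fitting line.
residuals = [3.0, -3.0, 2.0, -2.0]

# Raw differences cancel out, which would wrongly suggest a perfect fit.
print(sum(residuals))

# Squared differences are all non-negative, so every miss counts.
print(sum(r ** 2 for r in residuals))
```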

Math is Fun’s article on linear equations and the crash course’s video on linear regression reminded me of the slope-intercept form of a linear equation I learned about way back when: y = mx + b.

The crash course even describes this equation as a “model”: “By convention in machine learning, you’ll write the equation for a model slightly differently …”

All this helps me understand in the most basic sense:

  • A “model” is just an equation
  • “Training” and “learning” are just performing a regression calculation to generate an equation
  • Performing these calculations regularly and on large data sets is tedious and error prone, so we use a computer, hence “machine learning”
  • “Prediction” and “inference” are just plugging x values into the equation
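
The bullets above can be sketched in a few lines of plain Python. This is only an illustration: the function names “train” and “predict” are mine, the points are made up, and the fit uses the standard closed-form least squares formulas for a single feature.

```python
def train(xs, ys):
    """Least squares fit: returns slope m and intercept b of y = mx + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return m, b

def predict(model, x):
    """Plugs x into the learned equation."""
    m, b = model
    return m * x + b

# Hypothetical points that lie exactly on y = 2x + 1.
model = train([1, 2, 3, 4], [3, 5, 7, 9])
print(model)
print(predict(model, 10))
```

Here the “model” is literally the pair (m, b), and “inference” is one multiplication and one addition.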


  1. Google Machine Learning Crash Course: “Framing”
  2. Google Machine Learning Crash Course: “Descending into ML”

Bias to the server-side

Problem statement

I was recently working on a support issue, which had client- and server-side aspects. Complicating the issue, we only had partial visibility into server-side health and no visibility into client-side health. It was hard to even tell where to start investigating. We were also working closely with a partner, who could give us some visibility, but with high coordination cost.

One approach was to create client visibility for us and the partner, but this would take time to roll out, didn’t immediately reduce the coordination cost and risked fatiguing the partner.

An alternative approach was to increase the server-side visibility. We took this approach because we could start investigating immediately (no coordination cost or roll out latency). We might even be able to resolve the issue without requiring the partner to do any work. Also, having more visibility and confidence in the server-side would help if/when we do need to make client-side changes.


So, my takeaway is simple: when faced with a choice between server- and client-side options, bias toward server-side.

From a product perspective, any work required of customers is friction to adoption. From a technical perspective, it’s much easier to change servers than clients.

The value of a dashboard

Problem statement

A project I’m familiar with recently had a series of issues. Each issue was investigated somewhat independently. It was hard to share common code, share data across roles and track progress over time.


  1. Capture canonical queries in version control
  2. Periodically run queries, persist and visualize output (aka ETL)
  3. At a higher level, invest in tooling to facilitate such dashboard creation

The end result is much more awareness of the underlying data. Folks in different roles can see the data and ask questions, which often improves the quality of analysis. For example, we now review the dashboard weekly and look for changes as we roll out fixes. Because we now have a pipeline, we can also run different data sources through it to check the analysis.

Resume guidance

I’ve recently been reviewing student resumes as part of a university recruiting program and a few best-practices stand out.

Google has a good video on resume formatting that describes most of these practices. In these cases, I mention a time in the video.

Ok. Here are the best-practices:

  • Recommended resume format at ~2:30
  • Resumes are read in the context of a job description (example). Companies hiring for a job read a resume to see if you have the training and/or experience to do that job, so
    • You can briefly describe what the project was, but invest most space describing what you did and how you did it. Google recommends a phrasing at ~6:00: “Accomplished [x] as measured by [y] doing [z]”
    • Bold key words to highlight your toolkit (~5:25), eg “backend engineer”, “python”, etc, so the reader can match the resume to the role at a glance.
  • It takes a lot of effort to tune a resume for all applications, but may be worth it for the 1-2 jobs you really want
  • Ideally, your resume tells a story, eg some Python in year 1, more Python in year 2, created a service using Python in year 3 –> this person has experience in Python and is using it to start exploring service eng
  • Submitting a resume may feel like an impersonal process, but it’s just people on the other side looking for new teammates. That’ll be you after you’re hired 🙂 The points above try to make it easy for that person to see you’re a great fit for the job.


This is an organizational pattern I like:

  • 2-5 ppl
  • Cross-functional
  • Focused on a specific goal
  • Weekly demo to squad

I’ve heard this referred to as a “squad”, “swarm”, “e team” and “feature team”.

One of the nice things is the sense of camaraderie from working closely with a small group on a specific goal. Another nice thing is the group dissolves after the goal is accomplished, giving the members closure and chances to try new things without switching teams. Another benefit is broad awareness of how a system works.

This pattern works well in a larger context:

  • Shared code ownership
  • 10-50 ppl
  • Focused on a specific, but larger goal
  • Fortnightly demos from all squads
  • Shared calendar for all squad coordination meetings

When a squad accomplishes its goal, the members dissolve into the larger group. Individuals can learn about work in the larger group by attending the fortnightly demos and/or sitting in on other squads’ coordination meetings.

For example, a product may be supported by several teams. To avoid exposing the org chart in the product, make a large team whose goal is to make that product excellent. Define a squad for each significant work item. All members of the large group are free to contribute code to the product.

An underlying principle is alignment around customer value over specific products or features. Rather than a group of people owning a code base in perpetuity, regardless of the amount of work required, squads form in response to need.

A counter-example would be several teams supporting a product and that product having disjoint features. Another counter-example would be a large team trying to maintain the interest of all its members in weekly coordination meetings. Yet another counter-example would be lots of individuals working in isolation, esp if they’re doing so to avoid coordination co