Google maintains a helpful glossary for ML terms. The ML Crash Course and the TensorFlow docs link to this glossary.

# Google’s ML best-practices

Google maintains a long list of best practices for ML engineering.

# MLCC: Feature crosses

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Feature Crosses” section.

“Feature cross” and “feature cross product” are synonymous, and a feature cross is one kind of “synthetic feature”. A feature cross is the cross product of two features. The nonlinearity sub-section states “The term cross comes from cross product.” Thinking of it as a Cartesian product, which the glossary mentions, helps me grok what’s going on, and why it’s helpful for the example problem where examples are clustered by quadrant (to consider x-y pairs), and especially the exercise involving latitude and longitude pairs.
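To make the Cartesian-product idea concrete, here is a minimal sketch in plain Python. The bin labels and the `cross` helper are hypothetical names made up for illustration; this is not how the crash course or TensorFlow implements a feature cross.

```python
from itertools import product

# Hypothetical binned features: each example falls into one latitude bin
# and one longitude bin.
lat_bins = ["lat_low", "lat_high"]
lon_bins = ["lon_west", "lon_east"]

# The feature cross has one value per (lat, lon) pair: the Cartesian product.
crossed_vocabulary = list(product(lat_bins, lon_bins))

def cross(lat_bin, lon_bin):
    """One-hot encode a (lat, lon) pair against the crossed vocabulary."""
    return [1 if pair == (lat_bin, lon_bin) else 0 for pair in crossed_vocabulary]

print(crossed_vocabulary)
print(cross("lat_high", "lon_west"))
```

Two bins per feature give four crossed values, so a model can learn a separate weight for each latitude-longitude cell rather than one weight per latitude bin and one per longitude bin.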

The video states “Linear learners use linear models”. What is a “linear model”? Given “model” is synonymous with “equation” or “function”, a “linear model” is a linear equation. For example, Brilliant’s wiki states: “A linear model is an equation …” What is a “linear learner”? The video might just be stating a fact: something that learns using a linear model is a “linear learner”. For example, Amazon SageMaker’s Linear Learner docs state “The algorithm learns a linear function”.

A “linear problem” describes a relationship that can be expressed using a straight line (to divide the input data). “Nonlinear problems” cannot be expressed this way.
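A small sketch of why a cross helps with a nonlinear problem, assuming made-up points where the label is 1 when x and y share a sign. No single straight line through the x-y plane separates those quadrants, but the synthetic feature `x * y` separates them with a threshold at zero. The function name is hypothetical.

```python
# Made-up points: the label is 1 when x and y share a sign (quadrants I and III).
points = [(1.0, 2.0), (-1.5, -0.5), (2.0, -1.0), (-0.5, 3.0)]
labels = [1, 1, 0, 0]

def predict_with_cross(x, y):
    """Classify using only the synthetic feature x * y and a threshold of zero."""
    return 1 if x * y > 0 else 0

predictions = [predict_with_cross(x, y) for x, y in points]
print(predictions)
```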

While trying to figure out why the exercise used an `indicator_column`, I found some nice TensorFlow tutorials, eg for feature crosses. In retrospect, I see the `indicator_column` docs state simply “Represents multi-hot representation of given categorical column.”
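For intuition about what “multi-hot” means, here is a sketch that does not use TensorFlow at all. The `multi_hot` helper and the vocabulary are hypothetical, and this is not the `indicator_column` implementation, just the encoding idea.

```python
def multi_hot(values, vocabulary):
    """Return a vector with a 1 for every vocabulary entry present in values."""
    present = set(values)
    return [1 if item in present else 0 for item in vocabulary]

vocabulary = ["red", "green", "blue"]
print(multi_hot(["blue"], vocabulary))         # a single value: one-hot
print(multi_hot(["red", "blue"], vocabulary))  # several values: multi-hot
```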

# MLCC: Representation

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Representation” section.

feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

Scott Locklin, in “Neglected machine learning ideas”, as quoted in Machine Learning Mastery’s feature engineering overview

I’ve heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that’s appropriate and efficient for inputting into a model. The “scrubbing” sub-section actually seems like the only thing that fits what I previously thought of as “cleaning”, eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

- Good feature values should appear more than 5 or so times in a data set … avoid unique IDs
- Keep data pure by not encoding exceptional states into a feature’s value type, eg an integer feature where -1 means undefined, aka “magic” values. Instead, use boolean flags for exceptional states.
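A minimal sketch of the “magic value” recommendation, using made-up ratings where -1 means undefined. The variable names are hypothetical, and replacing the magic value with 0 is an arbitrary placeholder choice for illustration.

```python
# Raw values where -1 is a "magic" value meaning "undefined" (made-up data).
raw_ratings = [4, -1, 5, 3, -1]

# Split into a boolean flag for the exceptional state plus a clean numeric feature.
rating_defined = [value != -1 for value in raw_ratings]
rating = [value if value != -1 else 0 for value in raw_ratings]

print(rating_defined)
print(rating)
```

With the flag split out, the model no longer has to learn that -1 is not “one less than zero”.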

The “Z score” scales values as follows: `scaled = (value - mean) / stdev`. Math is Fun has a good explanation for how to derive the standard deviation, but Pandas also provides it trivially in the output from `describe`.
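As a minimal sketch of the Z score calculation, using only Python’s standard library and made-up values. Note the assumption: `statistics.pstdev` is the population standard deviation, whereas the `std` row reported by Pandas’ `describe` is the sample standard deviation, so the two can differ slightly.

```python
import statistics

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample

mean = statistics.fmean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# Z score: how many standard deviations each value sits from the mean.
z_scores = [(value - mean) / stdev for value in values]
print(mean, stdev)
print(z_scores)
```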

“Binning” seems similar to *-hot encoding in that we’re enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.
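To illustrate the similarity, here is a sketch that bins a continuous value and then one-hot encodes the bin. The boundaries and the `bin_one_hot` helper are hypothetical, chosen only for the example.

```python
from bisect import bisect_right

# Hypothetical bin boundaries for a continuous latitude feature.
# They yield 4 bins: below 33, [33, 35), [35, 37) and 37 or above.
boundaries = [33.0, 35.0, 37.0]

def bin_one_hot(value):
    """Map a continuous value to a one-hot vector over the bins."""
    index = bisect_right(boundaries, value)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]

print(bin_one_hot(34.2))
print(bin_one_hot(38.5))
```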

Histograms and stats, like those output by `describe`, can help detect bad data.

# MLCC: Generalization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] through [3].

The fundamental tension of machine learning is between fitting our data well, but also fitting the data as simply as possible.

A reasonable guideline: “The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.”

[2] recommends a best-practice: divide labeled examples into “training” and “test” sets.

Never train on test data! 100% accuracy can be a symptom of that.

[3] goes further: divide labeled examples into three sets: “training”, “validation” and “test”. Simply testing against a “test” set risks overfitting to that set. Instead, iterate against the validation set, and then double-check using the test set.

A continuing impression is that TensorFlow builds in a lot of the best-practices described in this crash course. For example, splitting out a validation set and testing against it is a first-class argument to the Model.fit method.

The exercise associated with [3] is interesting. First, testing against a validation set caught a bug! Second, the bug was a default sort on the latitude column; the validation set was not a random sample.
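A sketch of a three-way split that shuffles first, which would avoid the sorted-input problem described above. The `split_examples` helper, its fractions and the seed are hypothetical choices, not the exercise’s code.

```python
import random

def split_examples(examples, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle, then split into training, validation and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # guards against sorted input
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    training = shuffled[n_test + n_validation:]
    return training, validation, test

# Made-up examples sorted by a feature, like the latitude-sorted exercise data.
examples = list(range(100))
training, validation, test = split_examples(examples)
print(len(training), len(validation), len(test))
```

Without the shuffle, the validation set would be a contiguous slice of the sorted data rather than a random sample.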

## References

# MLCC: First Steps with TensorFlow

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

[2] introduces Colab, NumPy, Pandas and TensorFlow.

Colab is like a hosted Jupyter notebook and provides an easy way to play with Python ML libraries, among other things.

NumPy provides performant and user-friendly collections and operations for linear algebra.

Pandas provides tools for working with “dataframes”, which are like spreadsheets in memory.

## Digression into Google Sheets

I like building on my understanding. In this context, I want to learn Colab and NumPy by using them to work with the cricket chirp data introduced in [1].

[1] used cricket chirps per minute per temperature as an example, but didn’t provide raw data. Dolbear’s Law provides an equation we can use to generate data: T_{C} = 10 + (N_{60} – 40) / 7 ⇒ N_{60} = 7 * T_{C} – 30

Colab and NumPy provide an easy way to use this equation:

```
import numpy as np
# Starts by generating temp, since chirps are dependent on temp.
# Starts at 5 because Dolbear's formula results in a negative value below 5 degrees
temps = np.arange(5,36)
# Adds noise to avoid an obviously linear relationship.
# Copies the approach from "NumPy UltraQuick Tutorial" linked from [2].
# Sets low of -5, which limits the minimum chirps to zero.
# Sizes the noise to match temps (31 values) so the two arrays can be added.
noise = np.random.randint(low=-5, high=5, size=temps.size)
chirps = 7 * temps - 30 + noise
# Prints CSVs, since Google Sheets knows how to split CSVs on paste.
print(','.join([str(i) for i in temps]))
print(','.join([str(i) for i in chirps]))
```

Example chirps per minute:

7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215

Note this generates synthetic data for chirps per minute, but then I’ll use them to predict temperature, ie chirps is the feature and temperature is the label.

Copy the temps and chirps CSVs. In Sheets, *Edit > paste special > paste comma-separated text (CSV) as columns*.

To improve readability, cut the pasted content and *Edit > paste special > paste transposed* to convert row data to column data.

Add column headers, select everything and then *Insert > Chart*.

Select “Scatter chart” for the chart type. Under *Customize > Series*, check the trendline box. Select “Equation” for the label to get the regression equation. Check the R^{2} box.

We can also use the SLOPE and INTERCEPT functions to calculate the equation.

Slope, intercept and R^{2}, respectively, given the example chirps per minute from above:

- 0.144
- 4.323
- 0.999

Unfortunately, Sheets doesn’t have MSE, which I learned about in [1], which leads me to wonder, “What’s the relationship between R^{2} and MSE?” Per [3], we’re better off with MSE.
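One way to see the relationship is that R squared equals 1 minus MSE divided by the (population) variance of the label, so it is MSE rescaled into a unitless score. The sketch below checks that on the example chirps, assuming a least-squares line computed by hand with the standard library; the variable names are my own.

```python
import statistics

temps = list(range(5, 36))
chirps = [7, 13, 15, 27, 31, 38, 45, 57, 57, 67, 76, 85, 89, 94, 100, 109,
          116, 120, 131, 134, 144, 149, 158, 165, 170, 176, 187, 189, 197,
          208, 215]

# Least-squares line predicting temperature (label) from chirps (feature).
mean_x = statistics.fmean(chirps)
mean_y = statistics.fmean(temps)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(chirps, temps))
         / sum((x - mean_x) ** 2 for x in chirps))
intercept = mean_y - slope * mean_x
predicted = [slope * x + intercept for x in chirps]

# MSE is in squared label units; dividing by the label variance removes the units.
mse = statistics.fmean([(y - p) ** 2 for y, p in zip(temps, predicted)])
variance = statistics.pvariance(temps)  # mean squared distance from the mean
r_squared = 1 - mse / variance

print(f"MSE: {mse:.3f}  variance: {variance:.3f}  R^2: {r_squared:.3f}")
```

Because the temperatures span a wide range (large variance), even a visibly non-zero MSE leaves R squared very close to 1.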

## Digression into SciKit

[2] introduces Pandas after NumPy, but continuing the theme of building on understanding, I’d like to perform a linear regression in Colab, rather than copy-pasting into Sheets. I’ll follow [4] and [5] and defer Pandas until I need it for TensorFlow.

```
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
actual_temps = np.arange(5,36)
chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215])
# Uses the imported class directly; the linear_model module itself is not imported.
model = LinearRegression()
# Reshapes chirps into a column, since fit expects a 2D array of features.
model.fit(chirps[:, np.newaxis], actual_temps)
predicted_temps = model.predict(chirps[:, np.newaxis])
plt.scatter(chirps, actual_temps)
plt.plot(chirps, predicted_temps)
# Starts the y-axis at zero, even though the data starts at 5
plt.ylim(0)
# coef_ holds one entry per feature, so index the single slope.
print('Slope: %.3f' % model.coef_[0])
print('Intercept: %.3f' % model.intercept_)
print('MSE: %.3f' % mean_squared_error(actual_temps, predicted_temps))
print('R2: %.3f' % r2_score(actual_temps, predicted_temps))
```

Slope, intercept, MSE and R2, respectively:

- 0.144
- 4.323
- 0.085
- 0.999

Note SciKit can calculate MSE and R2. Perhaps in line with [3], note MSE is non-zero, but R2 is close to 100% 🤔

As expected, Sheets is great for common stuff, but Colab/Jupyter shines for arbitrary calculation.

## TensorFlow

Coincidentally, TensorFlow’s fifth birthday was just a couple days ago 🥳

Continuing the theme of building on experience, I’m using the cricket chirp data for the synthetic exercise:

```
my_feature = ([float(i) for i in [7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215]])
my_label = ([float(i) for i in range(5,36)])
```

The following settings enabled the cricket chirp data to converge with an RMSE ~ 0.8, which seems like a sweet spot of accuracy vs training time:

- Learning rate: 0.01
- Epochs: 50
- Batch size: 1

Decreasing the learning rate (eg 0.001) and increasing the epochs (eg 500) converges with an RMSE ~0.5, but takes forever. Increasing the batch size increases choppiness of the error tail.

The summary at the bottom of the synthetic data exercise seems generally useful:

- “Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.
- If the training loss does not converge, train for more epochs.
- If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
- If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
- Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
- Setting the batch size to a *very* small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
- For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you’ll need to reduce the batch size to enable a batch to fit into memory.”
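The learning-rate advice in the summary above can be seen on a toy problem. This sketch assumes a made-up one-weight loss, (w - 3)^2, rather than the exercise’s model; `descend` is a hypothetical helper.

```python
def descend(learning_rate, steps=20, start=0.0):
    """Minimize the toy loss (w - 3)^2 and return the loss after each step."""
    w = start
    losses = []
    for _ in range(steps):
        gradient = 2 * (w - 3)         # derivative of (w - 3)^2
        w -= learning_rate * gradient  # step against the gradient
        losses.append((w - 3) ** 2)
    return losses

converging = descend(learning_rate=0.1)
diverging = descend(learning_rate=1.1)
print(converging[0], converging[-1])
print(diverging[0], diverging[-1])
```

With the smaller rate the loss shrinks on every step; with the larger rate each step overshoots the minimum by more than the last, so the loss grows instead of converging.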

For the real data, there’s a note about the “max” being anomalous relative to the different percentiles, which makes sense, but is a little abstract. The plot does a good job showing outliers.

Interesting that the RMSE for the real data is ~100, rather than the zero I was going for with the synthetic data. I guess the point is that we’re trying to minimize loss, rather than eliminate it.

[2] uses California housing data, but we can browse other datasets at https://datasetsearch.research.google.com/.

Great tip to use corr to see which features correlate with a label, as an alternative to trial and error hyperparameter tuning.
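For intuition about what a correlation matrix reports, here is a hand-rolled Pearson correlation (the default method for Pandas’ `corr`) on made-up columns. The `pearson` helper and the data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Made-up columns: one feature tracks the label, the other does not.
label = [1.0, 2.0, 3.0, 4.0, 5.0]
useful_feature = [2.1, 3.9, 6.2, 8.0, 9.8]
weak_feature = [5.0, 1.0, 4.0, 1.0, 5.0]

print(pearson(useful_feature, label))
print(pearson(weak_feature, label))
```

A value near 1 or -1 suggests a feature worth trying first; a value near 0 suggests little linear relationship with the label.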

## References

# MLCC: Gradient descent

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simplistic linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), **does not scale**.

[3] clarifies linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.

[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that **we’re now focusing on loss functions**. (To be fair, in retrospect, the title is “Reducing loss” 🤦‍♂️) [2] asserts “For the kind of regression problems we've been examining, the resulting plot of loss vs. w_{1} will always be convex”, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean *Squared* Error involves an exponent: it is a quadratic function of the weights, which is why its plot is a parabola. This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.
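As a sketch of what “the derivative of the loss function” looks like for Mean Squared Error, assume a one-feature model y' = w_1 x + b over N labeled examples (x_i, y_i). The symbols N, x_i, y_i and b are my notation for illustration, not taken from [2] or [5].

```latex
\begin{align*}
\mathrm{MSE}(w_1, b) &= \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (w_1 x_i + b)\bigr)^2 \\
\frac{\partial\,\mathrm{MSE}}{\partial w_1} &= -\frac{2}{N} \sum_{i=1}^{N} x_i \bigl(y_i - (w_1 x_i + b)\bigr) \\
\frac{\partial\,\mathrm{MSE}}{\partial b} &= -\frac{2}{N} \sum_{i=1}^{N} \bigl(y_i - (w_1 x_i + b)\bigr)
\end{align*}
```

The loss is quadratic in w_1 and b, so each partial derivative is linear in them, and the gradient is zero only at the bottom of the bowl.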

I like the summary statement from [5]:

The goal of any Machine Learning Algorithm is to minimize the Cost Function

[5] uses the interactive exercise from [2]. It’s reassuring to see convergence.

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”.

[4] (like [2]) describes three forms of GD:

- Batch
- Stochastic
- Mini Batch

[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we've assumed that the batch has been the entire data set.”

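[2] describes Stochastic GD (SGD) as picking one example at random for each iteration, which works given enough iterations but is very noisy. Full-batch iterations, by contrast, can take a very long time on a large data set and may operate on redundant data, which is common in large data sets.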

[2] states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.
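A sketch of how the three forms differ only in batch size. The `mini_batches` helper is hypothetical, and shuffling once and then slicing is one common way to draw random batches, assumed here for simplicity rather than taken from [2].

```python
import random

def mini_batches(examples, batch_size, seed=0):
    """Yield shuffled mini-batches that together cover every example once."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]

examples = list(range(10))  # stand-ins for labeled examples
batches = list(mini_batches(examples, batch_size=4))
print([len(batch) for batch in batches])
```

Setting `batch_size=1` gives the stochastic form, and `batch_size=len(examples)` gives the full-batch form.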

When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”

To summarize:

1. Initialize with arbitrary weights
2. Generate a model
3. Sample (labeled) examples
4. Input sample into the model
5. Calculate the loss
6. Compare the new loss with the previous loss
7. If loss is decreasing:
   - Add the step value to the weight
   - Repeat from step 2
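The summary above can be sketched in plain Python. This is a minimal full-batch version under my own assumptions: a one-feature model y = w * x + b, MSE loss, made-up noise-free data, and a step equal to the negative gradient times the learning rate. The function name, learning rate and tolerance are hypothetical choices.

```python
def gradient_descent(xs, ys, learning_rate=0.01, tolerance=1e-12, max_steps=10_000):
    """Fit y = w * x + b with full-batch gradient descent on MSE."""
    w, b = 0.0, 0.0                     # arbitrary initial weights
    n = len(xs)
    previous_loss = float("inf")
    for _ in range(max_steps):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if previous_loss - loss < tolerance:  # loss stopped changing: converged
            break
        previous_loss = loss
        # Gradients of MSE with respect to w and b.
        grad_w = -2 / n * sum(x * e for x, e in zip(xs, errors))
        grad_b = -2 / n * sum(errors)
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, loss

xs = [float(x) for x in range(10)]
ys = [2 * x + 1 for x in xs]            # made-up data on the line y = 2x + 1
w, b, loss = gradient_descent(xs, ys)
print(f"w={w:.3f} b={b:.3f} loss={loss:.2e}")
```

The loop stops when the loss changes by less than the tolerance, which mirrors the “iterate until overall loss stops changing” description of convergence.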

## References

1. Google Machine Learning Crash Course: “Descending into ML”
2. Google Machine Learning Crash Course: “Reducing loss”
3. Machine Learning Mastery: “Linear Regression for Machine Learning”
4. Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent – from scratch”
5. Towards Data Science: “Understanding the Mathematics behind Gradient Descent”

# MLCC: Linear regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] and [2].

A lot of ML quickstarts dive right into jargon like model, feature, y’, L_{2}, etc, which makes it hard for me to learn the basics – “what are we doing and why?”

The crash course also presents some jargon, but at least explains each concept and links to a glossary, which makes it easier to learn.

After a few days of poking around, one piece of jargon seems irreducible: linear regression. In other words, this is the kind of basic ML concept I’ve been looking for. This is where I’d start if I was helping someone learn ML.

I probably learned about linear regression in the one statistics class I took in college, but have forgotten about it after years of string parsing.

The glossary entry for linear regression describes it as “Using the raw output (y’) of a linear model as the actual prediction in a regression model”, which is still too dense for me.

The linear regression module of the crash course is closer to my level:

Linear regression is a method for finding the straight line … that best fits a set of points.

The crash course provides a good example of a line fitting points describing cricket chirps per minute per temperature:

The “linear” in “linear regression” refers to this straight line, as in linear equation. The “regression” refers to “regression to the mean”, which is a statistical observation unfortunately unrelated to statistical methods like the least squares technique described below, as explained humorously by John Seymour.

Math is Fun describes a technique called “least squares regression” for finding such a line. Google’s glossary also has an entry for least squares regression, which gives me confidence that I’m bridging my level (Math is Fun) with the novel concept of ML.

Helpful tip from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: differences are squared so that negative distances don’t cancel out positive distances.
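A tiny illustration of that tip, using made-up signed distances between points and a line:

```python
errors = [2.0, -2.0, 1.0, -1.0]  # made-up signed distances from a line

print(sum(errors))                  # positive and negative errors cancel
print(sum(e ** 2 for e in errors))  # squared errors accumulate
```

The plain sum suggests a perfect fit even though every point misses the line, whereas the sum of squares does not.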

Math is Fun’s article on linear equations and the crash course’s video on linear regression reminded me of the slope-intercept form of a linear equation I learned about way back when: `y = mx + b`.

The crash course even describes this equation as a “model”: “By convention in machine learning, you'll write the equation for a model slightly differently …”

All this helps me understand in the most basic sense:

- A “model” is just an equation
- “Training” and “learning” are just performing a regression calculation to generate an equation
- Performing these calculations regularly and on large data sets is tedious and error prone, so we use a computer, hence “machine learning”
- “Prediction” and “inference” are just plugging x values into the equation
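The list above maps onto a few lines of Python. This is a sketch using the textbook least-squares formulas for slope and intercept on made-up points; the `train` and `predict` names are mine, chosen to echo the jargon.

```python
def train(xs, ys):
    """'Training': compute the least-squares slope and intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(model, x):
    """'Prediction': plug x into the equation y = mx + b."""
    slope, intercept = model
    return slope * x + intercept

# Made-up points that lie exactly on y = 2x + 1.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(model)
print(predict(model, 5.0))
```

Here the “model” is nothing more than the pair (m, b), which is the whole equation.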

## Resources

# Bias to the server-side

## Problem statement

I was recently working on a support issue, which had client- and server-side aspects. Complicating the issue, we only had partial visibility into server-side health and no visibility into client-side health. It was hard to even tell where to start investigating. We were also working closely with a partner, who could give us some visibility, but with high coordination cost.

One approach was to create client visibility for us and the partner, but this would take time to roll out, didn’t immediately reduce the coordination cost and risked fatiguing the partner.

An alternative approach was to increase the server-side visibility. We took this approach because we could start investigating immediately (no coordination cost or roll out latency). We might even be able to resolve the issue without requiring the partner to do any work. Also, having more visibility and confidence in the server-side would help if/when we do need to make client-side changes.

## Solution

So, my takeaway is simple: when faced with a choice between server- and client-side options, bias toward server-side.

From a product perspective, any work required of customers is friction to adoption. From a technical perspective, it’s much easier to change servers than clients.

# The value of a dashboard

## Problem statement

A project I’m familiar with recently had a series of issues. Each issue was investigated somewhat independently. It was hard to share common code, share data across roles and track progress over time.

## Solution

- Capture canonical queries in version control
- Periodically run queries, persist and visualize output (aka ETL)
- At a higher level, invest in tooling to facilitate such dashboard creation
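A minimal sketch of the first two bullets, assuming SQLite as a stand-in for whatever data source and store a real project uses. The table names, query names and `run_snapshot` helper are all hypothetical.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical canonical queries; in practice these would live in version control.
CANONICAL_QUERIES = {
    "open_issues": "SELECT COUNT(*) FROM issues WHERE status = 'open'",
    "closed_issues": "SELECT COUNT(*) FROM issues WHERE status = 'closed'",
}

def run_snapshot(connection):
    """Run every canonical query and persist one timestamped row per metric."""
    now = datetime.now(timezone.utc).isoformat()
    for name, query in CANONICAL_QUERIES.items():
        (value,) = connection.execute(query).fetchone()
        connection.execute(
            "INSERT INTO snapshots (taken_at, metric, value) VALUES (?, ?, ?)",
            (now, name, value),
        )
    connection.commit()

# Made-up source data in an in-memory database, standing in for a real source.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE issues (id INTEGER, status TEXT)")
connection.executemany("INSERT INTO issues VALUES (?, ?)",
                       [(1, "open"), (2, "closed"), (3, "open")])
connection.execute("CREATE TABLE snapshots (taken_at TEXT, metric TEXT, value INTEGER)")

run_snapshot(connection)
rows = connection.execute(
    "SELECT metric, value FROM snapshots ORDER BY metric").fetchall()
print(rows)
```

Running `run_snapshot` on a schedule would build up the time series that a dashboard then visualizes.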

The end result is much more awareness of the underlying data. Folks in different roles can see the data and ask questions, which often improves the quality of analysis. For example, we now review the dashboard weekly and look for changes as we roll out fixes. Because we now have a pipeline, we can also run different data sources through it to check the analysis.