I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

[2] introduces Colab, NumPy, Pandas and TensorFlow.

Colab is like a hosted Jupyter notebook and provides an easy way to play with Python ML libraries, among other things.

NumPy provides performant and user-friendly collections and operations for linear algebra.

Pandas provides tools for working with “dataframes”, which are like spreadsheets in memory.
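As a tiny illustration of the spreadsheet analogy (the data here is made up, not from the course):

```python
import pandas as pd

# A dataframe is a table of named columns, like an in-memory spreadsheet.
df = pd.DataFrame({'temp': [10, 20, 30], 'chirps': [40, 110, 180]})
print(df)
print(df['chirps'].mean())  # column-wise operations, like spreadsheet formulas
```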

## Digression into Google Sheets

I like building on my understanding. In this context, I want to learn Colab and NumPy by using them to work with the cricket chirp data introduced in [1].

[1] used cricket chirps per minute vs temperature as an example, but didn’t provide raw data. Dolbear’s Law provides an equation we can use to generate data: T_{C} = 10 + (N_{60} - 40) / 7 → N_{60} = 7 * T_{C} - 30

Colab and NumPy provide an easy way to use this equation:

```
import numpy as np
# Starts by generating temps, since chirps depend on temp.
# Starts at 5 because Dolbear’s formula goes negative below 5 degrees.
temps = np.arange(5,36)
# Adds noise to avoid an obviously linear relationship.
# Copies the approach from “NumPy UltraQuick Tutorial” linked from [2].
# Sets a low of -5, which limits the minimum chirps to zero.
# size must match temps (31 elements), or the addition below fails to broadcast.
noise = np.random.randint(low=-5, high=5, size=temps.size)
chirps = 7 * temps - 30 + noise
# Prints CSVs, since Google Sheets knows how to split CSVs on paste.
print(','.join([str(i) for i in temps]))
print(','.join([str(i) for i in chirps]))
```

Example chirps per minute:

7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215

Note this generates synthetic data for chirps per minute, but then I’ll use chirps to predict temperature, i.e. chirps is the feature and temperature is the label.

Copy the temps and chirps CSVs. In Sheets, *Edit > paste special > paste comma-separated text (CSV) as columns*.

To improve readability, cut the pasted content and *Edit > paste special > paste transposed* to convert row data to column data.

Add column headers, select everything and then *Insert > Chart*.

Select “Scatter chart” for the chart type. Under *Customize > Series*, check the trendline box. Select “Equation” for the label to get the regression equation. Check the R² box.

We can also use the SLOPE and INTERCEPT methods to calculate the equation.
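The same numbers can be cross-checked back in Colab; here’s a sketch using NumPy’s least-squares `polyfit` on the example data above (SLOPE and INTERCEPT are also ordinary least squares, so the results should agree):

```python
import numpy as np

chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,
                   131,134,144,149,158,165,170,176,187,189,197,208,215])
temps = np.arange(5, 36)

# Fit temperature as a degree-1 polynomial (a line) in chirps.
slope, intercept = np.polyfit(chirps, temps, 1)
print('Slope: %.3f' % slope)          # should match SLOPE above
print('Intercept: %.3f' % intercept)  # should match INTERCEPT above
```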

Slope, intercept and R², respectively, given the example chirps per minute from above:

- 0.144
- 4.323
- 0.999

Unfortunately, Sheets doesn’t have MSE, which I learned about in [1]. That leads me to wonder: what’s the relationship between R² and MSE? Per [3], we’re better off with MSE.
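One way to see the relationship: for a least-squares fit, R² = 1 - MSE / Var(y), i.e. R² is just MSE normalized by the variance of the labels. A tiny numeric check (the values here are made up):

```python
import numpy as np

actual = np.array([5.0, 6.0, 7.0, 8.0])
predicted = np.array([5.1, 5.9, 7.2, 7.8])

mse = np.mean((actual - predicted) ** 2)
# R² is MSE normalized by the variance of the labels.
r2 = 1 - mse / np.var(actual)
print(round(mse, 4), round(r2, 4))  # 0.025 0.98
```

So R² can be near 1 simply because the labels vary a lot, while MSE reports the error in the label’s own units.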

## Digression into SciKit

[2] introduces Pandas after NumPy, but continuing the theme of building on understanding, I’d like to perform a linear regression in Colab, rather than copy-pasting into Sheets. I’ll follow [4] and [5] and defer Pandas until I need it for TensorFlow.

```
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
actual_temps = np.arange(5,36)
chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215])
model = LinearRegression()
# scikit-learn expects a 2D feature matrix, hence the newaxis
model.fit(chirps[:, np.newaxis], actual_temps)
predicted_temps = model.predict(chirps[:, np.newaxis])
plt.scatter(chirps, actual_temps)
plt.plot(chirps, predicted_temps)
# Starts the y-axis at zero, even though the data starts at 5
plt.ylim(0)
print('Slope: %.3f' % model.coef_[0])
print('Intercept: %.3f' % model.intercept_)
print('MSE: %.3f' % mean_squared_error(actual_temps, predicted_temps))
print('R2: %.3f' % r2_score(actual_temps, predicted_temps))
```

Slope, intercept, MSE and R², respectively:

- 0.144
- 4.323
- 0.085
- 0.999

Note SciKit can calculate MSE and R². Perhaps in line with [3], note MSE is non-zero even though R² is nearly 1 🤔

As expected, Sheets is great for common stuff, but Colab/Jupyter shines for arbitrary calculation.

## TensorFlow

Coincidentally, TensorFlow’s fifth birthday was just a couple of days ago 🥳

Continuing the theme of building on experience, I’m using the cricket chirp data for the synthetic exercise:

```
my_feature = [float(i) for i in [7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215]]
my_label = [float(i) for i in range(5,36)]
```

The following settings enabled training on the cricket chirp data to converge with an RMSE of ~0.8, which seems like a sweet spot of accuracy vs training time:

- Learning rate: 0.01
- Epochs: 50
- Batch size: 1

Decreasing the learning rate (eg 0.001) and increasing the epochs (eg 500) converges with an RMSE of ~0.5, but takes forever. Increasing the batch size increases the choppiness of the error tail.
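The exercise wraps model-building in its own helper functions, but the settings above can be sketched standalone with the standard Keras API (I’m assuming RMSprop as the optimizer here; variable names are mine):

```python
import numpy as np
import tensorflow as tf

chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,
                   131,134,144,149,158,165,170,176,187,189,197,208,215], dtype=float)
temps = np.arange(5, 36, dtype=float)

# Single-unit, single-feature model: temp = w * chirps + b.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),
    tf.keras.layers.Dense(units=1),
])
model.compile(optimizer=tf.keras.optimizers.RMSprop(learning_rate=0.01),
              loss='mean_squared_error',
              metrics=[tf.keras.metrics.RootMeanSquaredError()])

# Learning rate 0.01, 50 epochs, batch size 1, per the settings above.
history = model.fit(chirps.reshape(-1, 1), temps,
                    epochs=50, batch_size=1, verbose=0)
print('Final RMSE: %.3f' % history.history['root_mean_squared_error'][-1])
```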

The summary at the bottom of the synthetic data exercise seems generally useful:

- “Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.
- If the training loss does not converge, train for more epochs.
- If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
- If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
- Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
- Setting the batch size to a *very* small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
- For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you’ll need to reduce the batch size to enable a batch to fit into memory.”

For the real data, there’s a note about the “max” being anomalous relative to the different percentiles, which makes sense, but is a little abstract. The plot does a good job showing outliers.

Interesting that the RMSE for the real data is ~100, rather than the zero I was going for with the synthetic data. I guess the point is that we’re trying to minimize loss, rather than eliminate it.

[2] uses California housing data, but we can browse other datasets at https://datasetsearch.research.google.com/.

Great tip to use `corr` to see which features correlate with a label, as an alternative to trial-and-error feature selection.
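For instance, with the synthetic chirp data in a dataframe (column names are mine), `corr` makes the chirps–temperature relationship obvious:

```python
import numpy as np
import pandas as pd

temps = np.arange(5, 36)
chirps = 7 * temps - 30  # noise-free Dolbear data for a clean example
df = pd.DataFrame({'chirps': chirps, 'temp': temps})

# corr() reports pairwise Pearson correlations between columns; features
# highly correlated with the label are good candidates for the model.
print(df.corr())
```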