MLCC: Neural Networks

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Neural Networks” module.

Does “deep learning” imply neural networks?

The introductory video refers to “deep neural networks”, so I’m wondering what the relationship is between deep learning and neural networks.

Yes, according to Quora’s “Does deep learning always mean neural network or can include other ML techniques?”.

“To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning)” – https://cs231n.github.io/neural-networks-1/

“Deep Learning is simply a subset of the architectures (or templates) that employs ‘neural networks’” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df (TDS)

“Deep learning” in Google’s glossary links to “deep model”: “A type of neural network containing multiple hidden layers.”

“However, until 2006 we didn’t know how to train neural networks to surpass more traditional approaches, except for a few specialized problems. What changed in 2006 was the discovery of techniques for learning in so-called deep neural networks.” – http://neuralnetworksanddeeplearning.com/about.html

Towards Data Science’s “Intuitive Deep Learning Part 1a: Introduction to Neural Networks” clarifies “deep learning” is a subset of machine learning. I guess they’re both “learning”. I like the comparison of an algorithm to a recipe, and in this context, ML optimizes a recipe. Deep learning is a subset of optimization techniques.

When to use neural networks?

Small data with linear relationships → LSR

Large data with linear relationships → gradient descent

Large data with simple, nonlinear relationships → feature crosses

Large data with complex, nonlinear relationships → NN

“Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

“Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data” – http://neuralnetworksanddeeplearning.com/index.html

NN “have the flexibility to model many complicated relationships between input and output” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df

“That’s not to say that neural networks aren’t good at solving simpler problems. They are. But so are many other algorithms. The complexity, resource-intensiveness and lack of interpretability in neural networks is sometimes a necessary evil, but it’s only warranted when simpler methods are inapplicable” – https://www.quora.com/What-kinds-of-machine-learning-problems-are-neural-networks-particularly-good-at-solving

Why are there multiple layers?

“each layer is effectively learning a more complex, higher-level function over the raw inputs” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy

“A single-layer neural network can only be used to represent linearly separable functions … Most problems that we are interested in solving are not linearly separable.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
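XOR is the classic example of a function that is not linearly separable. The sketch below shows one hidden layer of two ReLU neurons computing it on binary inputs. The weights are hand-picked for illustration (an assumption, not something a training run produced), and the names relu and xor_net are hypothetical.

```python
def relu(z):
    """Rectified linear unit: passes positive values, clamps the rest to 0."""
    return max(0.0, z)


def xor_net(x1, x2):
    # Hidden layer: two ReLU neurons over the same weighted sum, different biases.
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    # Output layer: a plain linear combination of the hidden activations.
    return h1 - 2.0 * h2


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

No single line through the four input points separates the two classes, but the hidden layer bends the space enough for a linear output to work.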

The universal approximation theorem states that a network with a single hidden layer and enough neurons can approximate any continuous function, so one hidden layer is sufficient in principle – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don’t need any hidden layers at all. Of course, you don’t need an NN to resolve your data either, but it will still do the job.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“One hidden layer is sufficient for the large majority of problems.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn it with two (or more) hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“Multi-layer” implies at least one hidden layer: “It has an input layer that connects to the input variables, one or more hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

Chris Olah’s “Neural Networks, Manifolds and Topology”, linked from the crash course, visualizes how data sets intersecting in n dimensions may be disjoint in n + 1 dimensions, which enables a linear solution. Other than that, though, Olah’s article was over my head. Articles like TDS are more my speed.

Why are some layers called “hidden”?

“The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

How many layers do I need?

Task 4 in the exercise recommends playing around with the hyperparameters to get a certain loss, but the combinatorial complexity makes me wonder if there’s an intuitive way to think about the role of layers and neurons. 🤔

“Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation)” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises Seems narrowing the problem space to ReLU enables some deterministic optimization.

“The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem” – https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
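A rough sketch of why the gradient vanishes: backpropagation multiplies one activation slope per layer, and the sigmoid’s slope is never larger than 0.25. This ignores the weights entirely (a simplifying assumption), so it only illustrates the trend.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid, written in terms of its own output."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_slope(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


best_case = sigmoid_slope(0.0)  # the sigmoid is steepest at z = 0
for depth in (1, 5, 10, 20):
    # Product of per-layer slopes in the best case for sigmoid.
    print(depth, best_case ** depth)
```

For an active ReLU neuron the slope is 1, so the same product does not shrink with depth.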

“use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting” – https://cs231n.github.io/neural-networks-1/#arch

“a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn’t enough to model this data set well” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

“A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

Two hidden layers with eight neurons in the first and two in the second performed well (~0.15 loss) on repeated runs.

Heuristics from spiral solution video:

  1. Tune number of layers and nodes. Max neurons in the first layer, tapering down a couple layers to the output is a reasonable start. Each neuron takes time to train, though, so reduce total neurons if training is too slow. This is reinforced by the practice exercise, which started with two layers of 20 and 12 neurons, and then tried to reduce the number of neurons while keeping loss stable.
  2. Reduce the learning rate to smooth loss curve
  3. Add regularization to further smooth loss curve
  4. Feature engineering helps with noisy data
  5. Try different activation functions. Ultimately, tanh had the best fit
  6. Iterate from 1

Even after all this, tuning hyperparameters still seems combinatorially complex.

Activation functions

A neural net consists of layers. Nodes in the bottom layer are linear equations. Nodes in a “hidden” layer transform a linear node into a non-linear node using an “activation function”. The crash course states “any mathematical function can serve as an activation function”.

A sigmoid is an example of an activation function. I remember from the module on logistic regression (notes) that we used a sigmoid to transform a linear equation into a probability.

Why is it called a “neuron”?

The glossary definition for “neuron” is pretty good: 1) “taking in multiple input values and generating one output value”, and 2) “The neuron calculates the output value by applying an activation function.” Aside: this reminds me of lambda architecture. I appreciate TDS clarifying neurons “often take some linear combination of the inputs”, like w1x1 + w2x2 + w3x3. I suppose this is what the glossary means by “a weighted sum of input values”.
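The two parts of that definition can be sketched as a single function: a weighted sum of the inputs, followed by an activation function. The function name neuron and the example weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias, activation):
    """Takes multiple input values and generates one output value."""
    # Linear part: w1*x1 + w2*x2 + ... + bias
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear part: the activation function
    return activation(weighted_sum)


# 0.5*1.0 + (-0.25)*2.0 + 0.0*3.0 = 0, and sigmoid(0) = 0.5
output = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.0], 0.0, sigmoid)
print(output)
```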

TDS references a single image from the biological motivations section of Stanford’s CS231n, but I find both the images from that section useful for comparison.

I like TDS’ definition of a “layer” as “a “neural network” is simply made out of layers of neurons, connected in a way that the input of one layer of neuron is the output of the previous layer of neurons”. In that context, the hidden layer diagrams from the crash course make sense.

MLCC: Regularization for sparsity

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization for Sparsity” module.

Best-practice: if you’re overfitting, you want to regularize.

“Convex Optimization” by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

  • “A problem is sparse if each constraint function depends on only a small number of the variables”
  • “Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems”, which would explain why gradient descent is a tool we use
  • Regularization is when “extra terms are added to the cost function”
  • “If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints”, so it would seem performance is another motivation for regularization

Ideally, we could perform L0 regularization, but that’s non-convex, and so, NP-hard (slide 7). (I like Math is Fun’s NP-complete page 🙂) As noted wrt gradient descent, we need a convex loss curve to optimize. L1 approximates L0 and is easy to compute.

Quora provides a couple intuitive explanations for L1 and L2 norms: “L2 norm there yields Euclidean distance … The L1 norm gives rise to what can be referred to as the “taxi-cab” distance”
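A small sketch of the two distances, using a 3-4-5 triangle so the numbers are easy to check by hand. The function names are hypothetical.

```python
import math


def l1_distance(a, b):
    """Taxi-cab distance: sum of the absolute differences per coordinate."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


origin, corner = (0, 0), (3, 4)
print(l1_distance(origin, corner))  # 3 blocks east + 4 blocks north
print(l2_distance(origin, corner))  # straight-line distance
```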

Rorasa’s blog states “Norm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc … Because the lack of l0-norm’s mathematical representation, l0-minimisation is regarded by computer scientist as an NP-hard problem, simply says that it’s too complex and almost impossible to solve. In many case, l0-minimisation problem is relaxed to be higher-order norm problem such as l1-minimisation and l2-minimisation.”

The glossary summarizes:

  • L1 regularization “penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0”
  • L2 regularization “penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0”
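The “exactly 0” versus “not quite 0” difference can be seen in a toy sketch that applies only the penalty to a few weights and ignores the data loss. The L1 step here is soft-thresholding, which is one common way to apply an L1 penalty; that choice and the strength of 0.1 are assumptions, not something the crash course specifies.

```python
def l1_step(w, strength):
    """Move the weight toward 0 by a fixed amount, stopping at exactly 0."""
    if w > strength:
        return w - strength
    if w < -strength:
        return w + strength
    return 0.0


def l2_step(w, strength):
    """Shrink the weight in proportion to its size; it never reaches 0."""
    return w * (1.0 - strength)


weights = [5.0, 0.5, -0.03]
l1_weights, l2_weights = list(weights), list(weights)
for _ in range(10):
    l1_weights = [l1_step(w, 0.1) for w in l1_weights]
    l2_weights = [l2_step(w, 0.1) for w in l2_weights]

print(l1_weights)  # small weights land on exactly 0.0
print(l2_weights)  # every weight is smaller, none is 0
```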

MLCC: Logistic regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Logistic Regression” module.

“Logistic regression” generates a probability (a value between 0 and 1). It’s also very efficient.

Note the glossary defines logistic regression as a classification model, which is weird since it has “regression” in the name. I suspect this is explained by “You can interpret the value between 0 and 1 in either of the following two ways: … a binary classification problem … As a value to be compared against a classification threshold …”

The “sigmoid” function, aka “logistic” function/transform, produces a bounded value between 0 and 1.

Note the sigmoid function is just y = 1 / (1 + e^(-𝞼)), where 𝞼 is our usual linear equation. I suppose we’re transforming the linear output into a logistic form.

Regularization (notes), especially L2 regularization and early stopping, is important in logistic regression: “Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions”.

The “logit”, aka “log-odds”, function is the inverse of the logistic function.

The loss function for logistic regression is “log loss”.
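A minimal sketch tying the three pieces together: the sigmoid, its inverse (the logit), and log loss. The loss is averaged over examples here, which is an assumption; some presentations use the sum instead.

```python
import math


def sigmoid(z):
    """Maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds: the inverse of the sigmoid, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def log_loss(labels, probabilities):
    """Mean log loss for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


print(sigmoid(0.0))                    # a linear output of 0 maps to 0.5
print(logit(sigmoid(2.0)))             # round-trips back to roughly 2.0
print(log_loss([1, 0], [0.9, 0.1]))    # confident and right: low loss
print(log_loss([1, 0], [0.6, 0.4]))    # less confident: higher loss
```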

MLCC: Classification

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Classification” module.

New metrics for evaluating classification performance:

  • Accuracy
  • Precision
  • Recall
  • ROC
  • AUC

Accuracy

“Accuracy” simply measures percentage of correct predictions.

It fails on class-imbalance, aka “skewed class”, problems, though. Neptune AI states it bluntly: “You shouldn’t use accuracy on imbalanced problems.” Heuristic: is the percent accuracy > the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the “accuracy paradox”. Precision and recall are better suited to class-imbalance problems.

Tip: calculate odds independently if possible to compare with accuracy.

Confusion matrix

A “confusion matrix”, aka “classification matrix”, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a “type one” error. A false negative is a “type two” error. When the cost of a missed positive is high, type two errors must be minimized. In other words, when the cost of a false negative is high, maximize recall.

Precision and recall

Andrew Ng’s “Lecture 11.4 — Machine Learning System Design | Trading Off Precision And Recall” provides a helpful phrasing:

  • Precision = true positive / predicted positive
  • Recall = true positive / actual positive

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn’t spam), it will fail recall and precision because it never has a true positive.
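The accuracy paradox can be sketched with made-up numbers: 10 positives in 1,000 examples and a “model” that always predicts negative. The data and function names are hypothetical.

```python
def confusion_counts(labels, predictions):
    """Returns (true pos, false pos, false neg, true neg) for 0/1 values."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def safe_ratio(numerator, denominator):
    """Returns None instead of dividing by zero."""
    return numerator / denominator if denominator else None


labels = [1] * 10 + [0] * 990
always_negative = [0] * 1000

tp, fp, fn, tn = confusion_counts(labels, always_negative)
accuracy = (tp + tn) / len(labels)
precision = safe_ratio(tp, tp + fp)  # true positive / predicted positive
recall = safe_ratio(tp, tp + fn)     # true positive / actual positive
print(accuracy, precision, recall)
```

Accuracy looks excellent, recall is zero, and precision is undefined because the model never predicts positive.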

Wikipedia makes a point: “It is trivial to achieve recall of 100% by returning all documents in response to any query”

Precision and recall are important, and in tension. Classification depends on a “threshold”. Increasing the threshold increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the “precision-recall curve” can also help demonstrate the relationship, as demonstrated by Andrew Ng.
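A tiny sketch of the tension, using six made-up (score, label) pairs and two thresholds. Real data is noisier, so precision does not always rise monotonically with the threshold; this only shows the typical direction.

```python
# Hypothetical model scores paired with the true 0/1 label.
scored = [(0.1, 0), (0.3, 0), (0.4, 1), (0.5, 0), (0.7, 1), (0.9, 1)]


def precision_recall(scored_examples, threshold):
    """Classifies score >= threshold as positive, then scores the result."""
    tp = sum(1 for s, y in scored_examples if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored_examples if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored_examples if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall


low = precision_recall(scored, 0.35)  # aggressive: catches every positive
high = precision_recall(scored, 0.6)  # conservative: no false positives
print(low)
print(high)
```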

Wikipedia has a nice visualization differentiating precision and recall.

ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuest’s “ROC and AUC, Clearly Explained!” states precision is a better metric than the false positive rate for class imbalance problems because it doesn’t take true negatives into account.

Keras gives us AUC for a model, but what’s the corresponding threshold? The crash course clarifies: “AUC is classification-threshold-invariant. It measures the quality of the model’s predictions irrespective of what classification threshold is chosen.” Ok, then why use anything but AUC? Neptune AI summarizes: “… use it when you care equally about positive and negative classes.”
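One way to see why no threshold is involved: AUC can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The sketch below computes it by brute force over all pairs, on made-up scores. It is fine for a handful of examples but is not how a library would compute it at scale.

```python
def auc(scored_examples):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in scored_examples if y == 1]
    negatives = [s for s, y in scored_examples if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores paired with the true 0/1 label.
scored = [(0.1, 0), (0.3, 0), (0.4, 1), (0.5, 0), (0.7, 1), (0.9, 1)]
print(auc(scored))  # 8 of the 9 pairs are ordered correctly
```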

Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something’s amiss.

The formal definition is: average predicted occurrence – average actual occurrence. There’s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.
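A short sketch of the definition, with made-up labels. The first “model” returns the average occurrence for every example, which gives zero bias even though it distinguishes nothing.

```python
def prediction_bias(predictions, labels):
    """Average predicted occurrence minus average actual occurrence."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


labels = [1, 0, 0, 0]                 # the event occurs 25% of the time
constant = [0.25, 0.25, 0.25, 0.25]   # always predicts the base rate
overconfident = [0.9, 0.5, 0.5, 0.5]  # predicts too high on average

print(prediction_bias(constant, labels))
print(prediction_bias(overconfident, labels))
```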

The crash course gives a few causes for bias. StatQuest’s “Machine Learning Fundamentals: Bias and Variance” adds another: the inability of a ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are a probability range, but actual values are discrete, so we need to segment values and average them to make a comparison.

MLCC: Regularization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization” module.

An earlier module focused on generalization (notes). A “generalization curve” visualizes generalization by showing loss for training data vs loss for validation data.

When training loss keeps falling but validation loss is noticeably higher (or starts rising), we’re “overfitting” to the training data, reducing generalization.

“Regularization” is the process of preventing overfitting. The TensorFlow docs also discuss regularization.

“Empirical risk minimization” refers to loss reduction using tools like gradient descent (notes).

“Structural risk minimization” refers to regularization by minimizing the complexity of the model.

The “L2 regularization” formula quantifies complexity as the sum of the squares of the feature weights.

“Lambda” aka “regularization rate” governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting in a steeper histogram of weights, for example. A tool called Vizier can help optimize lambda.
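A minimal sketch of how lambda enters the objective, assuming the L2 complexity term is the plain sum of squared weights. The weights and the data loss of 1.0 are made up, and lam stands in for lambda because lambda is a reserved word in Python.

```python
def l2_complexity(weights):
    """Sum of the squares of the feature weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Data loss plus the complexity term scaled by the regularization rate."""
    return data_loss + lam * l2_complexity(weights)


weights = [0.2, 0.5, 5.0, 1.0, 0.25, 0.75]
print(l2_complexity(weights))  # the single weight of 5.0 contributes 25 of the total

for lam in (0.0, 0.1, 1.0):
    print(lam, regularized_loss(1.0, weights, lam))
```

Because the one large weight dominates the complexity term, raising lam pushes hardest on exactly that weight.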

Helpful phrasing from StatQuest’s “Machine Learning Fundamentals: Bias and Variance”: regularization is one technique for finding a balance between a simple model (that may have high bias) and a complex model (that may have high variance).

Exercise 1

The answer for task 1 in the first exercise notes the “relative weight” of lines from FEATURES to OUTPUT in the playground. What is “relative weight”? 🤔 Later, the second exercise mentions “The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line.” So, “relative weight” in this context is just referring to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: “The lines emanating from X1 and X2 are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.” Task 2 states “If we use a model that is too complicated, such as one with too many crosses …” Later, we learn “If model complexity is a function of weights …” Is complexity a function of crosses or weights? 🤔  I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this, eg ~1.5, 0.131 and 0.033, respectively, vs ~0.9 with losses 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple, orthogonal components to “complexity”.

Exercise 2

An answer in the playground mentions: “While test loss decreases, training loss actually increases. This is expected, because you’ve added another term to the loss function to penalize complexity.” 🤔  I think this is referring to the literal addition of the complexity term in the calculation to find a weight: minimize(loss(data|model) + complexity(model)).

MLCC: Feature crosses

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Feature Crosses” section.

“Feature cross”, “feature cross product” and “synthetic feature” are synonymous. A feature cross is the cross product of two features. The nonlinearity sub-section states “The term cross comes from cross product.” Thinking of it as a Cartesian product, which the glossary mentions, helps me grok what’s going on, and why it’s helpful for the example problem where examples are clustered by quarter (to consider x-y pairs), and esp the exercise involving latitude and longitude pairs.
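A sketch of the quadrant problem with made-up points: no straight line in x1 and x2 separates the classes, but a linear rule on the crossed feature x1 * x2 does. The weight and bias are hand-picked for illustration rather than learned.

```python
# One hypothetical point per quadrant; label is 1 when x1 and x2 share a sign.
points = [(1.0, 2.0), (-1.5, -0.5), (-2.0, 1.0), (0.5, -3.0)]
labels = [1, 1, 0, 0]


def cross(x1, x2):
    """Synthetic feature: the product of the two original features."""
    return x1 * x2


def predict(x1, x2, weight=1.0, bias=0.0):
    """A linear model over the single crossed feature."""
    return 1 if weight * cross(x1, x2) + bias > 0 else 0


predictions = [predict(x1, x2) for x1, x2 in points]
print(predictions)
```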

The video states “Linear learners use linear models”. What is a “linear model”? Given “model” is synonymous with “equation” or “function”, a “linear model” is a linear equation. For example, Brilliant’s wiki states: “A linear model is an equation …” What is a “linear learner”? The video might just be stating a fact: something that learns using a linear model is a “linear learner”. For example, Amazon SageMaker’s Linear Learner docs state “The algorithm learns a linear function”.

A “linear problem” describes a relationship that can be expressed using a straight line (to divide the input data). “Nonlinear problems” cannot be expressed this way.

While trying to figure out why the exercise used an indicator_column, I found some nice TensorFlow tutorials, eg for feature crosses. In retrospect, I see the indicator_column docs state simply “Represents multi-hot representation of given categorical column.”

MLCC: Representation

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Representation” section.

“feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.” – Scott Locklin, in “Neglected machine learning ideas”, as quoted in Machine Learning Mastery’s feature engineering overview

I’ve heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that’s appropriate and efficient for inputting into a model. The “scrubbing” sub-section actually seems like the only thing that fits what I previously thought of as “cleaning”, eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

  • Good feature values should appear more than 5 or so times in a data set … avoid unique IDs
  • Keep data pure by not encoding exceptional states into a feature’s value type, eg an integer feature where -1 means undefined, aka “magic” values. Instead, use boolean flags for exceptional states.

The “Z score” scales values as follows: scaled = (value - mean) / stdev. Math is Fun has a good explanation for how to derive the standard deviation, but Pandas also provides it trivially in the output from describe.
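A minimal sketch of Z-score scaling using only the standard library. It uses the population standard deviation (statistics.pstdev), which is an assumption made to keep the numbers round; Pandas reports the sample standard deviation by default, so its values will differ slightly.

```python
import statistics


def z_scores(values):
    """Scales each value to (value - mean) / stdev."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / stdev for v in values]


values = [2, 4, 4, 4, 5, 5, 7, 9]  # mean 5, population stdev 2
scaled = z_scores(values)
print(scaled)
```

After scaling, the values are centered on zero and expressed in units of standard deviations.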

“Binning” seems similar to *-hot encoding in that we’re enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.
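A sketch of binning a continuous value into a one-hot vector, so each bin can get its own weight. The latitude boundaries are hypothetical, chosen only for illustration.

```python
import bisect


def one_hot_bin(value, boundaries):
    """Returns a one-hot list marking which bin the value falls into."""
    # bisect_right gives the number of boundaries at or below the value.
    index = bisect.bisect_right(boundaries, value)
    vector = [0] * (len(boundaries) + 1)
    vector[index] = 1
    return vector


latitude_boundaries = [33.0, 35.0, 37.0, 39.0]  # 4 boundaries make 5 bins
print(one_hot_bin(34.2, latitude_boundaries))
print(one_hot_bin(41.0, latitude_boundaries))
```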

Histograms and stats, like those output by describe, can help detect bad data.

MLCC: Generalization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] through [3].

The fundamental tension of machine learning is between fitting our data well, but also fitting the data as simply as possible.

A reasonable guideline: “The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.”

[2] recommends a best-practice: divide labeled examples into “training” and “test” sets.

Never train on test data! 100% accuracy can be a symptom of that.

[3] goes further: divide labeled examples into three sets: “training”, “validation” and “test”. Simply testing against a “test” set risks overfitting to that set. Instead, iterate against the validation set, and then double-check using the test set.

A continuing impression is that TensorFlow builds in a lot of the best-practices described in this crash course. For example, splitting out a validation set and testing against it is a first-class argument to the Model.fit method.

The exercise associated with [3] is interesting. First, testing against a validation set caught a bug! Second, the bug was a default sort on the latitude column; the validation set was not a random sample.
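The fix for that kind of bug is to shuffle before splitting. A minimal sketch follows, with hypothetical 70/15/15 proportions and a fixed seed so the split is repeatable.

```python
import random


def split_examples(examples, train=0.7, validation=0.15, seed=0):
    """Shuffles a copy of the examples, then slices it into three sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_validation = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


examples = list(range(100))  # stand-in for rows sorted by latitude
train_set, validation_set, test_set = split_examples(examples)
print(len(train_set), len(validation_set), len(test_set))
```

Without the shuffle, the validation slice would hold only the last rows of the sorted data.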

References

  1. Google Machine Learning Crash Course: “Generalization”
  2. Google Machine Learning Crash Course: “Training and Test Sets”
  3. Google Machine Learning Crash Course: “Validation Set”

MLCC: First Steps with TensorFlow

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

[2] introduces Colab, NumPy, Pandas and TensorFlow.

Colab is like a hosted Jupyter notebook and provides an easy way to play with Python ML libraries, among other things.

NumPy provides performant and user-friendly collections and operations for linear algebra.

Pandas provides tools for working with “dataframes”, which are like spreadsheets in memory.

Digression into Google Sheets

I like building on my understanding. In this context, I want to learn Colab and NumPy by using them to work with the cricket chirp data introduced in [1].

[1] used cricket chirps per minute per temperature as an example, but didn’t provide raw data. Dolbear’s Law provides an equation we can use to generate data, where T_C is the temperature in Celsius and N_60 is the number of chirps per minute: T_C = 10 + (N_60 - 40) / 7 → N_60 = 7 * T_C - 30

Colab and NumPy provide an easy way to use this equation:

import numpy as np

# Starts by generating temp, since chirps are dependent on temp.
# Starts at 5 because Dolbear’s formula results in a negative value below 5 degrees
temps = np.arange(5,36)

# Adds noise to avoid an obviously linear relationship.
# Copies the approach from “NumPy UltraQuick Tutorial“ linked from [2].
# Sets low of -5, which limits the minimum chirps to zero.
noise = np.random.randint(low=-5, high=5, size=temps.size)
chirps = 7 * temps - 30 + noise

# Prints CSVs, since Google Sheets knows how to split CSVs on paste.
print(','.join([str(i) for i in temps]))
print(','.join([str(i) for i in chirps]))

Example chirps per minute:

7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215

Note this generates synthetic data for chirps per minute, but then I’ll use them to predict temperature, ie chirps is the feature and temperature is the label.

Copy the temps and chirps CSVs. In Sheets, Edit > paste special > paste comma-separated text (CSV) as columns.

To improve readability, cut the pasted content and Edit > paste special > paste transposed to convert row data to column data.

Add column headers, select everything and then Insert > Chart.

Select “Scatter chart” for the chart type. Under Customize > Series, check the trendline box. Select “Equation” for the label to get the regression equation. Check the R2 box.

We can also use the SLOPE and INTERCEPT functions to calculate the equation.

Slope, intercept and R2, respectively, given the example chirps per minute from above:

  • 0.144
  • 4.323
  • 0.999

Unfortunately, Sheets doesn’t have MSE, which I learned about in [1], which leads me to wonder, “What’s the relationship between R2 and MSE?” Per [3], we’re better off with MSE.

Digression into SciKit

[2] introduces Pandas after NumPy, but continuing the theme of building on understanding, I’d like to perform a linear regression in Colab, rather than copy-pasting into Sheets. I’ll follow [4] and [5] and defer Pandas until I need it for TensorFlow.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

actual_temps = np.arange(5,36)
chirps = np.array([7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215])

model = LinearRegression()
model.fit(chirps[:, np.newaxis], actual_temps)

predicted_temps = model.predict(chirps[:, np.newaxis])

plt.scatter(chirps, actual_temps)
plt.plot(chirps, predicted_temps)

# Starts the y-axis at zero, even though the data starts at 5
plt.ylim(0)

print('Slope: %.3f' % model.coef_[0])
print('Intercept: %.3f' % model.intercept_)
print('MSE: %.3f' % mean_squared_error(actual_temps, predicted_temps))
print('R2: %.3f' % r2_score(actual_temps, predicted_temps))

Slope, intercept, MSE and R2, respectively:

  • 0.144
  • 4.323
  • 0.085
  • 0.999

Note SciKit can calculate MSE and R2. Perhaps in line with [3], note MSE is non-zero, but R2 close to 100% 🤔 
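The two numbers are linked: R2 is 1 minus the ratio of MSE to the variance of the labels. A small MSE next to a large label variance therefore gives an R2 near 1. The sketch below plugs in the MSE reported above.

```python
temps = list(range(5, 36))  # the labels used in the regression above
mean_temp = sum(temps) / len(temps)

# Population variance of the labels: the MSE of always predicting the mean.
variance = sum((t - mean_temp) ** 2 for t in temps) / len(temps)

reported_mse = 0.085  # from the SciKit output above
r2 = 1.0 - reported_mse / variance
print(variance, r2)
```

So an MSE of 0.085 is non-zero, but it is tiny relative to a label variance of 80.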

As expected, Sheets is great for common stuff, but Colab/Jupyter shines for arbitrary calculation.

TensorFlow

Coincidentally, TensorFlow’s fifth birthday was just a couple days ago 🥳

Continuing the theme of building on experience, I’m using the cricket chirp data for the synthetic exercise:

my_feature = ([float(i) for i in [7,13,15,27,31,38,45,57,57,67,76,85,89,94,100,109,116,120,131,134,144,149,158,165,170,176,187,189,197,208,215]])
my_label   = ([float(i) for i in range(5,36)])

The following settings enabled the cricket chirp data to converge with an RMSE ~ 0.8, which seems like a sweet spot of accuracy vs training time:

  • Learning rate: 0.01
  • Epochs: 50
  • Batch size: 1

Decreasing the learning rate (eg 0.001) and increasing the epochs (eg 500) converges with an RMSE ~0.5, but takes forever. Increasing the batch size increases choppiness of the error tail.

The summary at the bottom of the synthetic data exercise seems generally useful:

  • “Training loss should steadily decrease, steeply at first, and then more slowly until the slope of the curve reaches or approaches zero.
  • If the training loss does not converge, train for more epochs.
  • If the training loss decreases too slowly, increase the learning rate. Note that setting the learning rate too high may also prevent training loss from converging.
  • If the training loss varies wildly (that is, the training loss jumps around), decrease the learning rate.
  • Lowering the learning rate while increasing the number of epochs or the batch size is often a good combination.
  • Setting the batch size to a very small batch number can also cause instability. First, try large batch size values. Then, decrease the batch size until you see degradation.
  • For real-world datasets consisting of a very large number of examples, the entire dataset might not fit into memory. In such cases, you’ll need to reduce the batch size to enable a batch to fit into memory.”
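The learning rate advice can be sketched on the simplest possible loss, (w - 3)^2, whose minimum is at w = 3. This toy loss and the two learning rates are assumptions chosen for illustration and are not from the exercise.

```python
def descend(learning_rate, steps=50, start=0.0):
    """Runs gradient descent on loss(w) = (w - 3)^2 and returns the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)  # derivative of (w - 3)^2
        w -= learning_rate * gradient
    return w


small_rate_result = descend(0.1)  # each step shrinks the error by a factor of 0.8
large_rate_result = descend(1.1)  # each step overshoots and grows the error
print(small_rate_result)
print(large_rate_result)
```

With the small rate, w settles near 3. With the large rate, every step jumps past the minimum by more than it started with, so the loss never converges.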

For the real data, there’s a note about the “max” being anomalous relative to the different percentiles, which makes sense, but is a little abstract. The plot does a good job showing outliers.

Interesting that the RMSE for the real data is ~100, rather than the zero I was going for with the synthetic data. I guess the point is that we’re trying to minimize loss, rather than eliminate it.

[2] uses California housing data, but we can browse other datasets at https://datasetsearch.research.google.com/.

Great tip to use corr to see which features correlate with a label, as an alternative to trial and error hyperparameter tuning.

References

  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “First Steps with TensorFlow”
  3. University of Virginia Library: “Is R-squared Useless?”
  4. Python Data Science Handbook: “In Depth: Linear Regression” excerpt
  5. SciKit: “Linear Regression Example”

MLCC: Gradient descent

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [2].

Earlier, I explored simple linear regression, largely based on [1]. The next section of the crash course ([2]) dives into “gradient descent” (GD), which raises the question “What’s wrong with the linear regression we just learned?” In short, the technique we just learned, Ordinary Least Squares (OLS), does not scale.

[3] clarifies linear regression can take a few forms depending on input and processing constraints. Among these forms, OLS concerns one or more inputs where “all of the data must be available and you must have enough memory to fit the data and perform matrix operations” and uses least squares to find the best line. GD concerns “a very large dataset either in the number of rows or the number of columns that may not fit into memory.” As described by [4], OLS doesn’t scale. GD scales by finding a “numerical approximation … by iterative method”.
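To make the contrast concrete, here is a small sketch that fits the same line both ways. The data, learning rate, and iteration count are assumptions chosen for illustration. The data set is tiny, so both approaches fit in memory. The point is only that the iterative approximation lands on the same answer as the one-shot solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 3x + 2 plus a little noise.
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=200)

# Design matrix with a column of ones for the bias term.
X = np.column_stack([x, np.ones_like(x)])

# OLS: solve the least-squares problem in one shot using all the data.
ols_weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# GD: start from zeros and repeatedly step against the gradient of MSE.
gd_weights = np.zeros(2)
learning_rate = 0.1
for _ in range(2000):
    errors = X @ gd_weights - y
    gradient = 2.0 * X.T @ errors / len(y)
    gd_weights -= learning_rate * gradient

print("OLS [slope, bias]:", ols_weights)
print("GD  [slope, bias]:", gd_weights)
```

This version of GD still touches every example on each iteration. The memory savings come from computing the gradient on smaller batches instead.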

[2] introduces GD by descending a parabola, but it’s unclear how we transitioned from talking about straight lines in [1] to parabolas. The distinction is that we’re now focusing on loss functions. (To be fair, in retrospect, the title is “Reducing loss”🤦‍♂️) [2] asserts “For the kind of regression problems we’ve been examining, the resulting plot of loss vs. w1 will always be convex”, ie a parabola. OLS takes all the data and computes an optimal line, but GD iteratively generates lines and determines whether one is optimal by comparing the loss to the previous iteration.

[1] introduced the idea of quantifying the accuracy of a regression by calculating the loss. For example, it mentioned Mean Squared Error as a common loss function. [5] clarifies that Mean Squared Error is a quadratic function of the weights (the errors are squared), which is why its plot is a parabola. This provides helpful context for [2]’s definition of “gradient” as the derivative of the loss function.
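A tiny worked example helped me here. It assumes four made-up points lying exactly on y = 2x and a model with a single weight w1 (bias fixed at 0), and prints the loss and its derivative at a few values of w1:

```python
import numpy as np

# Hypothetical examples lying exactly on y = 2x, with the bias fixed at 0.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x


def mse(w1):
    """Mean squared error of the model y' = w1 * x."""
    return float(np.mean((w1 * x - y) ** 2))


def mse_gradient(w1):
    """Derivative of MSE with respect to w1: mean of 2 * x * (w1 * x - y)."""
    return float(np.mean(2.0 * x * (w1 * x - y)))


for w1 in [0.0, 1.0, 2.0, 3.0, 4.0]:
    print(f"w1={w1:.1f}  loss={mse(w1):6.2f}  gradient={mse_gradient(w1):6.2f}")
```

For this data the loss works out to 7.5 × (w1 − 2)², a parabola with its minimum at w1 = 2. The gradient is negative to the left of the minimum, zero at it, and positive to the right, so stepping against the gradient always moves toward the bottom.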

I like the summary statement from [5]:

“The goal of any Machine Learning Algorithm is to minimize the Cost Function”

[5] uses the interactive exercise from [2]. It’s reassuring to see convergence 😉

[4] presents a good example of a team trying to find the highest peak in a mountainous area by parachuting randomly over the range and reporting their local max daily. I can see how that would scale well for a large data set. Reminds me of MapReduce.

This example is a bit counter-intuitive, though, in that GD is trying to find a minimum (loss) rather than a maximum. It’d be better phrased as trying to find the deepest valley. Anyway, it states “Our aim is to reach the minima which is the valley bottom. So our gradient should be negative always … So if at our initial weights, the slope is negative, we are in the right direction”, which explains the “descent” in “gradient descent”: each step moves the weights in the direction opposite to the gradient, ie downhill.

[4] (like [2]) describes three forms of GD:

  1. Batch
  2. Stochastic
  3. Mini Batch

[2] defines “a batch” as “the total number of examples you use to calculate the gradient in a single iteration.” Presumably, it’s referring to Batch GD when it says “So far, we’ve assumed that the batch has been the entire data set.”

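[2] describes Stochastic GD as picking one example at random for each iteration. Given enough iterations this works, but it is very noisy. The motivation is that full-batch iterations take a long time on large data sets, which often contain redundant examples that add little to the gradient.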

[2] states Mini Batch “reduces the amount of noise in SGD but is still more efficient than full-batch” because it uses batches of 10-1000 random examples, and that Mini Batch is what’s used in practice.
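The noise trade-off can be seen by estimating the same gradient from batches of different sizes. This is a sketch under assumptions: the data set, the batch size of 100, and the helper names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set: y = 3x plus noise.
x = rng.uniform(-1, 1, size=10_000)
y = 3.0 * x + rng.normal(0, 0.5, size=10_000)


def gradient_estimate(w1, batch_size):
    """MSE gradient with respect to w1, estimated from one random batch."""
    idx = rng.choice(len(x), size=batch_size, replace=False)
    return float(np.mean(2.0 * x[idx] * (w1 * x[idx] - y[idx])))


def gradient_spread(batch_size, trials=200):
    """Standard deviation of repeated gradient estimates at w1 = 0."""
    estimates = [gradient_estimate(0.0, batch_size) for _ in range(trials)]
    return float(np.std(estimates))


spread_sgd = gradient_spread(batch_size=1)        # stochastic: one example
spread_mini = gradient_spread(batch_size=100)     # mini-batch
spread_full = gradient_spread(batch_size=len(x))  # full batch: every example

print(f"batch size 1:     spread {spread_sgd:.4f}")
print(f"batch size 100:   spread {spread_mini:.4f}")
print(f"batch size {len(x)}: spread {spread_full:.4f}")
```

A batch of one gives the noisiest estimate and the full batch gives essentially none, but the full batch reads every example on each iteration. The mini-batch sits in between, which matches [2]'s description of it as a compromise.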

When do we stop iterating? [2] states “you iterate until overall loss stops changing or at least changes extremely slowly. When that happens, we say that the model has converged.”

To summarize:

  1. Initialize with arbitrary weights
  2. Generate a model
  3. Sample (labeled) examples
  4. Input sample into the model
  5. Calculate the loss
  6. Compare the new loss with the previous loss
  7. If loss is still decreasing
    1. Move the weights one step against the gradient of the loss (the step is the learning rate times the gradient)
    2. Repeat from step 2
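The loop above can be sketched roughly as follows. This is my own minimal version, not code from [2]: the function name, default hyperparameters, and data are assumptions. Note that the update subtracts the learning rate times the gradient, and that the convergence check uses the loss over the whole data set for simplicity, which a truly large data set would not allow.

```python
import numpy as np


def mini_batch_gd(x, y, learning_rate=0.05, batch_size=32,
                  max_steps=10_000, tolerance=1e-6, seed=0):
    """Fit y' = w1 * x + b with mini-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    w1, b = 0.0, 0.0                  # arbitrary starting weights
    previous_loss = np.inf
    for step in range(1, max_steps + 1):
        # Calculate the loss of the current model (over all data, for simplicity).
        loss = float(np.mean((w1 * x + b - y) ** 2))
        # Stop once the loss has (almost) stopped changing.
        if abs(previous_loss - loss) < tolerance:
            break
        previous_loss = loss
        # Sample a random batch of labeled examples and run them through the model.
        idx = rng.choice(len(x), size=batch_size, replace=False)
        errors = (w1 * x[idx] + b) - y[idx]
        # Step each weight against its gradient of the batch MSE.
        w1 -= learning_rate * float(np.mean(2.0 * errors * x[idx]))
        b -= learning_rate * float(np.mean(2.0 * errors))
    return w1, b, loss, step


# Hypothetical data: y = 3x + 2 plus a little noise.
data_rng = np.random.default_rng(1)
x = data_rng.uniform(-1, 1, size=1000)
y = 3.0 * x + 2.0 + data_rng.normal(0, 0.1, size=1000)

w1, b, final_loss, steps = mini_batch_gd(x, y)
print(f"w1={w1:.2f}  b={b:.2f}  loss={final_loss:.4f}  steps={steps}")
```

Because each batch is random, the loss does not shrink perfectly smoothly, so a stopping rule like this one is only a rough stand-in for "the model has converged".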

References

  1. Google Machine Learning Crash Course: “Descending into ML”
  2. Google Machine Learning Crash Course: “Reducing loss”
  3. Machine Learning Mastery: “Linear Regression for Machine Learning”
  4. Towards Data Science: “Optimization: Ordinary Least Squares Vs. Gradient Descent — from scratch”
  5. Towards Data Science: “Understanding the Mathematics behind Gradient Descent”