ML Foundations: week 2

Coursera’s Lab was running slowly, so I explored Google’s Colab as an alternative.

A few nice features: CPU and RAM usage indicators let me know if I’m close to a limit; the run, create and move buttons on each cell are convenient.

In Coursera, download the FND02-NB01.ipynb.zip and home_data.sframe.zip files and unzip.

In Colab, click on “File > Upload notebook” and upload the unzipped notebook.

Add a cell to install Turi Create:

%%bash
pip install turicreate

Add another cell to authorize Colab to read files from Drive:

from google.colab import drive
drive.mount('/content/drive')

In Drive, select “upload folder” and upload the unzipped folder.

In Colab’s left rail, click on the little stylized folder icon (🗂) and browse Drive for the uploaded folder. Right-click on the folder and select “Copy path”.

Update the SFrame creation to use the copied path:

sales = turicreate.SFrame('/content/drive/MyDrive/home_data.sframe')

Credit to the “Bonus Method — My Drive” section of “Get Started: 3 Ways to Load CSV files into Colab” for describing the basics.

Aaand of course now that I’ve set up Colab, I see Coursera’s Lab is running faster 🤷‍♂️

Out of curiosity, I see the intercept is negative, indicating buyers require a minimum square footage. Solving for x when y=0, I see it’s ~180. I can plug that back into the model:

sqft_model.predict([{'sqft_living': 180}])
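
Solving for x when y=0 is just x = -intercept / slope. A minimal sketch of that arithmetic, using made-up coefficients chosen only so the answer lands near 180; the real values would come from the fitted model, and the helper name is hypothetical:

```python
def x_intercept(intercept: float, slope: float) -> float:
    """Return the x where intercept + slope * x == 0."""
    if slope == 0:
        raise ValueError("a flat line has no unique crossing of y = 0")
    return -intercept / slope

# Hypothetical coefficients, NOT the values from the course notebook.
example_intercept = -45000.0
example_slope = 250.0

min_sqft = x_intercept(example_intercept, example_slope)
print(min_sqft)
```

Plugging the result back into the linear equation should give a predicted price of roughly zero, which is what the predict call above checks against the real model.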

Control vs data planes

I recently became aware of a helpful dichotomy: control vs data plane. The former governs how the latter should be delivered.

I believe these terms come from the world of networking, but they’re now entering the world of application engineering via DevOps.

For example, I work on a product that delivers targeted configuration to apps. In this context, the targeting logic is the control plane, and the resulting configuration is the data plane. For contrast, the RESTful perspective would describe both as resources.

In this context, I can see if other patterns might apply. In particular, the best-practice of a declarative control plane has been helpful lately. As Azure’s introduction to Infrastructure as Code states, the goal is to specify “what an environment requires and not necessarily the how.” Collocating control with code simplifies reasoning and minimizes the cost of switching between application and infrastructure logic, similar to the benefits of collocating documentation with code.
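
As a rough illustration of a declarative control plane for targeted configuration, the sketch below keeps the targeting rules as plain data and derives the delivered configuration from them. The rule schema, attribute names and `resolve` function are all hypothetical, not the API of any real product:

```python
from typing import Any

# Control plane: declarative targeting rules (hypothetical schema).
# Each rule says WHAT should match, not HOW to evaluate it.
RULES: list[dict[str, Any]] = [
    {"when": {"platform": "ios", "beta": True}, "config": {"new_ui": True}},
    {"when": {"platform": "ios"}, "config": {"new_ui": False}},
    {"when": {}, "config": {"new_ui": False}},  # default: matches everything
]

def resolve(rules: list[dict[str, Any]], app: dict[str, Any]) -> dict[str, Any]:
    """Return the config of the first rule whose conditions all match."""
    for rule in rules:
        if all(app.get(key) == value for key, value in rule["when"].items()):
            return rule["config"]
    return {}

# Data plane: the configuration actually delivered to each app.
print(resolve(RULES, {"platform": "ios", "beta": True}))
print(resolve(RULES, {"platform": "android"}))
```

Because the rules are data, they can live next to the application code and be reviewed in the same commit.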

Praise for Markdown eng docs

Google has a technical documentation system called “g3doc”. The “The Knowledge: Towards a Culture of Engineering Documentation” presentation at SRECon16 described it well, so this post just highlights a few details:

  1. Documentation is collocated with code
  2. Documentation is rendered from code-like Markdown

The first point enables me to include documentation changes and code changes in the same commit.

The second point is appealing because it reduces the cost of context switching between code and documentation. For example, I can edit both in the same editor.

I think part of the appeal is Google’s monorepo. Everything is path-indexed, but things under a “g3doc” dir are rendered into web pages. Searching the repo returns results for code and docs.

Outside of Google, I think GitHub’s rendering of Markdown content is comparable.

MLCC: Neural Networks

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Neural Networks” module.

Does “deep learning” imply neural networks?

The introductory video refers to “deep neural networks”, so I’m wondering what the relationship is between deep learning and neural networks.

Yes, according to Quora’s “Does deep learning always mean neural network or can include other ML techniques?”.

“To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning)” – https://cs231n.github.io/neural-networks-1/

“Deep Learning is simply a subset of the architectures (or templates) that employs ‘neural networks’” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df (TDS)

“Deep learning” in Google’s glossary links to “deep model”: “A type of neural network containing multiple hidden layers.”

“However, until 2006 we didn’t know how to train neural networks to surpass more traditional approaches, except for a few specialized problems. What changed in 2006 was the discovery of techniques for learning in so-called deep neural networks.” – http://neuralnetworksanddeeplearning.com/about.html

Towards Data Science’s “Intuitive Deep Learning Part 1a: Introduction to Neural Networks” clarifies “deep learning” is a subset of machine learning. I guess they’re both “learning”. I like the comparison of an algorithm to a recipe, and in this context, ML optimizes a recipe. Deep learning is a subset of optimization techniques.

When to use neural networks?

Small data with linear relationships → LSR

Large data with linear relationships → gradient descent

Large data with simple, nonlinear relationships → feature crosses

Large data with complex, nonlinear relationships → NN

“Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises
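
To make the feature-cross alternative concrete, here is a minimal sketch with XOR-style data: no straight line in (x1, x2) separates the two classes, but a linear model over the single crossed feature x1 * x2 does. The four points and the hand-picked weight are assumptions for illustration, not data from the playground:

```python
# Four points, one per quadrant; label is 1 when x1 and x2 share a sign.
points = [(1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0)]
labels = [1, 1, 0, 0]

def predict_with_cross(x1: float, x2: float) -> int:
    """Linear model over one crossed feature: weight 1, bias 0."""
    crossed = x1 * x2  # the explicit feature cross
    return 1 if crossed > 0 else 0

predictions = [predict_with_cross(x1, x2) for x1, x2 in points]
print(predictions == labels)
```

A neural net is meant to find an equivalent transformation on its own, rather than relying on someone to guess the right cross.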

“Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data” – http://neuralnetworksanddeeplearning.com/index.html

NN “have the flexibility to model many complicated relationships between input and output”- https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df

“That’s not to say that neural networks aren’t good at solving simpler problems. They are. But so are many other algorithms. The complexity, resource-intensiveness and lack of interpretability in neural networks is sometimes a necessary evil, but it’s only warranted when simpler methods are inapplicable” – https://www.quora.com/What-kinds-of-machine-learning-problems-are-neural-networks-particularly-good-at-solving

Why are there multiple layers?

“each layer is effectively learning a more complex, higher-level function over the raw inputs” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy

“A single-layer neural network can only be used to represent linearly separable functions … Most problems that we are interested in solving are not linearly separable.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
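
Boolean XOR is the classic function that is not linearly separable. A minimal sketch of a hand-wired network with one hidden layer of two ReLU neurons that computes it; the weights are chosen by hand for illustration, not learned, and this is the boolean version rather than the playground’s data set:

```python
def relu(z: float) -> float:
    return max(0.0, z)

def xor_net(x1: float, x2: float) -> float:
    """One hidden layer (two ReLU neurons) feeding a linear output."""
    h1 = relu(x1 + x2)        # fires when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are on
    return h1 - 2.0 * h2      # cancel out the "both on" case

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```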

The universal approximation theorem states that a network with one hidden layer can approximate any continuous function, given enough neurons – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don’t need any hidden layers at all. Of course, you don’t need an NN to resolve your data either, but it will still do the job.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“One hidden layer is sufficient for the large majority of problems.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn it with two (or more) hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“Multi-layer” implies at least one hidden layer: “It has an input layer that connects to the input variables, one or more hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

Chris Olah’s “Neural Networks, Manifolds and Topology”, linked from the crash course, visualizes how data sets intersecting in n dimensions may be disjoint in n + 1 dimensions, which enables a linear solution. Other than that, though, Olah’s article was over my head. Articles like TDS are more my speed.

Why are some layers called “hidden”?

“The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

How many layers do I need?

Task 4 in the exercise recommends playing around with the hyperparameters to get a certain loss, but the combinatorial complexity makes me wonder if there’s an intuitive way to think about the role of layers and neurons. 🤔

“Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
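
The second rule is simple enough to write down. A small sketch, where the function name is hypothetical and rounding up when the mean is fractional is my assumption (the quote does not say):

```python
def rule_of_thumb_hidden_size(n_inputs: int, n_outputs: int) -> int:
    """Mean of input and output layer sizes, rounded up (an assumption)."""
    if n_inputs < 1 or n_outputs < 1:
        raise ValueError("layer sizes must be positive")
    return (n_inputs + n_outputs + 1) // 2

# Example: 10 input features and 1 output neuron.
print(rule_of_thumb_hidden_size(10, 1))
```

This only gives a starting point; the experimentation advice above still applies.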

“3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation)” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises Seems narrowing the problem space to ReLU enables some deterministic optimization.

“The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem” – https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
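
A rough way to see the vanishing gradient: the sigmoid’s derivative never exceeds 0.25, and backpropagation multiplies one such factor per layer. The sketch below ignores the weights entirely, which is a simplifying assumption, and only shows how quickly the best-case factor shrinks with depth:

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z: float) -> float:
    """Derivative of the sigmoid: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# The derivative peaks at z = 0, so this is the largest possible factor.
best_case = sigmoid_grad(0.0)

for depth in (1, 5, 10, 20):
    print(depth, best_case ** depth)
```

By contrast, ReLU’s derivative is exactly 1 for positive inputs, so the same product does not shrink along active paths.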

“use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting” – https://cs231n.github.io/neural-networks-1/#arch

“a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn’t enough to model this data set well” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

“A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

Two hidden layers with eight neurons in the first and two in the second performed well (~0.15 loss) on repeated runs.

Heuristics from spiral solution video:

  1. Tune number of layers and nodes. Max neurons in the first layer, tapering down a couple layers to the output is a reasonable start. Each neuron takes time to train, though, so reduce total neurons if training is too slow. This is reinforced by the practice exercise, which started with two layers of 20 and 12 neurons, and then tried to reduce the number of neurons while keeping loss stable.
  2. Reduce the learning rate to smooth loss curve
  3. Add regularization to further smooth loss curve
  4. Feature engineering helps with noisy data
  5. Try different activation functions. Ultimately, tanh had the best fit
  6. Iterate from 1

Even after all this, tuning hyperparameters still seems combinatorially complex.

Activation functions

A neural net consists of layers. Nodes in the bottom layer are linear equations. Nodes in a “hidden” layer transform a linear node into a non-linear node using an “activation function”. The crash course states “any mathematical function can serve as an activation function”.

A sigmoid is an example of an activation function. I remember from the module on logistic regression (notes) that we used a sigmoid to transform a linear equation into a probability.
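
To see why the activation function matters, here is a small sketch with made-up coefficients: stacking two linear nodes collapses into a single linear equation, while inserting a ReLU between them produces a bend that no single line can reproduce.

```python
def relu(z: float) -> float:
    return max(0.0, z)

def linear_a(x: float) -> float:
    return 2.0 * x + 1.0

def linear_b(y: float) -> float:
    return 3.0 * y - 4.0

def stacked_linear(x: float) -> float:
    """Two linear nodes in a row: algebraically just 6x - 1."""
    return linear_b(linear_a(x))

def stacked_with_relu(x: float) -> float:
    """Same nodes with a ReLU in between: flat, then sloped."""
    return linear_b(relu(linear_a(x)))

for x in (-2.0, -1.0, 0.0, 1.0):
    print(x, stacked_linear(x), stacked_with_relu(x))
```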

Why is it called a “neuron”?

The glossary definition for “neuron” is pretty good: 1) “taking in multiple input values and generating one output value”, and 2) “The neuron calculates the output value by applying an activation function.” Aside: this reminds me of lambda architecture. I appreciate TDS clarifying neurons “often take some linear combination of the inputs”, like w1x1 + w2x2 + w3x3. I suppose this is what the glossary means by “a weighted sum of input values”.
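
Putting the two halves of that definition together, a neuron can be sketched as a weighted sum followed by an activation function. The function name, the bias term and the example numbers are my own choices for illustration:

```python
import math
from typing import Callable, Sequence

def neuron(
    inputs: Sequence[float],
    weights: Sequence[float],
    bias: float,
    activation: Callable[[float], float],
) -> float:
    """Many inputs in, one output out: activation(w1*x1 + w2*x2 + ... + b)."""
    if len(inputs) != len(weights):
        raise ValueError("need exactly one weight per input")
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Weighted sum here is 0.5 - 2.0 + 1.5 = 0.0, and tanh(0.0) is 0.0.
output = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 0.5], 0.0, math.tanh)
print(output)
```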

TDS references a single image from the biological motivations section of Stanford’s CS231n, but I find both the images from that section useful for comparison.

I like TDS’ definition of a “layer” as “a “neural network” is simply made out of layers of neurons, connected in a way that the input of one layer of neuron is the output of the previous layer of neurons”. In that context, the hidden layer diagrams from the crash course make sense.

Norvig’s summary of ML for software engineers

Peter Norvig summarized the value of ML from a software engineering perspective in his “Introduction to Machine Learning” for Google’s Machine Learning Crash Course:

First, it gives you a tool to reduce the time you spend programming … Second, it will allow you to customize your products, making them better for specific groups of people … And third, machine learning lets you solve problems that you, as a programmer, have no idea how to do by hand.

From my perspective, the first two can be rephrased as:

  1. Models add a new dimension to code reuse
  2. For a class of problems, training models scales better than hand-writing code

There’s also a fourth point linked from the bottom of the intro:

Rule #1: Don’t be afraid to launch a product without machine learning

That fourth point reminds me of the “build” vs “grow” domains – until we’ve built a product that lots of people find useful, statistics-based growth tools, like large-scale AB testing, can be relatively high-cost, low-value. We might even say such optimizations only make sense once we have more users than can be efficiently contacted directly. Put another way, if we only have one user, and she says she only wants to see articles about sports, we don’t need ML to predict her interests.

I think about these four points a lot, almost like a koan. They provide a helpful anchor as I try to distill a large amount of theory into tools I can apply to the problems I’m familiar with.

MLCC: Regularization for sparsity

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization for Sparsity” module.

Best-practice: if you’re overfitting, you want to regularize.

“Convex Optimization” by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

  • “A problem is sparse if each constraint function depends on only a small number of the variables”
  • “Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems”, which would explain why gradient descent is a tool we use
  • Regularization is when “extra terms are added to the cost function”
  • “If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints”, so it would seem performance is another motivation for regularization

Ideally, we could perform L0 regularization, but that’s non-convex, and so, NP-hard (slide 7). (I like Math is Fun’s NP-complete page 🙂) As noted wrt gradient descent, we need a convex loss curve to optimize. L1 approximates L0 and is easy to compute.

Quora provides a couple intuitive explanations for L1 and L2 norms: “L2 norm there yields Euclidean distance … The L1 norm gives rise to what can be referred to as the “taxi-cab” distance”

Rorasa’s blog states “Norm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc … Because the lack of l0-norm’s mathematical representation, l0-minimisation is regarded by computer scientist as an NP-hard problem, simply says that it’s too complex and almost impossible to solve. In many case, l0-minimisation problem is relaxed to be higher-order norm problem such as l1-minimisation and l2-minimisation.”
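
The three norms discussed above are short to compute directly. A minimal sketch, with function names and an example vector of my own choosing: L0 counts the non-zero entries, L1 is the “taxi-cab” distance and L2 is the Euclidean distance.

```python
import math
from typing import Sequence

def l0_norm(v: Sequence[float]) -> int:
    """Number of non-zero entries (not a true norm, hence the scare quotes)."""
    return sum(1 for x in v if x != 0)

def l1_norm(v: Sequence[float]) -> float:
    """Taxi-cab distance: sum of absolute values."""
    return sum(abs(x) for x in v)

def l2_norm(v: Sequence[float]) -> float:
    """Euclidean distance: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in v))

weights = [3.0, 0.0, -4.0]
print(l0_norm(weights), l1_norm(weights), l2_norm(weights))
```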

The glossary summarizes:

  • L1 regularization “penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0”
  • L2 regularization “penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0”
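
One way to see the “exactly 0” versus “not quite 0” difference is to apply only the penalty to some weights, ignoring the data loss. The sketch below uses soft-thresholding for L1 and proportional shrinkage for L2; these update rules are a common illustration and my assumption, not something the crash course spells out, and the weights and strength are made up:

```python
def l1_step(w: float, strength: float) -> float:
    """Soft-thresholding: move toward 0 by a fixed amount, clamping at 0."""
    if w > strength:
        return w - strength
    if w < -strength:
        return w + strength
    return 0.0

def l2_step(w: float, strength: float) -> float:
    """Proportional shrinkage: smaller, but non-zero unless w was already 0."""
    return w * (1.0 - strength)

weights = [2.0, 0.05, -0.03, -1.5]  # hypothetical weights
strength = 0.1                      # hypothetical penalty strength

l1_weights = [l1_step(w, strength) for w in weights]
l2_weights = [l2_step(w, strength) for w in weights]
print(l1_weights)
print(l2_weights)
```

The small weights vanish under the L1 step, while the L2 step takes its biggest bite out of the largest weights and leaves everything non-zero.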

MLCC: Logistic regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Logistic Regression” module.

“Logistic regression” generates a probability (a value between 0 and 1). It’s also very efficient.

Note the glossary defines logistic regression as a classification model, which is weird since it has “regression” in the name. I suspect this is explained by “You can interpret the value between 0 and 1 in either of the following two ways: … a binary classification problem … As a value to be compared against a classification threshold …”

The “sigmoid” function, aka “logistic” function/transform, produces a bounded value between 0 and 1.

Note the sigmoid function is just y = 1 / (1 + e^(-𝞼)), where 𝞼 is our usual linear equation. I suppose we’re transforming the linear output into a logistic form.

Regularization (notes) is important in logistic regression, especially L2 regularization and early stopping: “Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions”.

The “logit”, aka “log-odds”, function is the inverse of the logistic function.
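
A quick sketch to check the inverse relationship: pushing a linear output through the sigmoid and then through the logit should recover the original value. The variable names and the example value are arbitrary:

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Log-odds: maps a probability in (0, 1) back to a real number."""
    return math.log(p / (1.0 - p))

linear_output = 0.8  # stands in for the output of the linear equation
probability = sigmoid(linear_output)
recovered = logit(probability)
print(probability, recovered)
```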

The loss function for logistic regression is “log loss”.
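
A minimal sketch of log loss, summing -y*log(p) - (1-y)*log(1-p) over the examples. Summing rather than averaging is my assumption (many libraries report the mean), and the labels and probabilities are made up:

```python
import math
from typing import Sequence

def log_loss(labels: Sequence[int], probabilities: Sequence[float]) -> float:
    """Sum of per-example log loss; probabilities must be strictly in (0, 1)."""
    if len(labels) != len(probabilities):
        raise ValueError("need one probability per label")
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total

confident_and_right = log_loss([1, 0], [0.9, 0.1])
confident_and_wrong = log_loss([1, 0], [0.1, 0.9])
print(confident_and_right, confident_and_wrong)
```

Confident wrong answers are penalized far more heavily than confident right answers are rewarded, which is the asymptotic behavior the regularization note refers to.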