# MLCC: Neural Networks

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Neural Networks” module.

## Does “deep learning” imply neural networks?

The introductory video refers to “deep neural networks”, so I’m wondering what the relationship is between deep learning and neural networks.

“To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence deep learning)” – https://cs231n.github.io/neural-networks-1/

“Deep Learning is simply a subset of the architectures (or templates) that employs ‘neural networks’” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df (TDS)

“Deep learning” in Google’s glossary links to “deep model”: “A type of neural network containing multiple hidden layers.”

“However, until 2006 we didn’t know how to train neural networks to surpass more traditional approaches, except for a few specialized problems. What changed in 2006 was the discovery of techniques for learning in so-called deep neural networks.” – http://neuralnetworksanddeeplearning.com/about.html

Towards Data Science’s “Intuitive Deep Learning Part 1a: Introduction to Neural Networks” clarifies that “deep learning” is a subset of machine learning. I guess they’re both “learning”. I like the comparison of an algorithm to a recipe, and in this context, ML optimizes a recipe. Deep learning is one subset of those optimization techniques.

## When to use neural networks?

Small data with linear relationships → LSR

Large data with linear relationships → gradient descent

Large data with simple, nonlinear relationships → feature crosses

Large data with complex, nonlinear relationships → NN

“Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises
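To make “explicit feature cross” concrete, here is a minimal sketch. It assumes an XOR-like data set where the label is 1 when x1 and x2 share a sign; the sample points, weights, and function names are made up for illustration and do not come from the crash course.

```python
# Hypothetical XOR-like data: label is 1 when x1 and x2 have the same sign.
points = [(1.0, 2.0), (-1.5, -0.5), (2.0, -1.0), (-0.5, 3.0)]
labels = [1, 1, 0, 0]


def linear_on_raw(x1, x2, w1=1.0, w2=1.0, b=0.0):
    """A linear model on the raw features; these weights are arbitrary."""
    return 1 if w1 * x1 + w2 * x2 + b > 0 else 0


def linear_on_cross(x1, x2, w=1.0, b=0.0):
    """A linear model on the single crossed feature x1 * x2."""
    return 1 if w * (x1 * x2) + b > 0 else 0


raw_predictions = [linear_on_raw(x1, x2) for x1, x2 in points]
cross_predictions = [linear_on_cross(x1, x2) for x1, x2 in points]

print("raw:  ", raw_predictions)
print("cross:", cross_predictions)
```

The crossed model fits all four points with one weight, but only because someone decided in advance to multiply x1 by x2. The quote suggests a neural net can learn a comparable transformation without being told.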

“Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data” – http://neuralnetworksanddeeplearning.com/index.html

NN “have the flexibility to model many complicated relationships between input and output” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df

“That’s not to say that neural networks aren’t good at solving simpler problems. They are. But so are many other algorithms. The complexity, resource-intensiveness and lack of interpretability in neural networks is sometimes a necessary evil, but it’s only warranted when simpler methods are inapplicable” – https://www.quora.com/What-kinds-of-machine-learning-problems-are-neural-networks-particularly-good-at-solving

## Why are there multiple layers?

“each layer is effectively learning a more complex, higher-level function over the raw inputs” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy

“A single-layer neural network can only be used to represent linearly separable functions … Most problems that we are interested in solving are not linearly separable.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/
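A small sketch of that limitation, assuming the classic perceptron update rule on Boolean inputs (the function names and the epoch count are my own choices). AND is linearly separable, so a single layer learns it; XOR is not, so no setting of the weights classifies all four rows correctly.

```python
def train_perceptron(samples, epochs=50, lr=1):
    """Train a single-layer perceptron (no hidden layer) with the classic update rule."""
    w1, w2, b = 0, 0, 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            error = target - predicted
            w1 += lr * error * x1
            w2 += lr * error * x2
            b += lr * error
    return w1, w2, b


def accuracy(samples, weights):
    """Fraction of samples the trained weights classify correctly."""
    w1, w2, b = weights
    hits = sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == target
        for (x1, x2), target in samples
    )
    return hits / len(samples)


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_accuracy = accuracy(AND, train_perceptron(AND))
xor_accuracy = accuracy(XOR, train_perceptron(XOR))

print("AND accuracy:", and_accuracy)
print("XOR accuracy:", xor_accuracy)
```

The AND run settles on a separating line after a handful of epochs. The XOR run keeps cycling, because every line through the plane misclassifies at least one of the four points.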

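The universal approximation theorem states that a network with a single hidden layer can approximate a very wide class of functions, given enough hidden units. It does not say how many units are needed or that training will actually find the right weights. – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/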

“How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don’t need any hidden layers at all. Of course, you don’t need an NN to resolve your data either, but it will still do the job.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“One hidden layer is sufficient for the large majority of problems.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn it with two (or more) hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“Multi-layer” implies at least one hidden layer: “It has an input layer that connects to the input variables, one or more hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

Chris Olah’s “Neural Networks, Manifolds and Topology”, linked from the crash course, visualizes how data sets intersecting in n dimensions may be disjoint in n + 1 dimensions, which enables a linear solution. Other than that, though, Olah’s article was over my head. Articles like TDS are more my speed.
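Here is a toy sketch of the “one more dimension” idea, not taken from Olah’s article. It assumes made-up 1-D data where one class sits between two clusters of the other, and lifts each point into 2-D by adding x squared as a second coordinate. The helper name is hypothetical.

```python
# Hypothetical 1-D data: class 1 sits between two clusters of class 0,
# so no single threshold on x separates them.
xs = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [0, 0, 1, 1, 1, 0, 0]


def separable_by_threshold(values, labels):
    """Return True if some threshold t puts each class on its own side."""
    for t in sorted(values):
        below = {lab for v, lab in zip(values, labels) if v <= t}
        above = {lab for v, lab in zip(values, labels) if v > t}
        if len(below) <= 1 and len(above) <= 1 and below != above:
            return True
    return False


# Lift each point into 2-D by adding x**2 as a second coordinate.
lifted = [(x, x * x) for x in xs]
second_coordinate = [y for _, y in lifted]

print("separable on x alone:", separable_by_threshold(xs, labels))
print("separable on x**2:   ", separable_by_threshold(second_coordinate, labels))
```

In the lifted space a horizontal line such as y = 1 splits the classes, which is a linear boundary even though none existed on the original number line.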

## Why are some layers called “hidden”?

“The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

## How many layers do I need?

Task 4 in the exercise recommends playing around with the hyperparameters to get a certain loss, but the combinatorial complexity makes me wonder if there’s an intuitive way to think about the role of layers and neurons. 🤔

“Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
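The second rule is just arithmetic, so here it is as a tiny sketch. The function name is hypothetical, and rounding to a whole neuron (with a floor of one) is my own assumption, since the quote does not say what to do with fractional means.

```python
def rule_of_thumb_hidden_size(n_inputs, n_outputs):
    """Mean of the input and output layer sizes, rounded to a whole neuron."""
    return max(1, round((n_inputs + n_outputs) / 2))


# Example: 10 input features and 2 output classes.
hidden_size = rule_of_thumb_hidden_size(10, 2)
print(hidden_size)
```

This only gives a starting point for one hidden layer; the surrounding quotes all say experimentation decides the final size.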

“3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation)” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises It seems that narrowing the problem space to ReLU enables some deterministic optimization.
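To see how ReLU half-planes can combine into an XOR-like boundary, here is a hand-built sketch. It is not the exercise’s solution: it uses four ReLU neurons rather than three, because the identity |x1 + x2| - |x1 - x2| gives weights that can be written down without training. The weights and function names are my own.

```python
def relu(z):
    return max(0.0, z)


def xor_like_score(x1, x2):
    """Hand-picked weights computing |x1 + x2| - |x1 - x2| from four ReLU neurons.

    The score is positive when x1 and x2 share a sign and negative otherwise.
    """
    hidden = [
        relu(x1 + x2),    # active in the half-plane x1 + x2 > 0
        relu(-x1 - x2),   # active in the half-plane x1 + x2 < 0
        relu(x1 - x2),    # active in the half-plane x1 > x2
        relu(-x1 + x2),   # active in the half-plane x1 < x2
    ]
    output_weights = [1.0, 1.0, -1.0, -1.0]
    return sum(w * h for w, h in zip(output_weights, hidden))


for point in [(2.0, 3.0), (-2.0, -3.0), (2.0, -3.0), (-2.0, 3.0)]:
    print(point, xor_like_score(*point))
```

Each hidden neuron switches on in one half-plane, and the output layer adds and subtracts them. A trained network with fewer neurons only has to get the sign right on the actual data points, not on the whole plane.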

“The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem” – https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
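A rough sketch of why the gradient vanishes. It makes a deliberate simplification: it multiplies only the activation derivatives across layers and ignores the weights, which also enter the real backpropagated gradient. The sigmoid derivative peaks at 0.25, so even in the best case the product shrinks quickly with depth. The helper `chained_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_derivative(z):
    return 1.0 if z > 0 else 0.0


def chained_factor(derivative, z, n_layers):
    """Product of the activation derivative across n_layers, ignoring weights."""
    return derivative(z) ** n_layers


sigmoid_factor = chained_factor(sigmoid_derivative, 0.0, 10)  # 0.25 ** 10
relu_factor = chained_factor(relu_derivative, 1.0, 10)        # 1.0 ** 10

print("sigmoid, 10 layers:", sigmoid_factor)
print("relu, 10 layers:   ", relu_factor)
```

With ten layers the sigmoid factor is already below one in a million, while an active ReLU passes the gradient through unchanged.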

“use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting” – https://cs231n.github.io/neural-networks-1/#arch

“a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn’t enough to model this data set well” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

“A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

Two hidden layers with eight neurons in the first and two in the second performed well (~0.15 loss) on repeated runs.

Heuristics from spiral solution video:

1. Tune the number of layers and nodes. Putting the most neurons in the first layer and tapering down over a couple of layers to the output is a reasonable start. Each neuron takes time to train, though, so reduce total neurons if training is too slow. This is reinforced by the practice exercise, which started with two layers of 20 and 12 neurons, and then tried to reduce the number of neurons while keeping loss stable.
2. Reduce the learning rate to smooth the loss curve
3. Add regularization to further smooth the loss curve
4. Feature engineering helps with noisy data
5. Try different activation functions. Ultimately, tanh had the best fit
6. Iterate from 1

Even after all this, tuning hyperparameters still seems combinatorially complex.

## Activation functions

A neural net consists of layers. Nodes in the bottom layer are linear equations. Nodes in a “hidden” layer turn a linear node’s output into a non-linear one by applying an “activation function”. The crash course states “any mathematical function can serve as an activation function”.

A sigmoid is an example of an activation function. I remember from the module on logistic regression (notes) that we used a sigmoid to transform a linear equation into a probability.
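For comparison, here is a short sketch of three common activation functions applied to the same linear outputs. The selection (sigmoid, tanh, ReLU) and the sample values of z are my own; the crash course may present them differently.

```python
import math


def sigmoid(z):
    """Squashes any real number into (0, 1), which is why it can act as a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Passes positive values through and clips negatives to 0."""
    return max(0.0, z)


activations = {"sigmoid": sigmoid, "tanh": math.tanh, "relu": relu}

# z stands for the output of a linear node, e.g. w1*x1 + w2*x2 + b.
for z in (-2.0, 0.0, 2.0):
    row = ", ".join(f"{name}={fn(z):.3f}" for name, fn in activations.items())
    print(f"z={z:+.1f}: {row}")
```

All three bend the straight line z into something non-linear, which is the property that lets stacked layers model more than a single linear equation.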

## Why is it called a “neuron”?

The glossary definition for “neuron” is pretty good: 1) “taking in multiple input values and generating one output value”, and 2) “The neuron calculates the output value by applying an activation function.” Aside: this reminds me of lambda architecture. I appreciate TDS clarifying neurons “often take some linear combination of the inputs”, like w1x1 + w2x2 + w3x3. I suppose this is what the glossary means by “a weighted sum of input values”.
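Putting the two parts of that definition together, a neuron can be sketched as a weighted sum followed by an activation. The bias term, the choice of ReLU, and the numbers are assumptions for illustration; the glossary quotes above mention neither a bias nor a specific activation.

```python
def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def relu(z):
    return max(0.0, z)


# Three inputs, as in w1x1 + w2x2 + w3x3; the numbers are arbitrary.
output = neuron(
    inputs=[1.0, 2.0, 3.0],
    weights=[0.5, -1.0, 0.5],
    bias=0.25,
    activation=relu,
)
print(output)
```

Multiple values go in and one value comes out, which matches the first half of the glossary definition; the `activation` argument covers the second half.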

TDS references a single image from the biological motivations section of Stanford’s CS231n, but I find both images from that section useful for comparison.

I like TDS’ description of layers: “a “neural network” is simply made out of layers of neurons, connected in a way that the input of one layer of neuron is the output of the previous layer of neurons”. In that context, the hidden layer diagrams from the crash course make sense.
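That description translates fairly directly into code. The sketch below assumes a plain feed-forward network stored as a list of (weights, biases, activation) tuples, which is my own representation. The hand-picked weights happen to compute XOR on 0/1 inputs using two ReLU neurons in the hidden layer.

```python
def relu(z):
    return max(0.0, z)


def dense_layer(inputs, weights, biases, activation):
    """One layer: each row of weights (plus its bias) is one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Feed the output of each layer in as the input of the next."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense_layer(values, weights, biases, activation)
    return values


# Hypothetical 2-input network: a hidden layer of 2 ReLU neurons, then 1 linear output.
network = [
    ([[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0], relu),
    ([[1.0, -2.0]], [0.0], lambda z: z),
]

for a, b in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    print((a, b), forward([a, b], network))
```

The loop in `forward` is the whole idea: the list returned by one layer becomes the `inputs` of the next, and the layers in the middle are “hidden” because only the first input and the last output are visible from outside.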