*Introducing MLOps* by Mark Treveil et al. provides a thorough but relatively non-technical, enterprise-level introduction to MLOps. Being at a big company and new to ML, I found this book helpful for developing a big-picture view of how to build and maintain ML infrastructure.

# Tag: machine learning

# Data Pipeline Pocket Reference

The *Data Pipeline Pocket Reference* by James Densmore is a practical overview of pipeline concepts and terminology. It demonstrates most concepts using framework-agnostic Python scripts. It also provides a good introduction to MLOps by recommending popular solutions to common problems, like Apache Airflow for orchestration. I’d recommend it to anyone ramping up on MLOps.

# MLCC: Neural Networks

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Neural Networks” module.

## Does “deep learning” imply neural networks?

The introductory video refers to “deep neural networks”, so I’m wondering what the relationship is between deep learning and neural networks.

Yes, according to Quora’s “Does deep learning always mean neural network or can include other ML techniques?”.

“To give you some context, modern Convolutional Networks contain on orders of 100 million parameters and are usually made up of approximately 10-20 layers (hence *deep learning*)” – https://cs231n.github.io/neural-networks-1/

“Deep Learning is simply a subset of the architectures (or templates) that employs ‘neural networks’” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df (TDS)

“Deep learning” in Google’s glossary links to “deep model”: “A type of neural network containing multiple hidden layers.”

“However, until 2006 we didn't know how to train neural networks to surpass more traditional approaches, except for a few specialized problems. What changed in 2006 was the discovery of techniques for learning in so-called deep neural networks.” – http://neuralnetworksanddeeplearning.com/about.html

Towards Data Science’s “Intuitive Deep Learning Part 1a: Introduction to Neural Networks” clarifies “deep learning” is a subset of machine learning. I guess they’re both “learning”. I like the comparison of an algorithm to a recipe, and in this context, ML optimizes a recipe. Deep learning is a subset of optimization techniques.

## When to use neural networks?

Small data with linear relationships → LSR

Large data with linear relationships → gradient descent

Large data with simple, nonlinear relationships → feature crosses

Large data with complex, nonlinear relationships → NN

“Neural nets will give us a way to learn nonlinear models without the use of explicit feature crosses” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises
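To make the “explicit feature cross” idea concrete, here is a minimal sketch. The four points and their labels are made up for illustration (an XOR-style pattern where the label is 1 when both coordinates share a sign). No straight line in the (x1, x2) plane separates them, but a rule on the single crossed feature x1 * x2 does.

```python
# Hypothetical XOR-style data: label is 1 when x1 and x2 have the same sign.
points = [(1.0, 2.0), (-1.5, -0.5), (2.0, -1.0), (-0.5, 1.5)]
labels = [1, 1, 0, 0]


def predict_with_cross(x1, x2):
    """Threshold the hand-made feature cross x1 * x2 at zero."""
    return 1 if x1 * x2 > 0 else 0


predictions = [predict_with_cross(x1, x2) for x1, x2 in points]
print(predictions)
```

A neural net would have to learn an equivalent nonlinear feature itself instead of being handed `x1 * x2`.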

“Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data” – http://neuralnetworksanddeeplearning.com/index.html

NN “have the flexibility to model many complicated relationships between input and output” – https://towardsdatascience.com/intuitive-deep-learning-part-1a-introduction-to-neural-networks-aaeb3a1500df

“That’s not to say that neural networks aren’t good at solving simpler problems. They are. But so are many other algorithms. The complexity, resource-intensiveness and lack of interpretability in neural networks is sometimes a necessary evil, but it’s only warranted when simpler methods are inapplicable” – https://www.quora.com/What-kinds-of-machine-learning-problems-are-neural-networks-particularly-good-at-solving

## Why are there multiple layers?

“each layer is effectively learning a more complex, higher-level function over the raw inputs” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy

“A single-layer neural network can only be used to represent linearly separable functions … Most problems that we are interested in solving are not linearly separable.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

The universal approximation theorem states that a network with a single hidden layer can approximate any continuous function to arbitrary accuracy, given enough neurons – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“How many hidden layers? Well if your data is linearly separable (which you often know by the time you begin coding a NN) then you don't need any hidden layers at all. Of course, you don't need an NN to resolve your data either, but it will still do the job.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“One hidden layer is sufficient for the large majority of problems.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“Even for those functions that can be learned via a sufficiently large one-hidden-layer MLP, it can be more efficient to learn it with two (or more) hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“Multi-layer” implies at least one hidden layer: “It has an input layer that connects to the input variables, one or more hidden layers” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

Chris Olah’s “Neural Networks, Manifolds and Topology”, linked from the crash course, visualizes how data sets intersecting in n dimensions may be disjoint in n + 1 dimensions, which enables a linear solution. Other than that, though, Olah’s article was over my head. Articles like TDS are more my speed.

## Why are some layers called “hidden”?

“The interior layers are sometimes called “hidden layers” because they are not directly observable from the systems inputs and outputs.” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

## How many layers do I need?

Task 4 in the exercise recommends playing around with the hyperparameters to get a certain loss, but the combinatorial complexity makes me wonder if there’s an intuitive way to think about the role of layers and neurons. 🤔

“Regardless of the heuristics you might encounter, all answers will come back to the need for careful experimentation to see what works best for your specific dataset” – https://machinelearningmastery.com/how-to-configure-the-number-of-layers-and-nodes-in-a-neural-network/

“In sum, for most problems, one could probably get decent performance (even without a second optimization step) by setting the hidden layer configuration using just two rules: (i) number of hidden layers equals one; and (ii) the number of neurons in that layer is the mean of the neurons in the input and output layers.” – https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw

“3 neurons are enough because the XOR function can be expressed as a combination of 3 half-planes (ReLU activation)” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises Seems narrowing the problem space to ReLU enables some deterministic optimization.

“The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem” – https://machinelearningmastery.com/rectified-linear-activation-function-for-deep-learning-neural-networks/

“use as big of a neural network as your computational budget allows, and use other regularization techniques to control overfitting” – https://cs231n.github.io/neural-networks-1/#arch

“a model with 1 neuron in the first hidden layer cannot learn a good model no matter how deep it is. This is because the output of the first layer only varies along one dimension (usually a diagonal line), which isn't enough to model this data set well” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

“A single layer with more than 3 neurons has more redundancy, and thus is more likely to converge to a good model” – https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/playground-exercises

Two hidden layers with eight neurons in the first and two in the second performed well (~0.15 loss) on repeated runs.

Heuristics from spiral solution video:

- Tune number of layers and nodes. Max neurons in the first layer, tapering down a couple layers to the output is a reasonable start. Each neuron takes time to train, though, so reduce total neurons if training is too slow. This is reinforced by the practice exercise, which started with two layers of 20 and 12 neurons, and then tried to reduce the number of neurons while keeping loss stable.
- Reduce the learning rate to smooth loss curve
- Add regularization to further smooth loss curve
- Feature engineering helps with noisy data
- Try different activation functions. Ultimately, tanh had the best fit
- Repeat from the first step

Even after all this, tuning hyperparameters still seems combinatorially complex.

## Activation functions

A neural net consists of layers. Nodes in the bottom layer are linear equations. Nodes in a “hidden” layer transform a linear node into a non-linear node using an “activation function”. The crash course states “any mathematical function can serve as an activation function”.

A sigmoid is an example of an activation function. I remember from the module on logistic regression (notes) that we used a sigmoid to transform a linear equation into a probability.

## Why is it called a “neuron”?

The glossary definition for “neuron” is pretty good: 1) “taking in multiple input values and generating one output value”, and 2) “The neuron calculates the output value by applying an activation function.” Aside: this reminds me of lambda architecture. I appreciate TDS clarifying neurons “often take some linear combination of the inputs”, like w1x1 + w2x2 + w3x3. I suppose this is what the glossary means by “a weighted sum of input values”.

TDS references a single image from the biological motivations section of Stanford’s CS231n, but I find both the images from that section useful for comparison.

I like TDS’ definition of a “layer” as “a “neural network” is simply made out of layers of neurons, connected in a way that the input of one layer of neuron is the output of the previous layer of neurons”. In that context, the hidden layer diagrams from the crash course make sense.
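Putting the glossary and TDS definitions together, a neuron and a layer can be sketched in a few lines of plain Python. This is only an illustration: the function names, the bias term, and the example weights are my own assumptions, not anything from the crash course.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Pass positive values through, clamp negatives to 0."""
    return max(0.0, z)


def neuron(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs (w1x1 + w2x2 + ...), then an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)


def layer(inputs, weight_rows, biases, activation=sigmoid):
    """One neuron per row of weights; every neuron sees the same inputs."""
    return [neuron(inputs, w, b, activation) for w, b in zip(weight_rows, biases)]


# The output of the hidden layer becomes the input of the output neuron.
hidden = layer([1.0, 2.0, 3.0], [[0.5, -0.25, 0.0], [1.0, 1.0, 0.0]], [0.0, 0.0], relu)
output = neuron(hidden, [1.0, 1.0])
```

Swapping `relu` for `sigmoid` (or any other function) in the hidden layer is all it takes to try a different activation.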

# Norvig’s summary of ML for software engineers

Peter Norvig summarized the value of ML from a software engineering perspective in his “Introduction to Machine Learning” for Google’s Machine Learning Crash Course:

First, it gives you a tool to reduce the time you spend programming … Second, it will allow you to customize your products, making them better for specific groups of people … And third, machine learning lets you solve problems that you, as a programmer, have no idea how to do by hand.

From my perspective, the first two can be rephrased as:

- Models add a new dimension to code reuse
- For a class of problems, training models scales better than hand-writing code

There’s also a fourth point linked from the bottom of the intro:

Rule #1: Don’t be afraid to launch a product without machine learning

That fourth point reminds me of the “build” vs “grow” domains – until we've built a product that lots of people find useful, statistics-based growth tools, like large-scale AB testing, can be relatively high-cost, low-value. We might even say such optimizations only make sense once we have more users than can be efficiently contacted directly. Put another way, if we only have one user, and she says she only wants to see articles about sports, we don't need ML to predict her interests.

I think about these four points a lot, almost like a koan. They provide a helpful anchor as I try to distill a large amount of theory into tools I can apply to the problems I’m familiar with.

# “Beautiful Future: How Deschutes Uses Artificial Intelligence & Machine Learning to Brew Better Beer”

Craft Beer and Brewing’s article “Beautiful Future: How Deschutes Uses Artificial Intelligence & Machine Learning to Brew Better Beer” describes an intuitive application of ML. Deschutes brewery wanted to more accurately predict when a given fermentation process would complete. The problem statement is simple:

Produce the same amount of beer in less time, while maintaining or improving the quality of the beer along the way, and you’ll have more resources for the intentional play that leads to new beers that drinkers love.

I like the explicit recognition that reducing toil frees time for more valuable activities. This is reiterated later:

Most beer consumers aren’t concerned with how efficiently or cost-effectively a brewery makes their beer – they want high-quality beer, and they want new and exciting beers.

Fermentation sounds like a relatively simple curve to plot. It’s easy to imagine manually monitoring something like sugar content vs time, and then using that data to train a model.

Brewers now trust automation to act on the predictions:

Today, cellar operators at Deschutes have such a high level of confidence in the algorithm that they typically allow the software to trigger next steps in the brewing process.

The automation is also easy to imagine. Deschutes' Brewery Pi project targets Raspberry Pi, which I can see being used to drive hardware to adjust temperature, add nutrients, drain a fermentation vessel, etc. I really like how Deschutes made the code open-source 🍻

# New York Times’ “How We’ll Think Tomorrow”

The NY Times has a series of articles exploring machine learning and artificial intelligence.

# MLCC: Regularization for sparsity

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization for Sparsity” module.

Best-practice: if you’re overfitting, you want to regularize.

"Convex Optimization" by Boyd and Vandenberghe, linked from multiple glossary entries, touches on many of the points made by the crash course:

- “A problem is sparse if each constraint function depends on only a small number of the variables”
- “Like least-squares or linear programming, there are very effective algorithms that can reliably and efficiently solve even large convex problems”, which would explain why gradient descent is a tool we use
- Regularization is when “extra terms are added to the cost function”
- “If the problem is sparse, or has some other exploitable structure, we can often solve problems with tens or hundreds of thousands of variables and constraints”, so it would seem performance is another motivation for regularization

Ideally, we could perform L_{0} regularization, but that’s non-convex, and so, NP-hard (slide 7). (I like Math is Fun's NP-complete page 🙂) As noted wrt gradient descent, we need a convex loss curve to optimize. L_{1} approximates L_{0} and is easy to compute.

Quora provides a couple intuitive explanations for L1 and L2 norms: “L2 norm there yields Euclidean distance … The L1 norm gives rise to what can be referred to as the "taxi-cab" distance”
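For my own reference, the three quantities are easy to compute by hand. This is a small sketch with function names I made up; the L0 “norm” here is just the count of non-zero entries.

```python
import math


def l0_count(vector):
    """Number of non-zero entries (the L0 'norm')."""
    return sum(1 for x in vector if x != 0)


def l1_norm(vector):
    """Taxi-cab distance from the origin: sum of absolute values."""
    return sum(abs(x) for x in vector)


def l2_norm(vector):
    """Euclidean distance from the origin: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vector))


v = [3.0, 0.0, -4.0]
print(l0_count(v), l1_norm(v), l2_norm(v))
```

For `v`, the taxi-cab route covers 3 + 4 blocks, while the straight-line distance is the hypotenuse of a 3-4-5 triangle.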

Rorasa’s blog states â€śNorm may come in many forms and many names, including these popular name: Euclidean distance, Mean-squared Error, etc â€¦ Because the lack of l_{0}-normâ€™s mathematical representation, l_{0}-minimisation is regarded by computer scientist as an NP-hard problem, simply says that itâ€™s too complex and almost impossible to solve. In many case, l_{0}-minimisation problem is relaxed to be higher-order norm problem such as l_{1}-minimisation and l_{2}-minimisation.â€ť

The glossary summarizes:

- L_{1} regularization “penalizes weights in proportion to the sum of the absolute values of the weights. In models relying on sparse features, L_{1} regularization helps drive the weights of irrelevant or barely relevant features to exactly 0”
- L_{2} regularization “penalizes weights in proportion to the sum of the squares of the weights. L_{2} regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0”
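One way to see why L1 reaches “exactly 0” while L2 only gets “closer to 0” is to apply each penalty repeatedly to a single weight, ignoring the data loss entirely. The sketch below is my own illustration, not the crash course's method: the L1 update uses soft-thresholding (a standard way to handle the absolute value's kink at zero), and the learning rate and lambda values are arbitrary.

```python
import math


def l1_step(weight, learning_rate, lam):
    """Soft-thresholding: shrink |weight| by a fixed amount, clamping at 0."""
    shrunk = max(abs(weight) - learning_rate * lam, 0.0)
    return math.copysign(shrunk, weight)


def l2_step(weight, learning_rate, lam):
    """Gradient step on lam * weight**2 / 2: shrink by a fixed fraction."""
    return weight * (1.0 - learning_rate * lam)


w1 = w2 = 1.0
for _ in range(100):
    w1 = l1_step(w1, learning_rate=0.1, lam=0.5)
    w2 = l2_step(w2, learning_rate=0.1, lam=0.5)

print(w1, w2)
```

The L1 weight loses a constant amount per step, so it hits zero and stays there. The L2 weight loses a constant fraction, so it keeps shrinking but stays positive.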

# MLCC: Logistic regression

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Logistic Regression” module.

“Logistic regression” generates a probability (a value between 0 and 1). It’s also very efficient.

Note the glossary defines logistic regression as a classification model, which is weird since it has “regression” in the name. I suspect this is explained by “You can interpret the value between 0 and 1 in either of the following two ways: … a binary classification problem … As a value to be compared against a classification threshold …”

The “sigmoid” function, aka “logistic” function/transform, produces a bounded value between 0 and 1.

Note the sigmoid function is just `y = 1 / (1 + e^(-σ))`

where σ is our usual linear equation. I suppose we’re transforming the linear output into a logistic form.
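Written out as code, the two steps look like this. It is a minimal sketch: `linear_output` and its example weights are hypothetical stand-ins for “our usual linear equation”.

```python
import math


def linear_output(features, weights, bias):
    """The usual linear equation: bias + w1*x1 + w2*x2 + ..."""
    return bias + sum(w * x for w, x in zip(weights, features))


def sigmoid(z):
    """Map the linear output to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


z = linear_output([2.0, -1.0], weights=[0.5, 1.0], bias=0.0)  # 1.0 - 1.0 = 0.0
probability = sigmoid(z)
```

A linear output of 0 lands exactly in the middle of the range. Large positive outputs approach 1, and large negative outputs approach 0.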

Regularization (notes) is important in logistic regression, especially L_{2} regularization and early stopping: “Without regularization, the asymptotic nature of logistic regression would keep driving loss towards 0 in high dimensions”.

The “logit”, aka “log-odds”, function is the inverse of the logistic function.

The loss function for logistic regression is “log loss”.
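Both ideas fit in a few lines. In this sketch the function names are mine, and `log_loss` averages over examples, which is a common convention but an assumption on my part (the crash course writes it as a sum).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Log-odds: undoes the sigmoid for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def log_loss(labels, probabilities):
    """Mean of -y*log(p) - (1 - y)*log(1 - p) over all examples."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


round_trip = logit(sigmoid(2.0))            # back to roughly 2.0
confident = log_loss([1, 0], [0.9, 0.1])    # good predictions, small loss
unsure = log_loss([1, 0], [0.5, 0.5])       # coin-flip predictions
```

Predictions of exactly 0 or 1 would make `math.log` fail, which is one practical reason to keep predicted probabilities strictly inside the range.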

# MLCC: Classification

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Classification” module.

New metrics for evaluating classification performance:

- Accuracy
- Precision
- Recall
- ROC
- AUC

## Accuracy

“Accuracy” simply measures percentage of correct predictions.

It fails on class-imbalance, aka “skewed class”, problems, though. Neptune AI states it bluntly: “You shouldn’t use accuracy on imbalanced problems.” Heuristic: is the percent accuracy > the imbalance? For example, if a population is 99% disease-free, an accuracy of 99% requires no intelligence. This is called the “accuracy paradox”. Precision and recall are better suited to class-imbalance problems.
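The 99% example is easy to reproduce with made-up data: 100 hypothetical patients, one of whom has the disease, and a “model” that always predicts disease-free.

```python
# Hypothetical population: 1 positive (has the disease), 99 negatives.
labels = [1] * 1 + [0] * 99

# A "model" with no intelligence at all: always predict negative.
predictions = [0] * len(labels)

correct = sum(1 for y, p in zip(labels, predictions) if y == p)
accuracy = correct / len(labels)
print(accuracy)
```

The accuracy matches the class imbalance exactly, which is the heuristic above: an accuracy no better than the imbalance tells us nothing.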

Tip: calculate odds independently if possible to compare with accuracy.

## Confusion matrix

A “confusion matrix”, aka “classification matrix”, quantifies predicted vs actual outcomes, which is useful for evaluating model performance.

A false positive is a “type one” error. A false negative is a “type two” error. When the cost of a missed positive is high, type two errors must be minimized. In other words, when false negatives are costly, maximize recall.

## Precision and recall

Andrew Ng’s “Lecture 11.4 – Machine Learning System Design | Trading Off Precision And Recall” provides a helpful phrasing:

- Precision = true positive / predicted positive
- Recall = true positive / actual positive

Regarding the accuracy paradox, if a model simply predicts negative all the time (eg because 99% of email isn’t spam), it will fail recall and precision because it never has a true positive.
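Here is a small sketch of those two formulas, counting the confusion-matrix cells by hand. The labels and predictions are made up, and returning 0.0 when a denominator is zero is my own convention (strictly speaking, precision is undefined when nothing is predicted positive).

```python
def confusion_counts(labels, predictions):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp, fp):
    """True positive / predicted positive (0.0 if nothing was predicted positive)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """True positive / actual positive (0.0 if there are no actual positives)."""
    return tp / (tp + fn) if tp + fn else 0.0


labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(labels, predictions)

# The always-negative model from the accuracy paradox never has a true positive.
lazy_tp, lazy_fp, lazy_fn, lazy_tn = confusion_counts(labels, [0] * len(labels))
```

With these numbers both precision and recall come out to 2/3, while the always-negative model scores 0 on both despite 62.5% accuracy.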

Wikipedia makes a point: “It is trivial to achieve recall of 100% by returning all documents in response to any query”

Precision and recall are important, and in tension. Classification depends on a “threshold”. Increasing the threshold typically increases precision, but decreases recall. Wikipedia uses surgery for a brain tumor to illustrate: a conservative approach increases the risk of false negative; an aggressive approach increases risk of false positive. Plotting the “precision-recall curve” can also help show the relationship, as demonstrated by Andrew Ng.
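The tension shows up even on a toy example. The six scores and labels below are invented for illustration, and the helper name is mine; a score at or above the threshold counts as a positive prediction.

```python
def precision_recall_at(threshold, labels, scores):
    """Classify each score against the threshold, then compute (precision, recall)."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(labels, predicted))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

low = precision_recall_at(0.3, labels, scores)   # permissive threshold
high = precision_recall_at(0.7, labels, scores)  # strict threshold
```

The permissive threshold catches every positive but lets in two false positives; the strict one makes no false alarms but misses a positive.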

Wikipedia has a nice visualization differentiating precision and recall.

## ROC and AUC

The “ROC curve” helps identify the best threshold.

“AUC” compares ROCs, helping identify the best model.

StatQuest’s “ROC and AUC, Clearly Explained!” states precision is a better metric than the false positive rate for class imbalance problems because it doesn’t take true negatives into account.

Keras gives us AUC for a model, but what’s the corresponding threshold? The crash course clarifies: “AUC is classification-threshold-invariant. It measures the quality of the model's predictions irrespective of what classification threshold is chosen.” Ok, then why use anything but AUC? Neptune AI summarizes: “… use it when you care equally about positive and negative classes.”
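Threshold invariance is easier to believe after computing AUC without ever picking a threshold. The sketch below uses the ranking interpretation of AUC (the fraction of positive/negative pairs where the positive gets the higher score, with ties counted as half). It is a brute-force illustration on made-up data, not how Keras computes it.

```python
def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]
model_auc = auc(labels, scores)

# A model that ranks every positive above every negative gets a perfect score.
perfect_auc = auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

Only the ordering of the scores matters here, which is why no threshold appears anywhere in the function.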

## Prediction bias

Seems like this is another way of quantifying model performance. If we know a probability of occurrence and the model produces a significantly different probability, that indicates something’s amiss.

The formal definition is: average predicted occurrence – average actual occurrence. There’s a helpful note that a model simply returning the average occurrence would have zero prediction bias, but would still be a bad model.
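The definition and the caveat can both be checked with a few made-up numbers. The function name and the data below are illustrative only.

```python
def prediction_bias(predictions, labels):
    """Average predicted occurrence minus average actual occurrence."""
    return sum(predictions) / len(predictions) - sum(labels) / len(labels)


labels = [1, 0, 0, 0]  # the event actually occurs 25% of the time

# A model that ignores its input and always returns the average occurrence.
constant_model = [0.25, 0.25, 0.25, 0.25]

# A model that systematically over-predicts the event.
overconfident_model = [0.9, 0.6, 0.5, 0.4]

constant_bias = prediction_bias(constant_model, labels)
overconfident_bias = prediction_bias(overconfident_model, labels)
```

The constant model has zero bias even though it cannot tell any two examples apart, so zero prediction bias is necessary but not sufficient.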

The crash course gives a few causes for bias. StatQuest’s “Machine Learning Fundamentals: Bias and Variance” adds another: the inability of a ML algorithm to capture the true relationship between features and labels, eg linear regression trying to capture a curved relationship.

Fix prediction bias in the model, rather than adjusting the model output.

Interesting clarification that predicted values are a probability range, but actual values are discrete, so we need to segment values and average them to make a comparison.

# MLCC: Regularization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Regularization” module.

An earlier module focused on generalization (notes). A “generalization curve” visualizes generalization by showing loss for training data vs loss for validation data.

When training loss is less than validation loss, we’re “overfitting” to the training data, reducing generalization.

“Regularization” is the process of preventing overfitting. The TensorFlow docs also discuss regularization.

“Empirical risk minimization” refers to loss reduction using tools like gradient descent (notes).

“Structural risk minimization” refers to regularization by minimizing the complexity of the model.

The “L2 regularization” formula quantifies complexity as the sum of the squares of the feature weights.

“Lambda” aka “regularization rate” governs the amount of regularization applied. Increasing lambda strengthens regularization, resulting in a steeper histogram of weights, for example. A tool called Vizier can help optimize lambda.
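As a sanity check on how the pieces combine, here is a minimal sketch of the L2 complexity term and how lambda scales it. The function names, weights, and loss value are all hypothetical.

```python
def l2_penalty(weights):
    """Complexity as the sum of the squares of the feature weights."""
    return sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """Training objective: data loss plus lambda times the complexity term."""
    return data_loss + lam * l2_penalty(weights)


weights = [0.5, -1.0, 2.0]
penalty = l2_penalty(weights)                      # 0.25 + 1.0 + 4.0
unregularized = regularized_loss(0.1, weights, lam=0.0)
regularized = regularized_loss(0.1, weights, lam=0.1)
```

The largest weight dominates the penalty because it is squared, so raising lambda pushes hardest on outlier weights.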

Helpful phrasing from StatQuest’s "Machine Learning Fundamentals: Bias and Variance": regularization is one technique for finding a balance between a simple model (that may have high bias) and a complex model (that may have high variability).

## Exercise 1

The answer for task 1 in the first exercise notes the “relative weight” of lines from FEATURE to OUTPUT in the playground. What is "relative weight"? 🤔 Later, the second exercise mentions “The relative thickness of each line running from FEATURES to OUTPUT represents the learned weight for that feature or feature cross. You can find the exact weight values by hovering over each line.” So, “relative weight” in this context is just referring to the weight of one line relative to another, rather than a novel concept.

The answer for task 1 states: “The lines emanating from X_{1} and X_{2} are much thicker than those coming from the feature crosses. So, the feature crosses are contributing far less to the model than the normal (uncrossed) features.” Task 2 states “If we use a model that is too complicated, such as one with too many crosses …” Later, we learn “If model complexity is a function of weights …” Is complexity a function of crosses or weights? 🤔 I guess the idea is that the additional complexity of the crosses was driving up the weight of the uncrossed features, irrespective of regularization. Running the playground with and without the cross supports this, eg ~1.5, 0.131 and 0.033, respectively, vs ~0.9 with losses 0.096 and 0.039. Running with the cross and 0.3 regularization results in ~0.3, 0.092 and 0.059. Running with just 0.3 regularization results in ~0.3, 0.093 and 0.061. So it would seem there are at least a couple of orthogonal components to “complexity”.

## Exercise 2

An answer in the playground mentions: “While test loss decreases, training loss actually increases. This is expected, because you've added another term to the loss function to penalize complexity.” 🤔 I think this is referring to the literal addition of the complexity term in the calculation to find a weight (`minimize(loss(data|model) + complexity(model))`).