MLCC: Generalization

I am working through Google’s Machine Learning Crash Course. The notes in this post cover [1] through [3].

The fundamental tension of machine learning is between fitting our data well and fitting it as simply as possible.

A reasonable guideline: “The less complex an ML model, the more likely that a good empirical result is not just due to the peculiarities of the sample.”

[2] recommends a best practice: divide labeled examples into “training” and “test” sets.

Never train on test data! Suspiciously high accuracy, such as 100% on the test set, can be a symptom that test examples leaked into training.

[3] goes further: divide labeled examples into three sets: “training”, “validation” and “test”. Repeatedly evaluating against a single “test” set and tweaking the model in response risks overfitting to that set. Instead, iterate against the validation set, and only double-check the final model against the test set.
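
For concreteness, here is a minimal sketch of such a three-way split, assuming the labeled examples are NumPy arrays; the function name and the 60/20/20 ratios are my own choices, not something prescribed by the course.

```python
import numpy as np

def three_way_split(features, labels, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle, then partition labeled examples into train / validation / test."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))  # break any pre-existing ordering
    n_train = int(len(features) * train_frac)
    n_val = int(len(features) * val_frac)
    train, val, test = np.split(indices, [n_train, n_train + n_val])
    return ((features[train], labels[train]),
            (features[val], labels[val]),
            (features[test], labels[test]))
```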

A continuing impression is that TensorFlow builds in a lot of the best practices described in this crash course. For example, splitting out a validation set and evaluating against it during training is a first-class argument to Keras’s Model.fit method.
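
For example, here is a sketch with a hypothetical toy model and made-up data; only the validation arguments to fit are the point.

```python
import numpy as np
import tensorflow as tf

# Toy data so the example runs end-to-end; shapes and sizes are arbitrary.
train_features = np.random.rand(1000, 8).astype("float32")
train_labels = np.random.rand(1000, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# validation_split holds out the last 20% of the training data, and Keras
# reports val_loss alongside loss after each epoch. Note that it takes the
# *last* fraction of the rows, so shuffle first if the data has an ordering.
history = model.fit(train_features, train_labels,
                    epochs=10, validation_split=0.2)

# An explicit, separately prepared split can be passed instead:
# model.fit(train_features, train_labels, epochs=10,
#           validation_data=(val_features, val_labels))
```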

The exercise associated with [3] is interesting. First, testing against a validation set caught a bug! Second, the cause was subtle: the data was sorted on the latitude column by default, so a validation set taken from the end of the data was not a random sample.
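
Here is a hypothetical reconstruction of that failure mode; the column names follow the California housing data the course uses, but the values are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame({
    "latitude": rng.uniform(32.0, 42.0, 1000),
    "median_house_value": rng.uniform(50_000, 500_000, 1000),
})

# The bug: with the frame sorted by latitude, taking the tail for
# validation samples only the northernmost rows.
sorted_df = df.sort_values("latitude")
bad_validation = sorted_df.tail(200)   # not a random sample!

# The fix: shuffle the row order before splitting.
shuffled = df.reindex(np.random.permutation(df.index))
good_validation = shuffled.tail(200)   # now approximately a random sample
```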

References

  1. Google Machine Learning Crash Course: “Generalization”
  2. Google Machine Learning Crash Course: “Training and Test Sets”
  3. Google Machine Learning Crash Course: “Validation Set”
