MLCC: Representation

I am working through Google’s Machine Learning Crash Course. The notes in this post cover the “Representation” section.

feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. […] Much of the success of machine learning is actually success in engineering features that a learner can understand.

Scott Locklin, in “Neglected machine learning ideas” AQI Machine Learning Mastery’s feature engineering overview

I’ve heard 80% of data science is cleaning. This section introduces a nuance: cleaning includes a step mapping raw data into a format that’s appropriate and efficient for inputting into a model. The “scrubbing” sub-section actually seems like the only thing that fits what I previously thought of as “cleaning”, eg removing human errors, addressing incomplete data, etc.

The whole section has good recommendations I can see serving as an ongoing reference. For example:

  • Good feature values should appear more than 5 or so times in a data set … avoid unique IDs
  • Keep data pure by not encoding exceptional states into a feature’s value type, eg an integer feature where -1 means undefined, aka “magic” values. Instead, use boolean flags for exceptional states.

The “Z score” scales values as follows: scaled = (value - mean) stdev. Math is Fun has a good explanation for how to derive the standard deviation, but Pandas also provides it trivially in the output from describe.

“Binning” seems similar to *-hot encoding in that we’re enabling weights for each value, although the former concerns continuous values and the latter concerns discrete values. The feature cross video supports this by referring to both in the same context.

Histograms and stats, like those output by describe, can help detect bad data.