
Wednesday, April 18, 2018

Your Data Has to be Good. Define Good.

A very good piece that covers a lot of ground and is worth reading. Without enough quality data, you have nothing. It covers key elements of the process, like getting everyone involved early and often. Biases are mentioned, but the principal kinds of bias are not enumerated, and they often depend on the business domain. I like to count through the likely biases and specifically test for some of the worst.

If Your Data Is Bad, Your Machine Learning Tools Are Useless, by Thomas C. Redman in the HBR

Poor data quality is enemy number one to the widespread, profitable use of machine learning. While the caustic observation, “garbage-in, garbage-out” has plagued analytics and decision-making for generations, it carries a special warning for machine learning. The quality demands of machine learning are steep, and bad data can rear its ugly head twice — first in the historical data used to train the predictive model and second in the new data used by that model to make future decisions.

To properly train a predictive model, historical data must meet exceptionally broad and high quality standards. First, the data must be right: It must be correct, properly labeled, de-duped, and so forth. But you must also have the right data — lots of unbiased data, over the entire range of inputs for which one aims to develop the predictive model. Most data quality work focuses on one criterion or the other, but for machine learning, you must work on both simultaneously. ...."
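
To make "define good" a bit more concrete, here is a minimal Python sketch of the two kinds of checks the excerpt describes: is the data right (duplicates, missing labels), and is it the right data (coverage across the expected input range). It is my own illustration, not anything from the article. The function name `quality_report`, the field names `age` and `label`, the expected range of 18 to 80, and the choice of four bins are all assumptions made up for the example; real checks will depend on the business domain.

```python
from collections import Counter


def quality_report(rows, label_key, feature_key, expected_range, bins=4):
    """Run a few basic data quality checks on a list of dict records.

    Hypothetical helper for illustration only; names and thresholds
    are assumptions, not taken from the article.
    """
    lo, hi = expected_range

    # "The data must be right": count exact duplicate records.
    seen = Counter(tuple(sorted(r.items())) for r in rows)
    duplicates = sum(n - 1 for n in seen.values())

    # "The data must be right": count records with a missing or empty label.
    unlabeled = sum(1 for r in rows if r.get(label_key) in (None, ""))

    # "The right data": check coverage over the expected input range
    # by splitting it into equal-width bins and counting records per bin.
    width = (hi - lo) / bins
    counts = [0] * bins
    out_of_range = 0
    for r in rows:
        x = r.get(feature_key)
        if x is None or not (lo <= x <= hi):
            out_of_range += 1
            continue
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1

    return {
        "duplicates": duplicates,
        "unlabeled": unlabeled,
        "out_of_range": out_of_range,
        "bin_counts": counts,
        "empty_bins": [i for i, c in enumerate(counts) if c == 0],
    }


# Made-up sample records, purely for demonstration.
rows = [
    {"age": 22, "label": "yes"},
    {"age": 22, "label": "yes"},   # exact duplicate
    {"age": 35, "label": ""},      # missing label
    {"age": 41, "label": "no"},
    {"age": 95, "label": "no"},    # outside the expected range
]

report = quality_report(rows, "label", "age", expected_range=(18, 80))
print(report)
```

On this toy sample the report flags one duplicate, one unlabeled record, one out-of-range value, and two empty bins at the upper end of the range, which is exactly the kind of coverage gap that can hide a bias. Equal-width bins are the simplest choice here; a domain-specific segmentation would usually be more telling.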
