The Data Scientist’s Holy Grail — Labeled Data Sets #ODSC in Medium
The Holy Grail for data scientists is the ability to obtain labeled data sets for training a supervised machine learning algorithm. An algorithm’s ability to “learn” depends on training it with a labeled training set: one in which each observation pairs a known response variable value with the corresponding predictor variable values.
There are a number of common, and some not-so-common, methods for labeling a data set. In this article, we’ll run down a short list of such methods so you can choose the one that best fits your specific circumstances.
Readily Available Labeled Data Sets:
Sometimes, labeled data sets are readily available as a byproduct of ongoing business operations. For example, if a company is trying to predict customer churn (a very common classification problem), the company’s data assets will likely contain the label values “churned” or “not-churned” based on the customer’s account history. The company knows when the customer canceled their account, thus generating a churn transaction.
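To make the churn example concrete, here is a minimal sketch of deriving the label from account history. The record layout (`customer_id`, `opened`, `canceled`), the sample rows, and the `as_of` cutoff date are all hypothetical, chosen only for illustration; a real system would read these from the company's own transaction data.

```python
from datetime import date

# Hypothetical account records; the field names and values are
# illustrative assumptions, not taken from any real system.
accounts = [
    {"customer_id": 1, "opened": date(2021, 1, 5), "canceled": date(2022, 3, 1)},
    {"customer_id": 2, "opened": date(2021, 6, 9), "canceled": None},
    {"customer_id": 3, "opened": date(2022, 2, 14), "canceled": date(2023, 8, 20)},
]


def label_churn(account, as_of):
    """Return "churned" if the account was canceled on or before as_of."""
    canceled = account["canceled"]
    if canceled is not None and canceled <= as_of:
        return "churned"
    return "not-churned"


def build_labels(accounts, as_of):
    """Map each customer_id to its churn label as of the given date."""
    return {a["customer_id"]: label_churn(a, as_of) for a in accounts}


labels = build_labels(accounts, as_of=date(2023, 1, 1))
print(labels)
```

The cutoff date matters: customer 3 cancels after the chosen `as_of` date, so that account is still labeled “not-churned” at that point in time. Fixing the label to a point in time is one way to keep later events from leaking into the training set.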
In other cases, the label is not readily available and must be acquired or derived. For example, in a real estate application that predicts the monthly rental value of a residential apartment building, the desired label may come only from a laborious process in which domain experts determine the value based on their industry knowledge. Finding label values this way can be time-consuming and labor-intensive, especially if the project needs a large amount of labeled data. ...