The Eponymous Pickle: Labeled Datasets for Use AND Value

Tuesday, January 22, 2019

Note how this links to other goals, such as understanding the value of datasets as an asset. Why not also use labeling as a way to attach value analyses to the data? Labels are usually assigned for business purposes, and such labels often do not carry over when the data is linked to a specific analytic approach.

The Data Scientist’s Holy Grail — Labeled Data Sets #ODSC in Medium

"The Holy Grail for data scientists is the ability to obtain labeled data sets for the purpose of training a supervised machine learning algorithm. An algorithm’s ability to “learn” is based on training it using a labeled training set, one having known response variable values that correspond to a number of predictor variable values.

There are a number of common and maybe not-so-common methods for labeling a data set. In this article, we’ll run down a short list of such methods and then you can choose the best for your specific circumstances.

Readily Available Labeled Data Sets:

Sometimes, labeled datasets are readily available as a byproduct of on-going business operations. For example, if a company is trying to predict customer churn (a very common classification problem), the company’s data assets will likely contain the label values: “churned,” or “not-churned” based on the customer’s account history. The company knows when the customer canceled their account, thus generating a churn transaction.

Sometimes, the label is not readily available and must be acquired or derived. For example, in a real estate application that wishes to predict the monthly rental value of a residential apartment building, the desired label may only come from a laborious process conducted by problem domain experts who can determine the value based on their industry knowledge. Sometimes finding label values can be time-consuming and labor-intensive, especially if a large amount of labeled data is needed for the project. ... "
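To make the churn example in the quoted article concrete, here is a minimal sketch of deriving the “churned” / “not-churned” label from account history. The `Account` record, its field names, and the `as_of` cutoff date are all assumptions made up for illustration; they do not come from the article, and a real account table will look different.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Account:
    # Hypothetical record layout; real account tables will differ.
    customer_id: str
    opened_on: date
    cancelled_on: Optional[date] = None  # None means no cancellation on record


def churn_label(account: Account, as_of: date) -> str:
    """Return "churned" if a cancellation was recorded on or before as_of."""
    if account.cancelled_on is not None and account.cancelled_on <= as_of:
        return "churned"
    return "not-churned"


def label_accounts(accounts, as_of):
    """Map each customer_id to its churn label as of the given date."""
    return {a.customer_id: churn_label(a, as_of) for a in accounts}


# Made-up sample data, for illustration only.
accounts = [
    Account("c1", date(2017, 3, 1), cancelled_on=date(2018, 6, 15)),
    Account("c2", date(2018, 1, 10)),
    Account("c3", date(2018, 5, 5), cancelled_on=date(2019, 2, 1)),
]
labels = label_accounts(accounts, as_of=date(2018, 12, 31))
```

The cutoff date matters: a cancellation that happens after the cutoff (customer c3 here) should not be counted as churn in the training labels, or the label would use information that was not yet known at that date.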
