Wednesday, May 07, 2014

Data Provenance for Analytics

A short piece in the CACM that points to a larger paper; the larger paper requires a subscription. I am using this for some work on metadata uses. Provenance means tracking some entity, such as data, over time to understand where it came from and, crucially, the circumstances under which it was gathered. That history is often key to using the data for prediction with analytics. A nicely done paper that I am bookmarking for a larger effort.

Full Abstract:
" ... Assessing the quality or validity of a piece of data is not usually done in isolation. You typically examine the context in which the data appears and try to determine its original sources or review the process through which it was created. This is not so straightforward when dealing with digital data, however: the result of a computation might have been derived from numerous sources and by applying complex successive transformations, possibly over long periods of time.

As the quantity of data that contributes to a particular result increases, keeping track of how different sources and transformations are related to each other becomes more difficult. This constrains the ability to answer questions regarding a result's history, such as: What were the underlying assumptions on which the result is based? Under what conditions does it remain valid? What other results were derived from the same data sources? ... "
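
To make the questions in the abstract concrete for my own metadata notes, here is a minimal Python sketch of a provenance log. It is not taken from the paper. The names `Record`, `ProvenanceLog`, `ancestors`, `assumptions_behind` and `derived_from`, along with the sample datasets, are all hypothetical and only for illustration. Each result simply records its inputs, the transformation applied, and the assumptions made.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One dataset or result, with the metadata needed to trace its history."""
    name: str
    inputs: list = field(default_factory=list)       # names of upstream records
    transformation: str = ""                         # how this record was produced
    assumptions: list = field(default_factory=list)  # assumptions made at this step


class ProvenanceLog:
    """Hypothetical in-memory provenance store (illustration only)."""

    def __init__(self):
        self.records = {}

    def add(self, name, inputs=(), transformation="", assumptions=()):
        self.records[name] = Record(name, list(inputs), transformation, list(assumptions))

    def ancestors(self, name):
        """All upstream records that `name` was derived from, directly or indirectly."""
        found = set()
        for parent in self.records[name].inputs:
            found.add(parent)
            found |= self.ancestors(parent)
        return found

    def assumptions_behind(self, name):
        """Assumptions made at this step, followed by those inherited from upstream."""
        rec = self.records[name]
        collected = list(rec.assumptions)
        for parent in rec.inputs:
            collected.extend(self.assumptions_behind(parent))
        return collected

    def derived_from(self, source):
        """Names of every record that depends on `source`."""
        return sorted(n for n in self.records if source in self.ancestors(n))


# Made-up example data: two raw sources and three derived results.
log = ProvenanceLog()
log.add("survey_2013")
log.add("sensor_feed")
log.add("cleaned", inputs=["survey_2013"],
        transformation="drop incomplete rows",
        assumptions=["missing rows are random"])
log.add("forecast", inputs=["cleaned", "sensor_feed"],
        transformation="regression",
        assumptions=["trend is linear"])
log.add("summary", inputs=["cleaned"], transformation="aggregate")

print(log.assumptions_behind("forecast"))
print(log.derived_from("survey_2013"))
```

With this in place, "what were the underlying assumptions?" becomes a walk up the input chain, and "what other results were derived from the same data sources?" becomes a search for records that share an ancestor. A real system would also need timestamps and versioning, which this sketch leaves out.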

