Sunday, May 16, 2021

Cleaning Messy Data Tables

Looks efficient, at least. Check out the code, and note the use of Bayesian reasoning.

A system developed by researchers at the Massachusetts Institute of Technology (MIT) automatically cleans "dirty data" of things such as typos, duplicates, missing values, misspellings, and inconsistencies.

PClean combines background information about the database and possible issues with common-sense probabilistic reasoning to make judgment calls for specific databases and error types. Its repairs are based on Bayesian reasoning, which weighs prior knowledge against ambiguous data to determine the most probable correct answer, and it can provide calibrated estimates of its uncertainty.
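To make the Bayesian idea concrete, here is a toy sketch in Python. It is not PClean's code or API; the city names, the prior probabilities, the `typo_rate` parameter, and the edit-distance error model are all assumptions made up for illustration. It scores each candidate repair by prior times likelihood and normalizes the scores into a posterior, which doubles as an uncertainty estimate.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def posterior(observed: str, prior: dict, typo_rate: float = 0.1) -> dict:
    """Posterior over clean values given a possibly mistyped observation.

    Assumed error model (hypothetical): each character edit multiplies the
    likelihood by typo_rate, so closer candidates are more plausible.
    """
    scores = {
        candidate: p * typo_rate ** edit_distance(observed, candidate)
        for candidate, p in prior.items()
    }
    total = sum(scores.values())
    return {candidate: s / total for candidate, s in scores.items()}


# Hypothetical prior: how often each city appears in the clean data.
prior = {"Boston": 0.6, "Austin": 0.3, "Houston": 0.1}

post = posterior("Bostn", prior)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

The posterior probability attached to the chosen repair is what lets a cleaner of this kind report how sure it is, rather than silently overwriting the value.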

The researchers found that PClean, with just 50 lines of code, outperformed benchmarks in both accuracy and runtime.

From MIT News

