Showing posts with label Cleansing Data.

Sunday, May 16, 2021

Cleaning Messy Data Tables

It looks to be efficient, at least. Check out the code, and note the use of Bayesian reasoning.


ACM TechNews: System Cleans Messy Data Tables Automatically, by MIT News

A system developed by researchers at the Massachusetts Institute of Technology (MIT) automatically cleans "dirty data" of things such as typos, duplicates, missing values, misspellings, and inconsistencies.

PClean combines background information about the database and possible issues with common-sense probabilistic reasoning to make judgment calls for specific databases and error types. Its repairs are based on Bayesian reasoning, which applies probabilities based on prior knowledge to ambiguous data to determine the correct answer, and can provide calibrated estimates of its uncertainty.
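PClean's own code isn't shown here, but the Bayesian idea the article describes can be sketched in a few lines. The city names, the prior frequencies, and the similarity-based typo model below are all made up for illustration; PClean's actual model is far richer.

```python
from difflib import SequenceMatcher

# Hypothetical prior: how often each city appears in clean records.
PRIOR = {"Boston": 0.6, "Austin": 0.3, "Houston": 0.1}

def likelihood(observed, candidate):
    # Crude typo model: string similarity stands in for P(observed | candidate).
    return SequenceMatcher(None, observed, candidate).ratio()

def repair(observed):
    # Posterior is proportional to prior * likelihood; normalize, pick the best.
    scores = {c: PRIOR[c] * likelihood(observed, c) for c in PRIOR}
    total = sum(scores.values())
    posterior = {c: s / total for c, s in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior[best]

city, confidence = repair("Bostn")
```

The normalized posterior is what gives the "calibrated estimates of its uncertainty" the article mentions: the repair comes with a probability, not just an answer.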

The researchers found that PClean, with just 50 lines of code, outperformed benchmarks in both accuracy and runtime.

From MIT News

Sunday, December 13, 2020

Thinking about Approaching Tidy Data

Below is an intro on the concept. We laid out and used similar ideas; this organizes them well. The concept was first stated by Hadley Wickham in his paper. Hard to fully achieve because of context, but very useful.

What is Tidy Data?  

A must-know concept for Data Scientists. Outline by Benedict Neo in Towards Data Science.

Introduction

There’s a popular saying in Data Science that goes like this — “Data Scientists spend up to 80% of their time on data cleaning and 20% of their time on actual data analysis”. The origin of this quote goes back to 2003, in Dasu and Johnson’s book, Exploratory Data Mining and Data Cleaning, and it is still true to this day.

In a typical Data Science project, from importing your data to communicating your results, tidying your data is a crucial aspect in making your workflow more productive and efficient. ... 

The process of tidying data would thus create what’s known as tidy data, which is an ideal first formulated by Hadley Wickham in his paper. So my article will be largely a summarization, extracting the essence of the paper, if you will.

What is Tidy Data?

From the paper, the definition given is:

Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning).

To break down this definition, you have to first understand what structure and semantics mean. ..."
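A tiny sketch of what tidying looks like in practice, assuming pandas. The table here is the classic wide layout (values hiding in column headers) that Wickham's paper uses as a motivating example; the specific numbers are illustrative.

```python
import pandas as pd

# Messy layout: one row per country, one column per year.
# The year values are stuck in the column headers, not in a column.
messy = pd.DataFrame({
    "country": ["Afghanistan", "Brazil"],
    "1999": [745, 37737],
    "2000": [2666, 80488],
})

# Tidy layout: each variable is a column, each observation is a row.
tidy = messy.melt(id_vars="country", var_name="year", value_name="cases")
```

After the melt, `year` and `cases` are ordinary columns, so grouping, filtering, and plotting all work without any header gymnastics.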

Wednesday, April 08, 2020

Data Preparation is the Most Important Thing

A very good O'Reilly piece, not very technical.  Essential thoughts.

The unreasonable importance of data preparation
Your models are only as good as your data.
By Hugo Bowne-Anderson in O'Reilly

Edit note: We know data preparation requires a ton of work and thought. In this provocative article, Hugo Bowne-Anderson provides a formal rationale for why that work matters, why data preparation is particularly important for reanalyzing data, and why you should stay focused on the question you hope to answer. Along the way, Hugo introduces how tools and automation can help augment analysts and better enable real-time models.

In a world focused on buzzword-driven models and algorithms, you’d be forgiven for forgetting about the unreasonable importance of data preparation and quality: your models are only as good as the data you feed them. This is the garbage in, garbage out principle: flawed data going in leads to flawed results, algorithms, and business decisions. If a self-driving car’s decision-making algorithm is trained on data of traffic collected during the day, you wouldn’t put it on the roads at night. To take it a step further, if such an algorithm is trained in an environment with cars driven by humans, how can you expect it to perform well on roads with other self-driving cars? Beyond the autonomous driving example described, the “garbage in” side of the equation can take many forms—for example, incorrectly entered data, poorly packaged data, and data collected incorrectly, more of which we’ll address below.

When executives ask me how to approach an AI transformation, I show them Monica Rogati’s AI Hierarchy of Needs, which has AI at the top, and everything is built upon the foundation of data (Rogati is a data science and AI advisor, former VP of data at Jawbone, and former LinkedIn data scientist):  .... '

Saturday, August 10, 2019

Skeptical About Data

Or more precisely, about common errors in data. A good overview of what data errors can look like. The problem is we rarely look at this closely enough. And data is also rarely examined over time, after the model is built, to catch creeping error. You should be skeptical about data. But without the data, where would you be?

I’m a data scientist who is skeptical about data
By Andrea Jones-Rooy, July 24, 2019, in Quartz
Professor of data science, NYU

After millennia of relying on anecdotes, instincts, and old wives’ tales as evidence of our opinions, most of us today demand that people use data to support their arguments and ideas. Whether it’s curing cancer, solving workplace inequality, or winning elections, data is now perceived as being the Rosetta stone for cracking the code of pretty much all of human existence.

But in the frenzy, we’ve conflated data with truth. And this has dangerous implications for our ability to understand, explain, and improve the things we care about.   .... " 

Saturday, June 01, 2019

Managing Bad Data

Self-correcting data is a good idea, but I want to see some examples. It's similar to removing outliers: it depends on the context, and often on the metadata involved. I have seen it improperly used.

Researchers Develop AI Tool Better Able to Identify Bad Data
University of Waterloo News

An international team of researchers led by Alireza Heidari and Ihab Ilyas at the University of Waterloo in Canada has developed an artificial intelligence-powered system to manage data quality. The HoloClean tool sifts out bad data and corrects errors prior to processing. The new system also can automatically generate bad examples without tainting source data, so the system can learn to identify and correct errors on its own. Once HoloClean is trained, it can independently differentiate between errors and correct data, and determine the most likely value for missing data if an error exists. Ilyas said the work “deviates from the old way of manually trying to clean the data, which was expensive, didn’t scale, and does not meet the current needs for cleaning the data.” ... '
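HoloClean itself is a sophisticated probabilistic system, but one core intuition from the summary, determining the most likely value from the surrounding data, can be sketched simply. The records, field names, and majority-vote rule below are all hypothetical stand-ins, not HoloClean's actual method.

```python
from collections import Counter

# Hypothetical records: zip code should functionally determine city.
records = [
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambridge"},
    {"zip": "02139", "city": "Cambrdige"},   # typo
    {"zip": "60601", "city": "Chicago"},
]

def most_likely_city(zip_code, records):
    # Treat the majority value among co-occurring rows as the likely truth.
    votes = Counter(r["city"] for r in records if r["zip"] == zip_code)
    return votes.most_common(1)[0][0]

def clean(records):
    # Replace each city with the most likely value for its zip code.
    return [{**r, "city": most_likely_city(r["zip"], records)} for r in records]

cleaned = clean(records)
```

The real system learns such dependencies and error patterns from generated bad examples rather than having them hard-coded, which is what lets it scale past hand-written rules.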

Monday, August 06, 2018

Why The Data can be Very Wrong

Colleague Kaiser Fung on why the data can be very wrong, or at least very fragile. His site is required reading.

Do we really know what the data are measuring? Hint: no
The tech industry has turned us into an omni-surveillance society.

Any shop that uses modern, digital, connected technologies is probably collecting, storing and selling your data to someone. The people receiving and analyzing the data form a much larger set than those collecting the data. These data analysts typically ingest the data as they are, and write software that controls this or that aspect of our lives. However, such data are riddled with inaccuracies and bias, which is a form of inaccuracy.

While in Vancouver last week, I encountered the following two scenarios that illustrate the fragility of data collection.  .... " 

Saturday, October 14, 2017

Fake Data

My colleague Kaiser Fung talks fake data.  Good thoughts and pointers to examples and resources. Read his whole article at the link.

Here is a problem staring many digital/Web/social media analysts in the face today: what if you are told that the majority of the data you have been dutifully reporting, analyzing and (gasp!) modeling are fake data?

By fake data, I mean, useless numbers that have no bearing on reality: visits to websites that never happened, clicks on ads by hired hands, clicks on ads by bots, clicks on ads that are buried layers deep invisible to any humans, video "views" that result from automatically playing clips, video "views" that last one second, ad reach (i.e. number of people who have seen the ad) that exceeds Census counts, reviews planted by hired hands, etc. etc.  .... " 
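The patterns Fung lists (one-second "views", bot clicks, and so on) are the kind of thing an analyst can filter for before reporting. A minimal sketch, with an entirely hypothetical event schema; real detection is much harder, since sophisticated fakes don't label themselves:

```python
# Hypothetical event log; the field names are illustrative, not from the article.
events = [
    {"type": "video_view", "duration_s": 0.8, "agent": "browser"},
    {"type": "video_view", "duration_s": 95.0, "agent": "browser"},
    {"type": "ad_click", "duration_s": None, "agent": "bot"},
    {"type": "ad_click", "duration_s": None, "agent": "browser"},
]

def looks_fake(e):
    # Known-bot agents and sub-second "views" match patterns Fung describes.
    if e["agent"] == "bot":
        return True
    if e["type"] == "video_view" and e["duration_s"] is not None \
            and e["duration_s"] < 1.0:
        return True
    return False

real = [e for e in events if not looks_fake(e)]
```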

Wednesday, August 09, 2017

Data Fracking Dark Data

Found this interesting: the idea of 'Data Fracking' ... a means of finding the right data. A better way to find and use your data assets? Improving access to the right data, broadly using the oil-drilling method as metaphor. Developed by a company called Datumize, through the use of the right Dark Data, which is unused data inside or outside your company.

" .... Effectively exploiting this resource will require new techniques.  Data Fracking™ is a new approach focused on the discovery, collection, integration and deployment of Dark Data so it can be collected, refined and made available for use to enhance your operational and decision making processes.  Data Fracking™ enables the discovery, collection, integration and utilization of this previously untapped resource. .... "  We have not tried this as yet, but plan to follow up. There is a piece on this in DSC, and a white paper on the method.
  

Monday, July 24, 2017

Roomba Maps for Smart Homes

This reminds me of work we did with test Roombas.  We were interested in how they compared with human cleaning process and also usage of possible cleansing liquids, in lab settings.  But I can see how the resulting maps might be used for Smart Home navigation and understanding.  I recently read that the still unreleased Kuri robot automatically builds maps of its home space.  As expected, this has privacy implications, mentioned at the link.

Your Roomba’s digital map of your house could be for sale, by Eric David in SiliconANGLE

 Robotic vacuums like the Roomba have made it easy for even the laziest people to keep their floors clean, and now Roomba maker iRobot Corp. has found another use for the little robots: building maps for your smart home devices.

Some of the smarter Roomba models build maps of your home to make their cleaning paths more efficient. Over time, they learn the locations of walls, doorways, lamps, furniture and so on, which eventually allows them to clean your home without bumping into things over and over. The idea behind this mapping system is to reduce the time it takes for your Roomba to finish its job, but iRobot Chief Executive Colin Angle says the mapping data is useful for more than just speeding up your vacuum. .... " 

Saturday, June 18, 2016

Clean Data for Analytics

I have recently been involved in the process of addressing data quality, key for any analytics. A general look. One other way to think of this is to use semantic data management tools to create precise statements of the knowledge that sits behind the data. We examined that with Stanford KSL a while ago; that work ultimately evolved into the Protégé package. Most recently I took a dive into the impressive TopQuadrant package. What other large entities are doing this?

Sunday, March 20, 2016

Faster Machine Unlearning

In the CACM: 
" ... Researchers at Lehigh and Columbia University have developed a machine-learning method that involves making such systems forget the data's "lineage" so they can remove the data and undo its effects and allow future operations to run as if the data never existed. Although the concept of "machine unlearning" is well-established, the researchers have developed a way to do it faster and more effectively than can be done using current methods. Effective machine-unlearning techniques can help improve the privacy and security of raw data. ... " 
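The excerpt gives no code, and the Lehigh/Columbia method itself is not shown here, but the "undo a data point's effects" idea is easy to see in a model with summation form, where each point's contribution (its lineage, loosely) can be subtracted exactly. A toy sketch using a running mean:

```python
class UnlearnableMean:
    """A mean that can 'forget' a point exactly by subtracting its contribution."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def learn(self, x):
        self.total += x
        self.count += 1

    def unlearn(self, x):
        # Undo the point's effect, as if it had never been seen,
        # without retraining on everything else.
        self.total -= x
        self.count -= 1

    def value(self):
        return self.total / self.count if self.count else 0.0

m = UnlearnableMean()
for x in (1, 2, 3):
    m.learn(x)
m.unlearn(3)
```

The naive alternative, retraining from scratch on the remaining data, gives the same answer but costs a full pass; the speed gain from exploiting lineage is the point of the research.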

Wednesday, October 14, 2015

Deception in a Data Driven World

Lots in the news recently about storytelling with data.  Which led me to memories of stories that did not match the actual reality of the data. So a good story is not necessarily good science.  Confirmation bias in storytelling.  And beyond that, even the data itself can lie, for all sorts of reasons.  In Fusion.net. 

Thursday, October 08, 2015

Data Preparation

In O'Reilly.  Just downloaded.   Basic points on data preparation.

Translating data into knowledge  by Federico Castanedo
Best practices for data preparation — what you need to know before data analysis can begin.
Download “Data Preparation in the Big Data Era,” a new free report to help you manage the challenges of data cleaning and preparation.  .... " 

Sunday, August 09, 2015

Defining Data Quality

A good set of definitions worth considering.

" .. To tackle any problem in a systematic and effective way, you must be able to break it down into parts. After all, understanding the problem is the first step to finding the solution.  From there, you can develop a strategic battle plan.

When starting a data quality improvement program, it’s not enough to count the number of records that are incorrect, or duplicated, in your database. Quantity only goes so far. You also need to know what kind of errors exist to allocate the correct resources.

In this interesting blog by Jim Barker, the different types of data quality are broken down into two parts. In this article, we’ll look closely at defining these ‘types’, and how we can use this to our advantage when developing a budget. ... "
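Breaking errors down by kind, rather than reporting one total, is straightforward to automate. A minimal sketch; the sample rows, required fields, and valid ranges are hypothetical:

```python
def error_profile(rows, required, valid_ranges):
    """Count errors by kind so fixes can be budgeted separately."""
    profile = {"missing": 0, "out_of_range": 0, "duplicate": 0}
    seen = set()
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            profile["duplicate"] += 1
        seen.add(key)
        for field in required:
            if row.get(field) in (None, ""):
                profile["missing"] += 1
        for field, (lo, hi) in valid_ranges.items():
            v = row.get(field)
            if v is not None and not (lo <= v <= hi):
                profile["out_of_range"] += 1
    return profile

rows = [
    {"name": "Ann", "age": 200},   # impossible age
    {"name": "", "age": 30},       # missing name
    {"name": "", "age": 30},       # exact duplicate of the row above
]
profile = error_profile(rows, required=["name"], valid_ranges={"age": (0, 120)})
```

A profile like this turns "our data is dirty" into a list of line items, which is exactly what a budget discussion needs.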

Monday, July 27, 2015

Telling Good Data from Bad

From Forbes. Hardly complete, but a reasonable starting point. It can mostly be summarized as: understand your sources; ask for certification when available; ask the right questions (which often depend on the specific business context); and look for red flags in the data itself. Also mentioned: the likelihood of confirmation bias creeping into results. I almost always found some errors of this type. I will add that this kind of bias can mean either data that supports your own theories, or data that will best sell to your decision clients.

Wednesday, July 15, 2015

Inspect Your Data

True, and for very important data you should have the expert in, or owner of, the data make sure it, or any derivative of the data, is correct. In ClickZ:

" ... Something as small as a single missing or broken tag can result in missing data. As a result, marketers need to really inspect data, rather than simply looking at the aggregated report.
With the seemingly unstoppable growth of advertising technology and the rise of automated media-buying systems, we have developed an incredible reliance on data. Nobody could be more excited than I am about all the interest and power being given to the massive amounts of data available to marketers today. However, we need to start paying the same level of attention to the quality of our data sources and collection systems.    .."
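The broken-tag problem is a good example of why inspecting the row-level data matters: an aggregated total can look plausible while a whole day is silently missing. A small sketch with a hypothetical daily pageview series:

```python
from datetime import date

# Hypothetical daily pageview counts; a broken tag zeroes out one day.
daily = {
    date(2015, 7, 1): 1200,
    date(2015, 7, 2): 1180,
    date(2015, 7, 3): 0,      # tag broke: data silently missing
    date(2015, 7, 4): 1210,
}

def suspicious_days(daily, min_expected=1):
    # The aggregate sum still looks plausible; per-day inspection finds the hole.
    return [d for d, n in sorted(daily.items()) if n < min_expected]

bad = suspicious_days(daily)
```

A real check would compare each day against a baseline or forecast rather than a fixed floor, but even this crude version catches the failure the aggregated report hides.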

Monday, July 06, 2015

Supply Chain Planning Garbage in Equals Garbage Out

A very interesting piece, which addresses a problem we often encountered, one that was too often not handled well enough. Every enterprise should have a handle on cleansing and on trends in its data. Read the whole thing for a solution view.

By Cheryl Wiebe,  Partner, Applied Analytics, Manufacturing at Teradata

Supply Chain Planning: Garbage In Garbage Out...
Lean manufacturing has been massively adopted over the last 10 years. Lean is driving attention to the notion of supply chain variability and accuracy of planning factors. Kanban, or pull style, says that the impetus for manufacturing activity (e.g., arrival of starting kit to begin assembly) is the arrival of the required materials at your station. This means if we have insufficient material to kit the line, the line goes down. 

What does this mean? Lean assumes your supply chain is infallible. This forces suppliers to plan JIT warehouses at the beginning of the line. A certain disk drive manufacturer (supplier A) had to supply 18,000 drives at the JIT warehouse nearby the customer's PC assembly plant, for example. If they dipped below 18k, the customer would not pull from supplier A, and would pull instead from supplier B, the competitor. This caused supplier A to ensure they always had 25k drives (excess inventory). A shortage of inventory from suppliers will bring down a Lean manufacturing line. The supply problem has now moved into the JIT warehouse. We have moved the problem upstream. ... " 
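The 18k-threshold / 25k-buffer gap in the example above is just threshold plus draw over replenishment lead time. The 18,000 figure is from the article; the daily draw and lead time below are made-up numbers chosen to illustrate how a 25k buffer could arise:

```python
def buffer_needed(threshold, daily_draw, replenish_days):
    """Stock the supplier must hold so inventory never dips below the
    customer's pull threshold before a replenishment arrives.
    Illustrative only: the 18,000-drive threshold comes from the article;
    the draw and lead-time figures here are assumptions."""
    return threshold + daily_draw * replenish_days

# If the customer pulls 1,750 drives/day and replenishment takes 4 days,
# the supplier must keep about 25,000 on hand.
stock = buffer_needed(18_000, 1_750, 4)
```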

Thursday, July 02, 2015

Examining Data Preparation Tools

Better tools are needed, likely including cognitive and contextual knowledge capabilities, along with more attention to metadata needs. In O'Reilly:

Why data preparation frameworks rely on human-in-the-loop systems
The O'Reilly Data Show Podcast: Ihab Ilyas on building data wrangling and data enrichment tools in academia and industry. ... " 

Wednesday, October 29, 2014

Watson Analytics Preps the Data

In InfoWorld: Looking forward to seeing this. It is key to getting analytics done right. " ... "Often what business users do is rely on a data scientist or business analysts to help them, which can be too slow, or those people may not be available. So they make a decision without any analysis," said Eric Sall, IBM vice president of marketing for business analytics. ... ".  Cleansing, though, can be a subjective thing, removing some of the useful essences of knowledge. So I want to see the details here before handing over the data.

Monday, September 01, 2014

Bad Data Handbook

I like the book title's premise, having just been involved in a project that included dealing with bad data, however it is defined. Also to be considered: data and metadata that is not connected, or mis-connected, to the data in use. Or, as the book states: "From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it. ...  ".  I don't have the book, so I can't say it delivers. More about the book in Analyticbridge.