
Tuesday, March 09, 2021

Using Synthetic Data

Worked with elements of this idea for some time in the enterprise. In fact, after a while we learned that it's often very useful to have synthetic data alongside 'real' data; it can be used to tease out some less-than-common patterns of interest and use. New methods make synthetic data much easier to develop.

Synthetic Data: Even Better than the Real Thing?  By Karen Emslie  Commissioned by CACM Staff  March 9, 2021


Our lives are inextricably intertwined with data. It is fundamental in software development, artificial intelligence (AI) training, and product testing; it is deployed across industry, social media, and in decision-making. According to a 2020 report by market research firm International Data Corporation, "More than 59 zettabytes (ZB) of data will be created, captured, copied, and consumed in the world this year."

This is a mind-boggling amount of data, but it is not always available to those who want to make use of it. Innovators working on emerging technologies, such as autonomous vehicles, may find relevant data rare and prohibitively expensive. Access for developers is often limited, due to confidentiality.

Synthetic data, generated from simulations based on real data, has emerged as an answer. It is not a new concept, but recent developments have boosted its accuracy and usability. Add societal issues such as privacy, the General Data Protection Regulation (GDPR), and even the impact of the Covid-19 pandemic on data gathering and access, and the arguments on behalf of synthetic data appear even stronger.

Synthetic data can be useful in any real data-based context: researchers have demonstrated the use of synthetic data in object detection, in crowd counting, in machine learning for healthcare, and even in marine science for the detection of Western rock lobsters.

A group at the Massachusetts Institute of Technology (MIT), led by principal research scientist and Data-to-AI group leader Kalyan Veeramachaneni, has launched an updated set of open-source tools for producing synthetic data. The work is part of the Synthetic Data Vault (SDV), an online ecosystem that allows users to create synthetic data from their own data sources.

Veeramachaneni first experimented with synthetic data back in 2012, to tackle data access bottlenecks in an online learning platform. He realized it could also provide a solution to a problem he had encountered in industry during conversations about data access for machine learning (ML).

"All those conversations come to a grinding halt when we say, 'How can we get access to the data? For that we have to go through this process, and then what do we do next?' It takes three to six months to actually get access to the data," explained Veeramachaneni.

His group set out to build general-purpose tools that would allow anyone to create synthetic data from real data. By 2016, they had succeeded in creating statistical models using datasets from Kaggle, and sampling from those to create synthetic data.

The next step was to take a "much, much more comprehensive" approach by simultaneously creating algorithms, software, and tools that could address any enterprise data type. The result was the Synthetic Data Vault.

The researchers use three types of modelling techniques to generate synthetic data: a classic technique based on Bayesian networks, a mathematical tool from economics called Copulas, and deep learning (DL).
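To make the copula idea concrete, here is a minimal sketch of a two-column Gaussian copula using only the Python standard library. It is an illustration under stated assumptions, not the SDV implementation: the function names (to_normal_scores, sample_gaussian_copula, and so on) and the age/income example data are hypothetical. The sketch converts each column to normal scores by rank, measures the correlation between those scores, draws correlated normal pairs, and maps them back through each column's empirical distribution.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def to_normal_scores(column):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def empirical_quantile(sorted_column, u):
    """Return the observed value at probability u (inverse empirical CDF)."""
    index = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[index]


def sample_gaussian_copula(col_a, col_b, n_samples, seed=0):
    """Sample synthetic (a, b) pairs that keep the columns' dependence."""
    rng = random.Random(seed)
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    sorted_a, sorted_b = sorted(col_a), sorted(col_b)
    rows = []
    for _ in range(n_samples):
        # Draw a pair of standard normals with correlation rho
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        # Convert to uniforms, then back to each column's own scale
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(sorted_a, u1),
                     empirical_quantile(sorted_b, u2)))
    return rows


# Hypothetical "real" data: income loosely increases with age
data_rng = random.Random(42)
ages = [data_rng.uniform(20, 65) for _ in range(500)]
incomes = [1000 * a + data_rng.gauss(0, 5000) for a in ages]
synthetic = sample_gaussian_copula(ages, incomes, 1000)
```

Because this toy version samples each column from its observed values, it preserves the marginal distributions and the overall dependence, but a production tool would typically fit smooth distributions and handle many columns and data types.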

"Deep learning-based synthetic data generation started for images, that's where you see all those deep fakes, and there was a very popular technique called generative adversarial networks (GANs)," said Veeramachaneni.

The MIT group adapted GAN methods used on pixel-based images to work on tabular data. The trick is to generate realistic-looking data, said Veeramachaneni, but it is a fine balancing act, "You don't want it to be so real that it can actually enable you to detect some personal information about someone if it belongs to humans."
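One simple way to picture the "not too real" side of that balance is to check whether any synthetic row sits on top of, or very close to, a real row. The sketch below is an assumption-laden illustration and not a method described in the article: the function names and the distance threshold are hypothetical, rows are assumed to be numeric tuples on comparable scales, and a real privacy evaluation would need far more than a nearest-neighbor check.

```python
import math


def nearest_real_distance(synthetic_row, real_rows):
    """Euclidean distance from a synthetic row to its closest real row."""
    return min(math.dist(synthetic_row, real_row) for real_row in real_rows)


def flag_too_close(real_rows, synthetic_rows, threshold):
    """Return synthetic rows lying within `threshold` of any real row.

    Exact copies have distance 0, so they are always flagged.
    """
    return [row for row in synthetic_rows
            if nearest_real_distance(row, real_rows) <= threshold]


# Hypothetical example: two real records, three synthetic candidates
real = [(1.0, 2.0), (3.0, 4.0)]
candidates = [(1.0, 2.0), (10.0, 10.0), (3.0, 4.05)]
flagged = flag_too_close(real, candidates, threshold=0.1)
```

Flagged rows could then be dropped or regenerated before the synthetic dataset is shared.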

The latest tools in the SDV ecosystem support scalability, testing, and interaction with data science teams. To prove the functionality of algorithms and software, users need to come up with edge cases. As Veeramachaneni explained, "Slowly and steadily, we have seen a lot of people coming to it, using it, telling us where it's working, where it's not working, and that's essentially driving us to make it much better."

When the Covid-19 pandemic shut down MIT's Data-to-AI labs, the group spotted another use case. Sensitive data is often housed on one or two computers. Veeramachaneni said the team had to work out how to keep their own machines up and running: "Then we were like, 'wouldn't it help to just have synthetic data, so that everyone can have their data on their local machine at home?'"

Privacy and access make a solid case for synthetic data use, but there are others ...
