Friday, March 10, 2023

Back to Synthetic Data

Haven't talked about this for some time; it's still interesting.

What is Synthetic Data? The Good, the Bad, and the Ugly

Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus sharing them can endanger the privacy of users and organizations.

A possible alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data — more precisely, having similar statistical properties.
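To make "similar statistical properties" concrete, here is a deliberately minimal sketch (not any particular production method): fit an independent Gaussian to each column of the real data and sample synthetic rows from it. Real synthetic-data generators model the joint structure of the data; this toy version preserves only per-column means and variances. The function name `synthesize_gaussian` and the example columns are illustrative, not from the post.

```python
import numpy as np

def synthesize_gaussian(real, n_samples, seed=0):
    """Toy synthesizer: fit an independent Gaussian to each column of the
    real data and draw synthetic rows from it. A real generator would
    model correlations between columns too; this keeps only marginals."""
    rng = np.random.default_rng(seed)
    mu = real.mean(axis=0)       # per-column mean of the real data
    sigma = real.std(axis=0)     # per-column standard deviation
    return rng.normal(mu, sigma, size=(n_samples, real.shape[1]))

# Hypothetical "real" data: two numeric columns (say, height and weight).
rng = np.random.default_rng(42)
real = rng.normal([170.0, 70.0], [10.0, 12.0], size=(5000, 2))

synthetic = synthesize_gaussian(real, 5000)
# The synthetic rows are not real individuals, but simple statistics match:
print(real.mean(axis=0), synthetic.mean(axis=0))
```

No synthetic row corresponds to a real person, yet aggregate statistics computed on the synthetic set come out close to those of the original.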

So how do you generate synthetic data? What is that useful for? What are the benefits and the risks? What are the fundamental limitations and the open research questions that remain unanswered?

All right, let’s go!

How To Safely Release Data?

Before discussing synthetic data, let’s first consider the “alternatives.”

Anonymization: Theoretically, one could remove personally identifiable information from a dataset before sharing it. However, in practice, anonymization fails to provide realistic privacy guarantees because a malevolent actor often has auxiliary information that allows them to re-identify anonymized records. For example, when Netflix released de-identified movie ratings (as part of a challenge seeking better recommendation systems), Arvind Narayanan and Vitaly Shmatikov de-anonymized a large chunk of them by cross-referencing them with public information on IMDb.

Aggregation: Another approach is to share aggregate statistics about a dataset. For example, telcos can provide statistics about how many people are in some specific locations at a given time — e.g., to assess footfall and decide where one should open a new store. However, this is often ineffective too, as the aggregates can still help an adversary learn something about specific individuals.
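A classic illustration of how aggregates leak is a differencing attack: two innocuous-looking totals, one computed over the whole group and one over the group minus a single person, pin down that person's value exactly. The salary figures below are invented for the sketch.

```python
# Hypothetical payroll data (illustrative values only).
salaries = {"alice": 82000, "bob": 61000, "carol": 75000}

# Two aggregate queries that each look harmless on their own:
total_all = sum(salaries.values())                                  # everyone
total_minus_carol = sum(v for k, v in salaries.items() if k != "carol")

# Differencing the two aggregates reveals Carol's exact salary.
carol_salary = total_all - total_minus_carol
print(carol_salary)  # 75000
```

Neither aggregate on its own identifies anyone; it is the *combination* of released statistics that betrays the individual.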

Differential Privacy: More promising attempts provide access to statistics computed from the data while adding noise to each query's response, guaranteeing differential privacy. However, this approach generally lowers the dataset's utility, especially on high-dimensional data. Additionally, answering an unlimited number of non-trivial queries on a dataset can eventually reveal the whole dataset, so this approach needs to keep track of the privacy budget over time.
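The standard way to add that noise to a counting query is the Laplace mechanism: a count has sensitivity 1 (one person changes it by at most 1), so adding Laplace noise with scale 1/ε yields ε-differential privacy. The sketch below also shows the budget problem the paragraph mentions — splitting a total budget across several queries (naive sequential composition) makes each answer noisier. The function name and the numbers are illustrative assumptions, not from the post.

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a count under the Laplace mechanism. A counting query has
    sensitivity 1, so noise with scale 1/epsilon gives epsilon-DP."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

rng = np.random.default_rng(0)
true_count = 1234        # e.g. people observed in a location at a given time
epsilon_total = 1.0      # overall privacy budget for this dataset
n_queries = 5

# Naive sequential composition: split the budget evenly across queries,
# so each individual answer gets less budget and hence more noise.
eps_per_query = epsilon_total / n_queries
answers = [laplace_count(true_count, eps_per_query, rng) for _ in range(n_queries)]
print(answers)
```

Each released answer is perturbed, and the more queries you allow, the smaller the per-query budget and the noisier (less useful) each answer becomes — which is exactly the utility/budget tension described above.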
