What is synthetic data? And why synthesize large datasets?

Storing vast amounts of data comes with its own risks and challenges. Synthetic data is one of the options in the toolkit to address them. This article looks at it from several angles, including data confidentiality, data retention and de-identification.

Pablo Iorio
Apr 12, 2021

Census data is unreliable. There are several reasons: missing data, classification difficulties, and erroneous or misreported data, among others. On top of these issues, regulatory requirements mandate that anonymity be preserved when analysing the dataset.

In 1993, Donald Rubin, co-author of the book Statistical Analysis with Missing Data, had the original idea of fully synthetic data for privacy-preserving statistical analysis. He designed the approach to synthesize the Census long-form responses for the short-form households, and then released samples that did not include any actual long-form records, thereby preserving the anonymity of the households. Later that year, the idea of partially synthetic data was introduced by J. A. Little, who used it to synthesize the sensitive values on the public use file. [1]

So, what is synthetic data anyway? And how does it differ from production data?

Production data is "information that is persistently stored and used by professionals to conduct business processes."

Meanwhile, synthetic data is "any production data applicable to a given situation that is not obtained by direct measurement".

In simple terms, synthetic data is data generated by computers under certain rules.
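
To make that definition concrete, here is a minimal sketch of rule-based generation in Python. The field names, value ranges and distributions are illustrative assumptions, not taken from any real dataset.

```python
# A minimal sketch of rule-based synthetic data generation using only the
# Python standard library. All field names and value ranges are hypothetical.
import random
import string

def synthetic_customer():
    """Generate one fake customer record under simple, explicit rules."""
    return {
        # Random 8-character identifier; carries no link to any real person
        "customer_id": "".join(random.choices(string.ascii_uppercase + string.digits, k=8)),
        # Age drawn uniformly from a plausible adult range
        "age": random.randint(18, 90),
        # Spend drawn from a log-normal distribution to mimic a skewed real-world variable
        "annual_spend": round(random.lognormvariate(mu=6.0, sigma=0.8), 2),
        # Category chosen with fixed weights, as a real dataset might exhibit
        "segment": random.choices(["retail", "small_business", "enterprise"],
                                  weights=[0.7, 0.2, 0.1])[0],
    }

if __name__ == "__main__":
    for row in (synthetic_customer() for _ in range(5)):
        print(row)
```

The "rules" here are nothing more than the chosen distributions and weights; more realistic generators fit those rules to statistics of the real data instead of hard-coding them.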

Data confidentiality

Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data.

Synthetic data protects the privacy of users, and it can also be useful for testing in lower environments. A common practice is to refresh databases from Production into lower environments; the challenge is to preserve anonymity by synthesizing the data on the way through. In many cases this needs to run frequently, whenever new environments are created or a new round of testing starts.
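
As an illustration, a refresh job might replace direct identifiers with synthetic values before the extract ever leaves Production. The sketch below assumes a pandas DataFrame with hypothetical column names (name, email, date_of_birth); a real pipeline would also need to handle referential integrity and repeatability across tables.

```python
# Hedged sketch: replace direct identifiers with synthetic values before
# copying a Production extract into a test environment.
# Column names ("name", "email", "date_of_birth") are hypothetical.
import hashlib
import pandas as pd

def anonymise_for_lower_env(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Replace names with opaque synthetic labels
    out["name"] = ["user_" + format(i, "06d") for i in range(len(out))]
    # Derive stable but non-reversible pseudonymous emails from a hash of the original
    out["email"] = [
        hashlib.sha256(e.encode()).hexdigest()[:12] + "@example.test"
        for e in df["email"]
    ]
    # Coarsen dates of birth to the year to reduce re-identification risk
    out["date_of_birth"] = pd.to_datetime(df["date_of_birth"]).dt.year
    return out
```

Because the transformation is deterministic, the job can be re-run on every refresh and still produce consistent test data.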

Data retention

Data stored to conduct business activity must not be kept longer than necessary and must be disposed of appropriately. One technique to meet this obligation is to de-identify the data or synthesize it.

So, why de-identify data?

De-identification means that a person’s identity is no longer apparent, or cannot reasonably be ascertained, from the information or data. It helps meet Privacy Act obligations while building trust in your data governance practices.

De-identification involves two steps. The first is the removal of direct identifiers. The second is taking one or both of the following additional steps: the removal or alteration of other information that could potentially be used to re-identify an individual, and/or the use of controls and safeguards in the data access environment to prevent re-identification.
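
A minimal sketch of those two steps on a tabular dataset, assuming pandas and hypothetical column names: step one drops the direct identifiers, step two generalises quasi-identifiers that could still single someone out. (The third kind of control, safeguards in the access environment, happens outside the data itself.)

```python
# Hedged sketch of two-step de-identification on a DataFrame.
# The column classification below is hypothetical; a real project would base
# it on a proper data inventory and risk assessment.
import pandas as pd

DIRECT_IDENTIFIERS = ["full_name", "email", "phone"]

def de_identify(df: pd.DataFrame) -> pd.DataFrame:
    # Step 1: remove direct identifiers outright
    out = df.drop(columns=DIRECT_IDENTIFIERS, errors="ignore")
    # Step 2: alter other information that could be used to re-identify someone
    if "postcode" in out:
        out["postcode"] = out["postcode"].astype(str).str[:2] + "XX"  # generalise location
    if "age" in out:
        out["age"] = (out["age"] // 10) * 10                          # bucket ages by decade
    return out
```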

And how do you create synthetic data?

There are different techniques to synthesize data for different cases.

  • SMOTE: Synthetic Minority Over-sampling Technique. This is useful when your dataset is incomplete or imbalanced (see the code sketch after this list).
  • ADASYN: Adaptive Synthetic sampling method. Similar to SMOTE, but it adapts to where data is scarce, generating more samples for the minority examples that are hardest to learn.
  • Data Augmentation. In this technique, we transform existing records to create more cases. This is especially useful for training Machine Learning models.
  • Variational Auto Encoder. Encoding means converting data into another form. A variational auto encoder learns to encode data into a latent distribution and to decode from it, so new synthetic records can be generated by sampling that distribution.
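
As an example of the first two techniques, the sketch below rebalances a toy imbalanced dataset with the imbalanced-learn package (assuming it and scikit-learn are installed); the class ratios are illustrative only.

```python
# Sketch: oversampling an imbalanced classification dataset with SMOTE and ADASYN.
# Assumes the imbalanced-learn and scikit-learn packages are installed.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, ADASYN

# Build a toy dataset where one class has only ~5% of the samples
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("original:", Counter(y))

# SMOTE interpolates new minority samples between existing neighbours
X_smote, y_smote = SMOTE(random_state=42).fit_resample(X, y)
print("after SMOTE:", Counter(y_smote))

# ADASYN focuses new samples on the minority points that are hardest to learn
X_ada, y_ada = ADASYN(random_state=42).fit_resample(X, y)
print("after ADASYN:", Counter(y_ada))
```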

Other uses: machine learning

The use of synthetic data to train machine learning models is growing rapidly. Some benefits are:

  • After the initial generation pipeline is in place, producing new synthetic datasets becomes cheap and repeatable
  • Filling in under-represented categories by hand is almost impossible; synthetic sampling makes it feasible
  • It can substitute for sensitive datasets that cannot be shared directly

Conclusion

Some years ago, Big Data was the biggest trend. Nowadays, we know that accumulating large amounts of data comes with its own risks: bigger datasets are bigger bounties for attackers, a trade-off worth keeping in mind.

Achieving a balance between data utility and compliance is challenging. The more value we squeeze out of the data, the more compliance challenges we face.

Synthetic data is one way to achieve this balance: another tool in the kit for mitigating the risk of data breaches in traditional datasets and for augmenting data to train machine learning models.

References

  • [1] Synthetic data — Wikipedia
  • [2] Synthetic data generation by Cem Dilmegani

Disclaimer

This is a personal article. The opinions expressed here represent my own and not those of my employer.
