In synthetic data generation, we typically create a model of our real (or ‘observed’) data, and then use this model to generate synthetic data. The observed data is usually compiled from real-world experience, such as measurements of the physical characteristics of irises, or details about individuals who have defaulted on credit or developed some medical condition. We can think of the observed data as having come from some ‘parent distribution’: the true underlying distribution from which the observed data is a random sample. Of course, we never know this parent distribution; it must be estimated, and that is the purpose of our model.
But if our model can produce synthetic data that could be considered a random sample from the same parent distribution, then we have hit the jackpot: the synthetic data will have the same statistical properties and patterns as the observed data (fidelity); it will be just as useful when put to tasks such as regression or classification (utility); and, because it is a random sample, there is no risk of it identifying the observed data (privacy). But how do we know whether we have met this elusive goal?
In the first part of this story, we will conduct some simple experiments to gain a better understanding of the problem and motivate a solution. In the second part, we will evaluate the performance of a variety of synthetic data generators on a collection of well-known datasets.
Part 1 — Some Simple Experiments
Consider the following two datasets and try to answer this question: could they be random samples from the same parent distribution?
The datasets clearly display similar statistical properties, such as marginal distributions and covariances. They would also perform similarly on a classification task in which a classifier trained on one dataset is tested on the other.
But suppose we were to plot the data points from both datasets on the same graph. If the datasets are random samples from the same parent distribution, we would intuitively expect the points from one dataset to be interspersed with those from the other in such a way that, on average, points from one set are as close to (or ‘as similar to’) their closest neighbors in that set as they are to their closest neighbors in the other set. However, if one dataset is a slight random perturbation of the other, then points from one set will be more similar to their closest neighbors in the other set than they are to their closest neighbors in the same set. This leads to the following test.
The Maximum Similarity Test
For each instance in a dataset, we find its maximum similarity to the other instances in the same dataset (its maximum intra-set similarity) and its maximum similarity to the instances in the other dataset (its maximum cross-set similarity). If the two datasets are random samples from the same parent distribution, the average maximum cross-set similarity should be approximately equal to the average maximum intra-set similarity. If the average maximum cross-set similarity is noticeably higher, the datasets are more likely to be perturbations of each other.
Since the datasets we deal with in this story all contain a mix of numerical and categorical variables, we need a similarity measure that can accommodate this. We use Gower Similarity¹.
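To make the test concrete, here is a minimal sketch in Python of how Gower similarity and the Maximum Similarity Test could be computed for two mixed-type DataFrames. The function names and the simple per-feature averaging are illustrative assumptions rather than the exact implementation used in this story; a dedicated package could equally be substituted for the hand-rolled similarity.

```python
import numpy as np
import pandas as pd

def gower_similarity(a: pd.DataFrame, b: pd.DataFrame) -> np.ndarray:
    """Pairwise Gower similarity between the rows of two DataFrames with the same
    columns. Numerical features contribute 1 - |x_i - x_j| / range; categorical
    features contribute 1 for a match and 0 otherwise; scores are averaged."""
    num_cols = a.select_dtypes(include="number").columns
    cat_cols = [c for c in a.columns if c not in num_cols]
    sim = np.zeros((len(a), len(b)))
    for col in num_cols:
        combined = pd.concat([a[col], b[col]])
        rng = max(combined.max() - combined.min(), 1e-12)   # avoid divide-by-zero
        diff = np.abs(a[col].to_numpy()[:, None] - b[col].to_numpy()[None, :])
        sim += 1.0 - diff / rng
    for col in cat_cols:
        sim += (a[col].to_numpy()[:, None] == b[col].to_numpy()[None, :]).astype(float)
    return sim / a.shape[1]

def max_similarity_test(observed: pd.DataFrame, synthetic: pd.DataFrame) -> dict:
    """Average maximum intra-set and cross-set Gower similarities."""
    intra_obs = gower_similarity(observed, observed)
    np.fill_diagonal(intra_obs, -np.inf)   # exclude each point's similarity to itself
    intra_syn = gower_similarity(synthetic, synthetic)
    np.fill_diagonal(intra_syn, -np.inf)
    cross = gower_similarity(observed, synthetic)
    return {
        "avg max intra-set (observed)": intra_obs.max(axis=1).mean(),
        "avg max intra-set (synthetic)": intra_syn.max(axis=1).mean(),
        "avg max cross-set": cross.max(axis=1).mean(),
    }
```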
The table and histograms below show the means and distributions of the maximum intra- and cross-set similarities for Datasets 1 and 2.


On average, the instances in one dataset are more similar to their closest neighbors in the other dataset than they are to their closest neighbors in the same dataset. This suggests that the datasets are more likely to be perturbations of each other than random samples from the same parent distribution. And indeed, they are perturbations! Dataset 1 was generated from a Gaussian mixture model; Dataset 2 was generated by selecting (without replacement) an instance from Dataset 1 and applying a small random perturbation.
Ultimately, we will be using the Maximum Similarity Test to compare synthetic datasets with observed datasets. The main danger with synthetic data points being too close to observed points is privacy; i.e., being able to identify points in the observed set from points in the synthetic set. In fact, if you examine Datasets 1 and 2 carefully, you may actually be able to identify some such pairs. And this is for a case in which the average maximum cross-set similarity is only 0.3% larger than the average maximum intra-set similarity!
Modeling and Synthesizing
To finish this first part of the story, let’s create a model for a dataset and use the model to generate synthetic data. We can then use the Maximum Similarity Test to compare the synthetic and observed sets.
The dataset on the left of Figure 4 below is simply Dataset 1 from above. The dataset on the right (Dataset 3) is the synthetic dataset. (We have estimated the distribution as a Gaussian mixture, but that is not important.)
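For concreteness, here is a minimal sketch of this modeling step, assuming observed_numeric is a NumPy array holding the numerical features of Dataset 1; the choice of three mixture components is purely illustrative.

```python
from sklearn.mixture import GaussianMixture

# Fit a Gaussian mixture to the observed data and draw a synthetic sample of
# the same size. observed_numeric is an assumed array of numerical features.
gmm = GaussianMixture(n_components=3, random_state=0).fit(observed_numeric)
synthetic_numeric, _ = gmm.sample(n_samples=len(observed_numeric))
```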

Here are the average maximum similarities and histograms:


The three averages are identical to three significant figures, and the three histograms are very similar. Therefore, according to the Maximum Similarity Test, both datasets can reasonably be considered random samples from the same parent distribution. Our synthetic data generation exercise has been successful, and we have achieved the trifecta of fidelity, utility, and privacy.
Part 2 — Real Datasets, Real Generators
The dataset used in Part 1 is simple and can be easily modeled with just a mixture of Gaussians. However, most real-world datasets are far more complex. In this part of the story, we will apply several synthetic data generators to some popular real-world datasets. Our primary focus is on comparing the distributions of maximum similarities within and between the observed and synthetic datasets, to understand the extent to which they can be considered random samples from the same parent distribution.
The six datasets originate from the UCI repository² and are all popular datasets that have been widely used in the machine learning literature for decades. All are mixed-type datasets, and they were chosen because they vary in their balance of categorical and numerical features.
The six generators are representative of the major approaches used in synthetic data generation: copula-based, GAN-based, VAE-based, and approaches using sequential imputation. CopulaGAN³, GaussianCopula, CTGAN³, and TVAE³ are all available from the Synthetic Data Vault (SDV) libraries⁴; synthpop⁵ is available as an open-source R package; and ‘UNCRi’ refers to the synthetic data generation tool developed under the UNCRi framework⁶. All generators were used with their default settings.
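To give a flavor of the workflow (this is not the exact script used in the experiments), the SDV-based generators can be run along the following lines. The sketch assumes SDV version 1.x, where the import paths below apply (earlier releases exposed the models under different names), and ‘adult.csv’ is a placeholder for whichever observed dataset is being modeled.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

# Load one of the observed datasets (the file name is a placeholder).
observed = pd.read_csv("adult.csv")

# Describe the table so the synthesizer knows which columns are categorical.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(observed)

# Fit CTGAN with default settings and sample a synthetic set of the same size.
synthesizer = CTGANSynthesizer(metadata)
synthesizer.fit(observed)
synthetic = synthesizer.sample(num_rows=len(observed))
```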
Table 1 shows the average maximum intra- and cross-set similarities for each generator applied to each dataset. Entries highlighted in red are those in which privacy has been compromised (i.e., the average maximum cross-set similarity exceeds the average maximum intra-set similarity on the observed data). Entries highlighted in green are those with the highest average maximum cross-set similarity (not including those in red). The last column shows the result of performing a ‘Train on Synthetic, Test on Real’ (TSTR) test, in which a classifier or regressor is trained on the synthetic examples and tested on the real (observed) examples. The Boston Housing dataset is a regression task, and the mean absolute error (MAE) is reported; all other tasks are classification tasks, and the reported value is the area under the ROC curve (AUC).
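A TSTR evaluation is straightforward to sketch. The snippet below is illustrative only: it assumes a binary classification task whose features have already been numerically encoded, with X_syn/y_syn drawn from the synthetic dataset and X_obs/y_obs from the observed one, and it uses a random forest purely as a stand-in for whatever model might be preferred.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Train on Synthetic, Test on Real: fit on synthetic rows, score on observed rows.
clf = RandomForestClassifier(random_state=0).fit(X_syn, y_syn)
auc = roc_auc_score(y_obs, clf.predict_proba(X_obs)[:, 1])
print(f"TSTR AUC: {auc:.3f}")
```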

The figures below display, for each dataset, the distributions of maximum intra- and cross-set similarities corresponding to the generator that attained the highest average maximum cross-set similarity (excluding those highlighted in red above).






From the table, we can see that for the generators that did not breach privacy, the average maximum cross-set similarity is very close to the average maximum intra-set similarity on the observed data. The histograms show the distributions of these maximum similarities, and in most cases the distributions are clearly similar, strikingly so for datasets such as Census Income. The table also shows that the generator achieving the highest average maximum cross-set similarity for each dataset (excluding those highlighted in red) also demonstrated the best performance on the TSTR test (again excluding those in red). Thus, while we can never claim to have discovered the ‘true’ underlying distribution, these results demonstrate that the most effective generator for each dataset has captured the crucial features of that distribution.
Privacy
Only two of the generators displayed issues with privacy: synthpop and TVAE. Each breached privacy on three of the six datasets. In two instances, namely TVAE on Cleveland Heart Disease and TVAE on Credit Approval, the breach was particularly severe. The histograms for TVAE on Credit Approval are shown below and demonstrate that the synthetic examples are far too similar to each other, and also to their closest neighbors in the observed data. The model is a very poor representation of the underlying parent distribution. The reason may be that the Credit Approval dataset contains several numerical features that are extremely highly skewed.

Other observations and comments
The two GAN-based generators, CopulaGAN and CTGAN, were consistently among the worst-performing generators. This was somewhat surprising given the immense popularity of GANs.
The performance of GaussianCopula was mediocre on all datasets except Wisconsin Breast Cancer, for which it attained the equal-highest average maximum cross-set similarity. Its unimpressive performance on the Iris dataset was particularly surprising, given that this is a very simple dataset that can easily be modeled using a mixture of Gaussians, and which we expected would be well-suited to copula-based methods.
The generators that perform most consistently well across all datasets are synthpop and UNCRi, which both operate by sequential imputation. This means they only ever need to estimate and sample from a univariate conditional distribution (e.g., p(x₇ | x₁, x₂, …)), which is generally much easier than modeling and sampling from a multivariate distribution (e.g., p(x₁, x₂, x₃, …)), which is (implicitly) what GANs and VAEs do. Whereas synthpop estimates distributions using decision trees (which are the source of the overfitting that synthpop is prone to), the UNCRi generator estimates distributions using a nearest neighbor-based approach, with hyperparameters optimized using a cross-validation procedure that prevents overfitting.
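To make the contrast concrete, here is a rough sketch of sequential synthesis in the spirit of synthpop’s default decision-tree method. It is an illustrative reconstruction rather than synthpop’s or UNCRi’s actual code, and it assumes that categorical columns have already been integer-encoded and that continuous columns are stored as floats.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

def sequential_synthesize(observed: pd.DataFrame, random_state: int = 0) -> pd.DataFrame:
    """Column-by-column synthesis: each column is sampled conditionally on the
    columns already synthesized, via a decision tree fitted on the observed data."""
    rng = np.random.default_rng(random_state)
    cols = list(observed.columns)
    n = len(observed)
    synthetic = pd.DataFrame(index=range(n))

    # First column: bootstrap-sample its observed marginal distribution.
    synthetic[cols[0]] = rng.choice(observed[cols[0]].to_numpy(), size=n, replace=True)

    for i, col in enumerate(cols[1:], start=1):
        numeric = pd.api.types.is_float_dtype(observed[col])
        tree = DecisionTreeRegressor(min_samples_leaf=5) if numeric \
            else DecisionTreeClassifier(min_samples_leaf=5)
        # Fit a tree for p(x_i | x_1, ..., x_{i-1}) on the observed data, then, for
        # each synthetic row, sample a donor value from the observed values in the
        # leaf that row falls into (sampling the conditional, not point-predicting).
        tree.fit(observed[cols[:i]], observed[col])
        obs_leaves = tree.apply(observed[cols[:i]])
        syn_leaves = tree.apply(synthetic[cols[:i]])
        values = observed[col].to_numpy()
        synthetic[col] = [rng.choice(values[obs_leaves == leaf]) for leaf in syn_leaves]

    return synthetic
```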
Conclusion
Synthetic data generation is a new and evolving field, and while there are still no standard evaluation techniques, there is consensus that tests should cover fidelity, utility, and privacy. But while each of these is important, they are not on an equal footing. For example, a synthetic dataset may achieve good performance on fidelity and utility but fail on privacy. This does not earn it ‘two out of three’: if the synthetic examples are too close to the observed examples (thus failing the privacy test), the model has been overfitted, rendering the fidelity and utility tests meaningless. There has been a tendency among some vendors of synthetic data generation software to propose single-score measures of performance that combine results from a multitude of tests. This is essentially based on the same ‘two out of three’ logic.
If a synthetic dataset can be considered a random sample from the same parent distribution as the observed data, then we cannot do any better: we have achieved maximum fidelity, utility, and privacy. The Maximum Similarity Test provides a measure of the extent to which two datasets can be considered random samples from the same parent distribution. It is based on the simple and intuitive notion that if an observed and a synthetic dataset are random samples from the same parent distribution, instances should be distributed such that a synthetic instance is, on average, as similar to its closest observed instance as an observed instance is to its closest observed instance.
We propose the following single-score measure of synthetic dataset quality:
quality score = (average maximum cross-set similarity) / (average maximum intra-set similarity on the observed data)
The closer this ratio is to 1 (without exceeding 1), the better the quality of the synthetic data. It should, of course, be accompanied by a sanity check of the histograms.
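In code, and reusing the hypothetical max_similarity_test() helper sketched in Part 1, the score could be computed along these lines.

```python
# Proposed quality score: average maximum cross-set similarity divided by the
# average maximum intra-set similarity on the observed data.
stats = max_similarity_test(observed, synthetic)
score = stats["avg max cross-set"] / stats["avg max intra-set (observed)"]
print(f"Quality score: {score:.3f} (values above 1 signal a privacy risk)")
```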
References
[1] Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857–871.
[2] Dua, D. & Graff, C. (2017). UCI Machine Learning Repository. Available at: http://archive.ics.uci.edu/ml.
[3] Xu, L., Skoularidou, M., Cuesta-Infante, A., & Veeramachaneni, K. (2019). Modeling tabular data using Conditional GAN. NeurIPS 2019.
[4] Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The Synthetic Data Vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 399–410). IEEE.
[5] Nowok, B., Raab, G.M., & Dibben, C. (2016). synthpop: Bespoke Creation of Synthetic Data in R. Journal of Statistical Software, 74(11), 1–26.
[6] http://skanalytix.com/uncri-framework
[7] Harrison, D., & Rubinfeld, D.L. (1978). Boston Housing Dataset. Kaggle. https://www.kaggle.com/c/boston-housing. Licensed for commercial use under the CC: Public Domain license.
[8] Kohavi, R. (1996). Census Income. UCI Machine Learning Repository. archive.ics.uci.edu/dataset/20/census+income. Licensed for commercial use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
[9] Janosi, A., Steinbrunn, W., Pfisterer, M., & Detrano, R. (1988). Heart Disease. UCI Machine Learning Repository. archive.ics.uci.edu/dataset/45/heart+disease. Licensed for commercial use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
[10] Quinlan, J.R. (1987). Credit Approval. UCI Machine Learning Repository. archive.ics.uci.edu/dataset/27/credit+approval. Licensed for commercial use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
[11] Fisher, R.A. (1988). Iris. UCI Machine Learning Repository. archive.ics.uci.edu/dataset/53/iris. Licensed for commercial use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
[12] Wolberg, W., Mangasarian, O., Street, N., & Street, W. (1995). Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository. archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. Licensed for commercial use under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
