Abstract

In this study, we evaluate the effectiveness of synthetic data generated using the Gaussian Copula (GC) and Conditional Tabular Generative Adversarial Network (CTGAN) models within the Synthetic Data Vault (SDV) framework by applying them to the Adult data set [1]. The evaluation covers four settings: (1) training and testing on real data (baseline), (2) training on synthetic data and testing on real data (TSTR), (3) training on combined synthetic and real data and testing on real data (augmentation with real data), and (4) minority class oversampling. The TSTR results show that synthetic data preserves the predictive features of the real (baseline) data. Augmentation with real data does not improve performance when enough real data is available. In a class-imbalance scenario, synthetic minority oversampling improves recall for the minority class (e.g., from 0.60 to 0.80 with CTGAN) at the expense of precision, and it underperforms traditional techniques such as random oversampling and class weighting. Overall, our findings suggest that synthetic data can be used when real data is scarce, but it is not good enough to address class imbalance.

First Page

23

Last Page

28

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
