Assessing the Use of Synthetic Tabular Data: An Analysis of the Adult Dataset

Abdullah Balamash, King Abdulaziz University, SAUDI ARABIA

Abstract

In this study, we evaluate the effectiveness of synthetic data generated with the Gaussian Copula (GC) and Conditional Tabular Generative Adversarial Network (CTGAN) models of the Synthetic Data Vault (SDV) framework by applying them to the Adult dataset [1]. The evaluation covers four settings: (1) training and testing on real data (baseline), (2) training on synthetic data and testing on real data (TSTR), (3) training on combined synthetic and real data and testing on real data (augmentation with real data), and (4) minority class oversampling. The TSTR results show that synthetic data preserves the predictive features of the real (baseline) data. Augmentation with real data does not improve performance when enough real data is available. In a class-imbalance scenario, synthetic minority oversampling improves recall for the minority class (e.g., from 0.60 to 0.80 with CTGAN) at the expense of precision, and it underperforms traditional techniques such as random oversampling and class weighting. Overall, our findings suggest that synthetic data can be used when real data is scarce, but it is not good enough to address class imbalance.
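The four settings named in the abstract can be illustrated with a minimal sketch. It is not the author's code: `make_training_sets`, `bootstrap_sampler`, and `precision_recall` are hypothetical helper names, the bootstrap sampler is only a stand-in for a fitted GC or CTGAN model, and the choices to draw as many synthetic rows as there are real rows and to oversample until the classes are balanced are assumptions, since the abstract does not state them. In every setting the trained model would be scored on the same held-out real test set.

```python
import random


def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the class treated as positive (here, the minority)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def bootstrap_sampler(rows, seed=0):
    """Stand-in generator (assumption): resamples real rows instead of
    sampling from a fitted GC or CTGAN model."""
    rng = random.Random(seed)

    def sample(n, label=None):
        # Optionally restrict sampling to one class, as a conditional generator would.
        pool = [r for r in rows if label is None or r[1] == label]
        return [rng.choice(pool) for _ in range(n)]

    return sample


def make_training_sets(real_train, sample_synthetic, minority=1):
    """Assemble one training set per setting from (features, label) rows."""
    n_min = sum(1 for _, y in real_train if y == minority)
    n_maj = len(real_train) - n_min
    # Assumption: the synthetic set has the same size as the real training set.
    synthetic = sample_synthetic(len(real_train))
    # Assumption: add minority rows until both classes have equal counts.
    extra_minority = sample_synthetic(max(n_maj - n_min, 0), label=minority)
    return {
        "baseline": list(real_train),                      # (1) real only
        "tstr": synthetic,                                 # (2) synthetic only
        "augmented": list(real_train) + synthetic,         # (3) synthetic + real
        "oversampled": list(real_train) + extra_minority,  # (4) minority oversampling
    }


if __name__ == "__main__":
    # Toy imbalanced data: six majority rows and two minority rows.
    real = [((0.0,), 0)] * 6 + [((1.0,), 1)] * 2
    sets = make_training_sets(real, bootstrap_sampler(real))
    for name, rows in sets.items():
        print(name, len(rows))
```

Precision and recall are reported separately because the abstract's main finding for setting (4) is a trade-off: recall for the minority class rises while precision falls.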

Recommended Citation

Balamash, Abdullah (2026) "Assessing the Use of Synthetic Tabular Data: An Analysis of the Adult Dataset," Journal of King Abdulaziz University: Engineering Sciences: Vol. 36: Iss. 1, Article 3.
DOI: https://doi.org/10.64064/1658-4260.1020

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 International License.
