
Differentially Private Synthetic Data Generation [DP-SDG] - Experimental Setups & Knowledge Base - WORK IN PROGRESS


stefanrmmr/differentially_private_synthetic_data


Experimental Implementation of DP-WGAN
Differentially Private Synthetic Data Generation

For Continuous Data with binary Targets using the Differentially Private Wasserstein GAN

  1. DP-WGAN Synthetic Data for "Health care: Heart attack possibility" Kaggle Dataset --> view Notebook
  2. DP-WGAN Synthetic Data for "BankNote Authentication UCI" Kaggle Dataset --> view Notebook


Metrics achieved for DP-WGAN on the Heart Disease Dataset


[Figure: synthdata_sc1]

*After multiple attempts using normalized input data, with epsilon ≈ 3.4 and delta = 1e-5.

Process Steps & Key Concepts

  • The data must be in CSV format and must be split into train and test sets before it is fed to the models.
  • Missing values are not supported; the user must replace them appropriately beforehand.
  • If the data contains continuous and categorical attributes, it must be pre-processed
    (discretization for continuous values, encoding for categorical attributes).
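
The preparation steps above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the notebook's actual code: the toy CSV, its column names, the median imputation, the equal-width binning and the 25% test fraction are all assumptions chosen for the example. In practice a library such as pandas or scikit-learn would usually handle these steps.

```python
import csv
import io
import random
import statistics

# Hypothetical toy data; the column names are illustrative only.
RAW_CSV = """age,chest_pain,target
63,typical,1
37,atypical,1
,typical,0
56,none,0
57,atypical,1
44,none,0
"""

def load_rows(text):
    """Parse CSV text into dicts, mapping empty cells to None."""
    return [{k: (v if v != "" else None) for k, v in row.items()}
            for row in csv.DictReader(io.StringIO(text))]

def impute_median(rows, column):
    """Fill missing numeric values with the column median (one possible choice)."""
    observed = [float(r[column]) for r in rows if r[column] is not None]
    median = statistics.median(observed)
    for r in rows:
        r[column] = median if r[column] is None else float(r[column])

def discretize(rows, column, n_bins):
    """Replace a continuous column with equal-width bin indices 0..n_bins-1."""
    values = [r[column] for r in rows]
    low, high = min(values), max(values)
    width = (high - low) / n_bins or 1.0  # guard against a constant column
    for r in rows:
        r[column] = min(int((r[column] - low) / width), n_bins - 1)

def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per category."""
    categories = sorted({r[column] for r in rows})
    for r in rows:
        value = r.pop(column)
        for c in categories:
            r[f"{column}_{c}"] = int(value == c)

def split_train_test(rows, test_fraction=0.25, seed=0):
    """Shuffle reproducibly and split into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = load_rows(RAW_CSV)
impute_median(rows, "age")          # no missing values remain afterwards
discretize(rows, "age", n_bins=3)   # continuous -> discrete
one_hot(rows, "chest_pain")         # categorical -> indicator columns
for r in rows:
    r["target"] = int(r["target"])
train, test = split_train_test(rows)
```

Only `train` would be passed to the generative model; `test` stays untouched until evaluation.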

  • The GAN-based generative models are trained on the training dataset.
  • The trained generative model is used to create a synthetic version of the training dataset.
  • To compensate for irregularities, multiple GAN generator models are trained.
  • For the same reason, multiple synthetic datasets are generated, and
    the best-performing dataset (the one that yields the maximum AUC) is selected.
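
The selection step can be illustrated as follows. This sketch does not train a DP-WGAN; it assumes that a classifier has already been trained on each synthetic dataset and has produced scores on the real test set. The run names and score values are made up for the example, and `auc` and `select_best` are hypothetical helper names, not functions from the original repository.

```python
def auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def select_best(candidates, evaluate):
    """Score every candidate and return (name of the best one, all scores)."""
    results = {name: evaluate(data) for name, data in candidates.items()}
    best = max(results, key=results.get)
    return best, results

# Labels of the held-out real test set (illustrative values).
test_labels = [1, 0, 1, 0, 1]

# Hypothetical predicted scores on the test set, one list per synthetic dataset.
candidate_scores = {
    "run_a": [0.9, 0.2, 0.8, 0.4, 0.7],
    "run_b": [0.6, 0.7, 0.8, 0.3, 0.4],
    "run_c": [0.5, 0.5, 0.5, 0.5, 0.5],
}

best_run, all_aucs = select_best(candidate_scores, lambda s: auc(test_labels, s))
print(best_run, all_aucs)
```

Because the best run is picked using the test set, the reported maximum AUC is an optimistic estimate; a separate validation split would avoid that, at the cost of less evaluation data.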

  • Logistic regression classifiers are trained on the real data as well as on the synthetically generated dataset.
  • Both classifiers are evaluated on the held-out real test dataset (reserved for evaluation).
  • Relevant metrics (mainly AUC) and visualizations of the correlation matrices of the synthetic datasets are generated.
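
The evaluation protocol above (train on real, train on synthetic, test both on real) might look like the sketch below. Everything in it is an assumption made for illustration: the toy Gaussian data, the plain gradient-descent logistic regression, and especially the "synthetic" data, which is just a noisy copy of the training set standing in for generator output and carries no differential privacy guarantee. A library classifier such as scikit-learn's `LogisticRegression` would normally replace the hand-written one.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logreg(X, y, lr=0.5, epochs=300):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(model, X):
    """Predicted probability of the positive class for each row."""
    w, b = model
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def auc(labels, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def make_data(n, rng):
    """Toy binary-target data: two Gaussian features whose mean depends on the label."""
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.5 if label else -1.5
        X.append([rng.gauss(shift, 1.0), rng.gauss(shift, 1.0)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)   # held-out real data, used only for evaluation

# Stand-in for generator output: a noisy copy of the training data.
# This is NOT differentially private; it only makes the example runnable.
X_synth = [[v + rng.gauss(0.0, 0.5) for v in row] for row in X_train]
y_synth = list(y_train)

auc_real = auc(y_test, predict(fit_logreg(X_train, y_train), X_test))
auc_synth = auc(y_test, predict(fit_logreg(X_synth, y_synth), X_test))
print(f"AUC trained on real: {auc_real:.3f}, trained on synthetic: {auc_synth:.3f}")
```

The gap between the two AUC values is the quantity of interest: the smaller it is, the more of the real data's predictive signal the synthetic dataset has preserved.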

Acknowledgements & Sources

Major parts of this summary notebook were extracted from this BOREALIS Private Data Generation GitHub repository by BorealisAI. Note that this Jupyter notebook covers only one generative model (DP-WGAN) among the various datasets and generative models available for differentially private synthetic data generation. The analysis approaches described above yielded the results shown here, as extracted from the original notebook. For more information regarding the differential privacy parameters delta and epsilon, please refer to this info page by Microsoft.
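
For quick reference, the standard definition behind the two parameters is reproduced here. It is the general textbook statement of (ε, δ)-differential privacy, not something specific to this repository; the symbols M (the randomized training mechanism), D and D' (datasets differing in one record) and S (a set of possible outputs) are introduced only for this formula.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
\qquad \text{for all neighboring datasets } D, D' \text{ and all output sets } S
```

With the values quoted for the heart disease experiment (epsilon of about 3.4, delta = 1e-5), the multiplicative factor e^ε is roughly 30, and δ bounds the probability mass for which that factor may be exceeded.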