Loss values are good, but the quality of the synthetic data is bad... How?? #2010
Comments
Hi there @ilkayyuksel 👋 Do you mind sharing some visualizations of what your marginal distributions look like? This would help us understand if they're bimodal, skewed, etc. In general, the loss chart looks good, and that can correlate with high quality synthetic data, but it's not always the case with CTGAN. GANs in general can be cumbersome to tweak (which is often why we point people to using Gaussian Copulas instead!) but it seems like this is the approach you'll need to take. Some potential avenues to consider:
Hi there @ilkayyuksel I'm closing this issue out for now since I haven't heard from you in a while. But comment here and we can re-open if you still need guidance! I'd also encourage you to join our Slack community if you aren't there already :)
I am using the CTGAN model for my master's thesis. I want to generate synthetic data from the UNSW_NB15 dataset (an intrusion detection system dataset, so it contains attacks). Specifically, I want to generate synthetic 'Generic attacks' samples, for which there are 58871 real samples to train with.
I have trained my CTGAN model with the following code:
Loss values:
Those are the loss values for my generator and discriminator. Judging by discussion #980, you would expect the CTGAN model to generate really good synthetic data.
But if I use the metrics from SDV, comparing the real data with the synthetic data, the scores from the metrics are bad:
KS_complement:
TV_complement:
The visual distributions of each feature are also bad.
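For readers unfamiliar with the two scores named above, here is a minimal pure-Python sketch of what they measure, written from the metric definitions rather than taken from the SDMetrics source (where they are called KSComplement and TVComplement). The function names `ks_complement` and `tv_complement` and the toy column values are assumptions made for illustration; they are not data from this issue. Both scores lie in [0, 1], with 1 meaning the real and synthetic marginal distributions match.

```python
from bisect import bisect_right
from collections import Counter


def ks_complement(real, synthetic):
    """Return 1 minus the two-sample Kolmogorov-Smirnov statistic (numeric columns)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    max_gap = 0.0
    # The largest gap between two empirical CDFs is attained at an observed value.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return 1.0 - max_gap


def tv_complement(real, synthetic):
    """Return 1 minus the total variation distance (categorical columns)."""
    real_freq = Counter(real)
    synth_freq = Counter(synthetic)
    categories = set(real_freq) | set(synth_freq)
    # Counter returns 0 for categories that are missing on one side.
    tvd = 0.5 * sum(
        abs(real_freq[c] / len(real) - synth_freq[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - tvd


# Toy values (hypothetical, for illustration only).
real_numeric = [0.1, 0.2, 0.2, 0.4, 0.9]
synth_numeric = [0.1, 0.3, 0.5, 0.5, 0.8]
real_categorical = ["udp", "udp", "tcp", "tcp"]
synth_categorical = ["udp", "udp", "udp", "tcp"]

print("KS complement:", ks_complement(real_numeric, synth_numeric))
print("TV complement:", tv_complement(real_categorical, synth_categorical))
```

Because both scores compare one column at a time, a low score points at specific marginals the model failed to reproduce, which is independent of how the generator and discriminator losses look.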
Can you help me? What did I do wrong? Why are the fake samples of such bad quality?
PS. If I use SMOTE, the scores of the SDV metrics are better. But I have to use a GAN model...