Repeated sequence_index values in specific situations #2004
Comments
Thanks for filing, @Ng-ms. Adding a link to our previous conversation. As with all these issues, it is very beneficial to us if you can provide some data to help us replicate it. Otherwise, it may take us longer to find the root cause.
Summary
Next Steps
The SDV team will investigate the cause of this issue. Internally, I would want to check for two things:
Thank you, @npatki. Unfortunately, I am not able to share the data since it belongs to a hospital. min_max_scaler = MinMaxScaler() while here in the real data, Series([], dtype: int64)
Hi @Ng-ms, thanks. We will try to replicate it, but be aware that it may take us some time since we don't have a dataset to work with. Thank you for running the diagnostic. These scores are supposed to be 1.0. Just to confirm, in this example:
Hello, yes: pre_date is a context column and visit_date is the sequence index.
Hi there @Ng-ms, I tried to replicate this with our demo datasets but unfortunately haven't been able to yet. If you're able to meet us halfway by modifying one of our demo datasets (deleting rows, adding context columns, formatting datetime values in a similar way to your dataset, etc.) to force this error to occur, that would be a massive help for us. The following code snippet can get you started with a demo dataset that has a datetime column:
Either way, we'll keep this issue open to see if others have run into it and can help us reproduce this.
I can confirm the issue @Ng-ms described, as I have the same problem with my dataset.
Environment:
I'm also not able to share the data as it's confidential. However, I can provide some more information and a guess at what's causing the issue. The dataset I use has the following structure and properties:
The synthesizer is trained via (more epochs were also tested, but they don't change the issue):
The synthetic data is generated via:
What I've noticed is the following:
I suspect the problem lies in the way the synthetic sequences are generated. I haven't looked into the code details, so the following is only my guess at how the process works:
If my guess is correct, I would suggest an option to exclude the sequence_index column from the enforce_min_max_values option. Otherwise, the algorithm has to plan ahead when starting a sequence: it needs to determine whether the sequence will reach the maximum limit and, if so, adapt accordingly, e.g. by adjusting the gap it uses based on the number of entries that will be generated in the sequence.
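The suspected mechanism can be illustrated with a toy model. This is only a sketch of the guess above, not SDV's actual implementation: the function `generate_sequence`, the fixed gap, and the dates are all hypothetical. It shows how clipping a growing sequence index to the maximum value seen in training produces repeated values once the sequence "runs out" of dates.

```python
from datetime import date, timedelta


def generate_sequence(start, gap_days, n_rows, max_date=None):
    """Toy model (hypothetical): step forward by a fixed gap for each row,
    then clip each value to max_date if a maximum is enforced."""
    values = []
    current = start
    for _ in range(n_rows):
        clipped = min(current, max_date) if max_date is not None else current
        values.append(clipped)
        current += timedelta(days=gap_days)
    return values


# Assume the latest date seen in the real data is 2020-01-31.
max_seen = date(2020, 1, 31)

# With clipping, every value past the maximum collapses onto the maximum.
seq = generate_sequence(date(2020, 1, 1), gap_days=10, n_rows=7, max_date=max_seen)

# Without clipping, all seven values stay distinct.
unclipped = generate_sequence(date(2020, 1, 1), gap_days=10, n_rows=7)

print(seq)
print(len(set(seq)), "unique values out of", len(seq))
```

In this toy model the last four rows all land on the maximum date, which matches the symptom of repeated sequence_index values at the end of a sequence.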
@Scit3ch, very thorough and excellent analysis! You seem to be right about this, and I was able to replicate this on a much simpler dataset where I experimented with decreasing
In simpler terms, trying to synthesize a sequence of 7 rows where one of the original sequences only had 5 rows caused SDV to generate duplicates and essentially "run out" of dates to generate because of
Context columns don't seem to matter here.
Thanks for all the help! I will close out this issue so we can track the work and discussion better in the bug report I've opened: #2031
Some short-term workarounds with tradeoffs:
Environment Details
Please indicate the following details about the environment in which you found the bug:
After applying the workaround in #1973, the sequence_index is no longer NaN, but I notice repeated sequence index values. In my dataset the sequence index is the date of the visit, so each value should be unique because each visit is unique.
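One way to confirm the symptom is to count how often each sequence index value appears within a sequence. The sketch below uses only the standard library and made-up rows. The helper `repeated_sequence_indexes` and the `patient_id` sequence key are hypothetical names; `visit_date` is the sequence index mentioned in this thread.

```python
from collections import Counter


def repeated_sequence_indexes(rows, sequence_key, sequence_index):
    """Return {(sequence key, index value): count} for every index value
    that occurs more than once within the same sequence."""
    counts = Counter((row[sequence_key], row[sequence_index]) for row in rows)
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up rows standing in for synthetic output.
rows = [
    {"patient_id": "A", "visit_date": "2020-01-10"},
    {"patient_id": "A", "visit_date": "2020-01-31"},
    {"patient_id": "A", "visit_date": "2020-01-31"},  # repeated visit date
    {"patient_id": "B", "visit_date": "2020-01-31"},  # same date, other sequence
]

duplicates = repeated_sequence_indexes(rows, "patient_id", "visit_date")
print(duplicates)
```

If the synthetic data is a pandas DataFrame, `synthetic_data.to_dict("records")` should give rows in this shape. The same date appearing in different sequences is not flagged, since only repeats within one sequence violate the uniqueness expectation.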