synthetic data format is wrong based up on real data -Please help #1943

Closed
Vasanthpravin opened this issue Apr 22, 2024 · 2 comments
Labels
resolution:WAI The software is working as intended

Comments

@Vasanthpravin

sdv version 1.12.0
databricks 13.3 LTS

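from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Convert the Spark DataFrame to pandas
dp_pandas = df.toPandas()

# Auto-detect the metadata (sdtypes) from the data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(dp_pandas)

synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
synthesizer.fit(data=dp_pandas)

synthetic_data = synthesizer.sample(num_rows=50)

display(synthetic_data)
# null_counts was removed in pandas 2.0; use show_counts there
dp_pandas.info(verbose=True, null_counts=False)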

The output looks like this:

Person id        PhoneId
sdv-pii-btwry    sdv-id-0

Both columns are numeric in the real data, but SDV is generating "sdv-..." strings like these. Please help.

@Vasanthpravin added the labels bug (Something isn't working) and new (Automatic label applied to new issues) on Apr 22, 2024
@srinify
Contributor

srinify commented May 2, 2024

Hi there @Vasanthpravin

When you run metadata.detect_from_dataframe(dp_pandas), SDV makes a best-effort attempt to automatically infer the metadata (and hence, the sdtypes) for all of your columns.

However, this process isn't perfect and we always recommend double checking the metadata to make sure it matches what you expect. You can display the metadata object to get a read-out of the auto-detected sdtypes:

print(metadata)
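One way to do that check programmatically is to compare the detected sdtypes with what you expect. This is only a minimal sketch: it assumes the metadata has been converted to a plain dictionary with metadata.to_dict() and that the result has a top-level 'columns' mapping from column name to a dict containing 'sdtype'. The sample dictionary and the find_sdtype_mismatches helper are hypothetical illustrations, not part of SDV, and the detected values shown are guesses rather than actual output.

```python
def find_sdtype_mismatches(metadata_dict, expected_sdtypes):
    """Return {column: (detected, expected)} for columns whose sdtype differs."""
    columns = metadata_dict.get("columns", {})
    mismatches = {}
    for name, expected in expected_sdtypes.items():
        # detected is None when the column is missing from the metadata
        detected = columns.get(name, {}).get("sdtype")
        if detected != expected:
            mismatches[name] = (detected, expected)
    return mismatches


# Hypothetical stand-in for metadata.to_dict(); real detected values may differ.
detected_metadata = {
    "columns": {
        "Person id": {"sdtype": "unknown", "pii": True},
        "PhoneId": {"sdtype": "id"},
    }
}

# What the issue author expects: both columns are numbers.
expected = {"Person id": "numerical", "PhoneId": "numerical"}

mismatches = find_sdtype_mismatches(detected_metadata, expected)
for column, (detected, wanted) in mismatches.items():
    print(f"{column}: detected {detected!r}, expected {wanted!r}")
```

Any column reported here is a candidate for a manual sdtype update before fitting the synthesizer.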

Then, you can update the sdtype of multiple columns at once using the update_columns_metadata method from SingleTableMetadata:

metadata.update_columns_metadata(
  column_metadata = {
    'personid': { 'sdtype': 'numerical' },
    'phoneid': { 'sdtype': 'phone_number' }
  }
)
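Note that the keys in the call above ('personid', 'phoneid') are spelled differently from the column headers in the posted output ('Person id', 'PhoneId'), and the keys most likely need to match the DataFrame's column names exactly. A minimal sketch of re-keying the overrides to the real column names follows; the normalize and align_override_keys helpers are hypothetical, and the column names are assumed from the output posted in the issue.

```python
import re


def normalize(name):
    """Lowercase a column name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def align_override_keys(overrides, actual_columns):
    """Re-key overrides so each key uses the exact spelling in actual_columns."""
    by_normalized = {normalize(col): col for col in actual_columns}
    aligned = {}
    for key, spec in overrides.items():
        match = by_normalized.get(normalize(key))
        if match is None:
            raise KeyError(f"No column resembling {key!r} in {list(actual_columns)}")
        aligned[match] = spec
    return aligned


# Column names as shown in the posted output (assumed, not confirmed).
actual_columns = ["Person id", "PhoneId"]

overrides = {
    "personid": {"sdtype": "numerical"},
    "phoneid": {"sdtype": "phone_number"},
}

column_metadata = align_override_keys(overrides, actual_columns)
print(column_metadata)
```

With a real DataFrame you would pass dp_pandas.columns as actual_columns and then hand the result to metadata.update_columns_metadata(column_metadata=column_metadata).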

Then you can create your synthesizer object, fit the model, and sample:

synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
synthesizer.fit(data=dp_pandas)
synthetic_data = synthesizer.sample(num_rows=50)

@srinify added the label under discussion (Issue is currently being discussed) and removed the labels bug (Something isn't working) and new (Automatic label applied to new issues) on May 2, 2024
@srinify
Contributor

srinify commented May 17, 2024

Hi there @Vasanthpravin I'm closing out this issue for now, as it seems like there isn't a clear bug here. But let me know if you're still running into the issue or uncover a related bug and we can re-open the issue!

@srinify closed this as completed on May 17, 2024
@srinify added the label resolution:WAI (The software is working as intended) and removed the label under discussion (Issue is currently being discussed) on May 17, 2024