synthetic data format is wrong based up on real data -Please help #1943

Vasanthpravin · 2024-04-22T08:14:03Z

sdv version: 1.12.0
databricks 13.3 LTS

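from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Convert the Spark DataFrame to a pandas DataFrame
dp_pandas = df.toPandas()

# Auto-detect the metadata (sdtypes) from the data
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(dp_pandas)

synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
synthesizer.fit(data=dp_pandas)

synthetic_data = synthesizer.sample(num_rows=50)

display(synthetic_data)
dp_pandas.info(verbose=True, show_counts=False)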

The output is:

Person id        PhoneId
sdv-pii-btwry    sdv-id-0

Both columns are numeric in the real data, but SDV generates values like these. Please help.

srinify · 2024-05-02T21:16:03Z

Hi there @Vasanthpravin

When you run metadata.detect_from_dataframe(dp_pandas), SDV makes a best-effort guess to automatically infer the metadata (and hence, the sdtypes) for all of your columns.

However, this process isn't perfect and we always recommend double checking the metadata to make sure it matches what you expect. You can display the metadata object to get a read-out of the auto-detected sdtypes:

print(metadata)

Then, you can update the sdtype of multiple columns at once using the update_columns_metadata method from SingleTableMetadata:

metadata.update_columns_metadata(
  column_metadata = {
    'personid': { 'sdtype': 'numerical' },
    'phoneid': { 'sdtype': 'phone_number' }
  }
)
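For a table with many columns, the overrides could also be built programmatically instead of by hand. The following is only a sketch: it assumes the dictionary returned by metadata.to_dict() has a 'columns' key that maps each column name to its detected settings, and the helper name build_sdtype_overrides, the sample dictionary, and the detected sdtypes in it are hypothetical stand-ins, not values taken from the real table.

```python
def build_sdtype_overrides(metadata_dict, expected_sdtypes):
    """Return {column: {'sdtype': expected}} for columns whose detected sdtype differs."""
    detected = metadata_dict.get("columns", {})
    overrides = {}
    for column, expected in expected_sdtypes.items():
        if column not in detected:
            raise KeyError(f"Column {column!r} is not in the metadata")
        if detected[column].get("sdtype") != expected:
            overrides[column] = {"sdtype": expected}
    return overrides


# Hypothetical stand-in for metadata.to_dict(); the detected sdtypes are assumed.
detected_metadata = {
    "columns": {
        "personid": {"sdtype": "unknown", "pii": True},
        "phoneid": {"sdtype": "id"},
        "age": {"sdtype": "numerical"},
    }
}

# The sdtypes you expect, using the same choices as the snippet above.
expected = {"personid": "numerical", "phoneid": "phone_number", "age": "numerical"}

overrides = build_sdtype_overrides(detected_metadata, expected)
print(overrides)
```

With a real metadata object, the result would be passed as metadata.update_columns_metadata(column_metadata=overrides).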

Then you can create your synthesizer object, fit the model, and sample:

synthesizer = GaussianCopulaSynthesizer(metadata=metadata)
synthesizer.fit(data=dp_pandas)
synthetic_data = synthesizer.sample(num_rows=50)
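After sampling, a quick check can confirm that no column still contains placeholder-style values such as the sdv-pii-btwry and sdv-id-0 strings reported above. This is a heuristic sketch only: the helper name and the prefix list are assumptions based on the output in this issue, and it works on a plain dict of column name to values (for a pandas DataFrame, synthetic_data.to_dict(orient="list") produces that shape).

```python
def columns_with_placeholder_values(columns, prefixes=("sdv-pii-", "sdv-id-")):
    """Return the names of columns that contain any string starting with a prefix."""
    flagged = []
    for name, values in columns.items():
        if any(isinstance(v, str) and v.startswith(prefixes) for v in values):
            flagged.append(name)
    return flagged


# Sample rows modeled on the output reported in this issue, plus a made-up numeric column.
sample = {
    "Person id": ["sdv-pii-btwry"],
    "PhoneId": ["sdv-id-0"],
    "age": [31],
}

flagged = columns_with_placeholder_values(sample)
print(flagged)
```

Any column that is flagged is a candidate for an sdtype override in the metadata.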

srinify · 2024-05-17T13:05:24Z

Hi there @Vasanthpravin I'm closing out this issue for now, as it seems like there isn't a clear bug here. But let me know if you're still running into the issue or uncover a related bug and we can re-open the issue!

Vasanthpravin added the bug and new labels Apr 22, 2024

npatki mentioned this issue Apr 22, 2024

SDV not generated synthetic data as per real data ..Please help its very urgent #1945

Closed

srinify added the under discussion label and removed the bug and new labels May 2, 2024

srinify closed this as completed May 17, 2024

srinify added the resolution:WAI label (the software is working as intended) and removed the under discussion label May 17, 2024
