Add support for generated columns when conditional sampling #1994

srinify · 2024-05-07T13:45:53Z

Problem Description

Every column that the SDV synthesizes falls into one of two buckets:

Modeled Columns: The data in these columns is modeled, e.g. numerical, datetime, boolean or categorical data
Generated Columns: The data in these columns is generated from scratch without modeling, e.g. primary keys, PII values

Currently, you can't conditionally sample using ID, primary key, or other generated columns.

Expected behavior

As a user, I expect to be able to conditionally sample on any column(s) I see fit.

Additional context

I expect the following code to work:

import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo

data, metadata = download_demo(
    modality='single_table',
    dataset_name='census_extended'
)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data)
synthesizer.sample_remaining_columns(data[['id', 'workclass']].head(10))

Related to this issue: #1096
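
Until generated columns are supported, one possible workaround is to condition only on the modeled columns and then write the requested generated values back onto the result. The sketch below is not part of the SDV API: `sample_keeping_generated` is a hypothetical helper, and it assumes `sample_remaining_columns` returns exactly one row per input row, in the same order as the input. It refuses to proceed if the row counts differ.

```python
import pandas as pd


def sample_keeping_generated(
    synthesizer, known: pd.DataFrame, generated_columns: list[str]
) -> pd.DataFrame:
    """Conditionally sample on modeled columns, then re-attach generated ones.

    Hypothetical helper, not part of SDV. `generated_columns` names the
    columns in `known` that the synthesizer generates from scratch
    (e.g. primary keys) and therefore cannot condition on today.
    """
    generated = [col for col in generated_columns if col in known.columns]
    modeled_known = known.drop(columns=generated)
    if modeled_known.shape[1] == 0:
        raise ValueError("At least one modeled column is needed to condition on.")

    # Condition only on the modeled columns, which is already supported.
    sampled = synthesizer.sample_remaining_columns(known_columns=modeled_known)

    # ASSUMPTION: one output row per input row, in the same order.
    # If rows were dropped, positional re-attachment would be wrong.
    if len(sampled) != len(known):
        raise ValueError(
            f"Expected {len(known)} sampled rows, got {len(sampled)}."
        )

    sampled = sampled.reset_index(drop=True)
    for col in generated:
        # Overwrite whatever the synthesizer generated with the requested values.
        sampled[col] = known[col].to_numpy()
    return sampled
```

With the synthesizer fitted above, this would be called as `sample_keeping_generated(synthesizer, data[['id', 'workclass']].head(10), ['id'])`. Keeping the supplied primary key values unique remains the caller's responsibility.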

srinify added the labels "feature request" (Request for a new feature) and "feature:sampling" (Related to generating synthetic data after a model is built) on May 7, 2024
