DEAGO : negative values for categorical features inside the data #34

Open
eponraj27392 opened this issue Aug 19, 2020 · 3 comments

Comments

@eponraj27392

Hello,
I am working with a dataset that contains both categorical and continuous features.
When I apply DEAGO, it outputs negative values for the categorical features after sampling.
Could someone let me know whether DEAGO doesn't support categorical features in the dataset?

@gykovacs
Member

Hi, only a handful of oversampling techniques consider categorical variables, and even for those, the categorical handling is not implemented in the smote-variants package. Most oversampling techniques operate in Euclidean space, treating all attributes as continuous. A common way to use oversampling techniques with categorical variables is to encode the categorical variables, for example using one-hot encoding. The oversampling techniques might then produce fractional feature values, but from the regression point of view this is not a problem, as it just expresses that the sample might be somewhere between the two categories.
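
To make the "somewhere between the two categories" point concrete, here is a minimal sketch of the linear interpolation that SMOTE-style methods use between two samples. It is a generic illustration, not DEAGO's actual algorithm, and the `interpolate` helper and the example vectors are hypothetical.

```python
def interpolate(a, b, t):
    """Return the point a + t * (b - a) on the segment between a and b."""
    return [x + t * (y - x) for x, y in zip(a, b)]


# Two samples of one 3-category feature, one-hot encoded (made-up data).
sample_a = [1.0, 0.0, 0.0]  # category 0
sample_b = [0.0, 1.0, 0.0]  # category 1

# A synthetic sample a quarter of the way from sample_a to sample_b.
synthetic = interpolate(sample_a, sample_b, 0.25)
print(synthetic)  # [0.75, 0.25, 0.0]
```

The synthetic sample is no longer a valid one-hot vector. Techniques that do more than interpolate inside the segment can also leave the [0, 1] range, which would be one way to end up with negative values.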

Alternatively, once the one-hot encoding is done and the oversampling is applied, you might convert the oversampled fractional values to crisp binary ones to keep the categorical nature.
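
One possible way to do that conversion is to keep, for each one-hot block, only the category with the largest value. This is a sketch under assumptions: the `snap_one_hot` helper, the `groups` argument, and the sample rows are hypothetical, and picking the largest entry is just one reasonable rule.

```python
def snap_one_hot(rows, groups):
    """Return a copy of rows in which every one-hot block is crisp (0/1).

    rows   -- list of feature vectors (lists of floats)
    groups -- list of (start, stop) column ranges, one per categorical feature
    """
    snapped = []
    for row in rows:
        new_row = list(row)
        for start, stop in groups:
            block = row[start:stop]
            # Keep the category with the largest value; ties go to the first.
            winner = start + block.index(max(block))
            for col in range(start, stop):
                new_row[col] = 1.0 if col == winner else 0.0
        snapped.append(new_row)
    return snapped


# Made-up oversampled rows: column 0 is continuous, columns 1-3 are the
# one-hot block of a single 3-category feature.
oversampled = [
    [0.42, 0.7, 0.3, 0.0],
    [1.30, -0.1, 0.25, 0.85],
]
crisp = snap_one_hot(oversampled, groups=[(1, 4)])
print(crisp)
```

Columns outside the listed groups are left untouched, so continuous features keep their oversampled values.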

@eponraj27392
Author

Since I found SMOTENC in the imbalanced-learn library, which can take the categorical feature indices as input, I thought this library might also have some parameter for specifying the categorical features.

@gykovacs
Member

SMOTENC is just a hack to apply SMOTE to categorical data. If you encode your categorical features by one-hot encoding and standardize the continuous features to have standard deviation 1, vanilla SMOTE and all other SMOTE variants (including DEAGO) will operate in the same metric space as SMOTENC. So there is no need for special arguments to pass categorical features; you just need to encode them properly.
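
The preprocessing described above could be sketched as follows, using only the standard library. The `one_hot` and `standardize` helpers and the toy columns are hypothetical, and the sketch assumes the continuous column is not constant (otherwise its standard deviation is 0).

```python
import statistics


def one_hot(values):
    """One-hot encode a categorical column; returns (rows, categories)."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return rows, categories


def standardize(values):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Toy dataset: one categorical and one continuous feature (made-up values).
colors = ["red", "green", "red", "blue"]
ages = [20.0, 30.0, 40.0, 50.0]

encoded, categories = one_hot(colors)
scaled = standardize(ages)

# Final feature matrix: scaled continuous column followed by the one-hot block.
X = [[s] + e for s, e in zip(scaled, encoded)]
print(categories)  # ['blue', 'green', 'red']
```

A matrix built like `X` is what would then be passed to the oversampler in place of the raw data.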
