When using SOMO, why did the two classes not reach a balance, and why did the sample counts not change? #39

Open
leaphan opened this issue Apr 23, 2021 · 2 comments

Comments


leaphan commented Apr 23, 2021

No description provided.

Member

gykovacs commented Apr 23, 2021

There can be multiple reasons for that. In many cases the authors of a particular SMOTE variant did not cover all the possible corner cases, for example,

  1. all minority samples are treated as noise according to the noise definition of the technique,
  2. the method wants to work with, say, 5 nearest neighbors, but there are only 3 minority samples,
  3. mathematical techniques, like self-organizing maps, do not converge,
  4. etc.,

all of which occur because the nature of the data is not compatible with the parameter settings and assumptions of the SMOTE variant.
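
Corner case 2 can be checked before running an oversampler. The following is a minimal sketch and not part of smote_variants: `neighbors_feasible` is a hypothetical helper, and whether a sample counts as its own neighbor depends on the implementation, so the `n_neighbors + 1` threshold is an assumption.

```python
from collections import Counter


def neighbors_feasible(y, n_neighbors=5):
    """Return (minority_label, minority_count, feasible).

    Hypothetical helper: assumes each minority sample needs `n_neighbors`
    *other* minority samples, so at least n_neighbors + 1 are required.
    """
    counts = Counter(y)
    # The minority class is the label with the smallest count.
    minority_label, minority_count = min(counts.items(), key=lambda kv: kv[1])
    return minority_label, minority_count, minority_count >= n_neighbors + 1


# Only 3 minority samples: 5 nearest neighbors cannot be found.
print(neighbors_feasible([0] * 20 + [1] * 3))    # (1, 3, False)
# 40 minority samples: enough for 5 nearest neighbors.
print(neighbors_feasible([0] * 100 + [1] * 40))  # (1, 40, True)
```

A check like this only covers the neighbor-count case; noise filtering or non-converging steps would need to be diagnosed separately.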

Where I found reasonable resolutions, I implemented them. In cases where a resolution is infeasible (for example, determining the 5 closest neighbors when you have only 3 samples in a class), the data is returned unaltered, although I would expect some message in the logs if logging is enabled.
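
Because the data can come back unaltered, it is worth comparing the class counts before and after sampling rather than assuming the result is balanced. Here is a minimal sketch, assuming only that the oversampler exposes a `sample(X, y)` method and that the library logs through Python's standard `logging` module (an assumption; the logger setup may differ between versions). `sample_and_check` and `IdentitySampler` are hypothetical names used for illustration.

```python
import logging
from collections import Counter

# Assumption: the library emits messages through the standard logging module,
# so lowering the root level to INFO makes such messages visible.
logging.basicConfig(level=logging.INFO)


def sample_and_check(oversampler, X, y):
    """Run oversampler.sample(X, y) and report whether class counts changed."""
    before = Counter(y)
    X_samp, y_samp = oversampler.sample(X, y)
    after = Counter(y_samp)
    changed = before != after
    if not changed:
        logging.warning("Oversampling left the class counts unchanged: %s",
                        dict(after))
    return X_samp, y_samp, changed


class IdentitySampler:
    """Hypothetical stand-in that mimics the 'returned unaltered' behavior."""

    def sample(self, X, y):
        return X, y


X = [[0.0], [0.1], [0.2], [5.0]]
y = [0, 0, 0, 1]
_, _, changed = sample_and_check(IdentitySampler(), X, y)
print(changed)  # False
```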

Most likely your data is a corner case of the SOMO implementation with the parameters you used. Adjusting the parameters might lead to a properly operating SOMO.
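
One way to adjust the parameters systematically is to try a few candidate settings and keep the first one that actually changes the class counts. This is a sketch under stated assumptions: `first_effective_params` is a hypothetical helper, `make_oversampler` is any callable that builds an object with a `sample(X, y)` method, and the parameter name `n_grid` in the demo is an assumption that should be checked against the SOMO signature of the installed version. `ToySampler` is a stand-in with an arbitrary rule so the example runs without the library.

```python
from collections import Counter


def first_effective_params(make_oversampler, candidates, X, y):
    """Return the first parameter dict whose oversampler changes the class
    counts, or None if every candidate returns the data unaltered."""
    before = Counter(y)
    for params in candidates:
        _, y_samp = make_oversampler(**params).sample(X, y)
        if Counter(y_samp) != before:
            return params
    return None


class ToySampler:
    """Hypothetical stand-in: generates a sample only for small grids."""

    def __init__(self, n_grid=10):
        self.n_grid = n_grid

    def sample(self, X, y):
        if self.n_grid > 3:              # mimic a corner case: nothing generated
            return X, y
        return X + [X[-1]], y + [y[-1]]  # duplicate the last sample


X = [[0.0], [0.1], [0.2], [5.0]]
y = [0, 0, 0, 1]
candidates = [{"n_grid": 10}, {"n_grid": 5}, {"n_grid": 2}]
print(first_effective_params(ToySampler, candidates, X, y))  # {'n_grid': 2}
```

With the real library, `make_oversampler` could be the oversampler class itself, provided the candidate keys match its actual parameters.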

Also, if you share a minimal working example, I can look into it.

Author

leaphan commented Apr 25, 2021

Thanks for your reply. I wrote code like this:

# Install first (run in a shell, not in Python):
#   pip install -U imbalanced-learn
#   pip install smote-variants
import numpy as np
import smote_variants as sv
from imblearn.datasets import fetch_datasets

# Load the imbalanced 'oil' dataset
datasets = fetch_datasets(filter_data=['oil'])
X, y = datasets['oil']['data'], datasets['oil']['target']

# Class counts before oversampling
for label, count in zip(*np.unique(y, return_counts=True)):
    print('Class {} has {} instances'.format(label, count))

# Oversample with SOMO using its default parameters
oversampler = sv.SOMO()
X_samp, y_samp = oversampler.sample(X, y)

# Class counts after oversampling
for label, count in zip(*np.unique(y_samp, return_counts=True)):
    print('Class {} has {} instances after oversampling'.format(label, count))
print(X_samp, y_samp)

And the printed result:
Class -1 has 896 instances
Class 1 has 41 instances
Class -1 has 896 instances after oversampling
Class 1 has 41 instances after oversampling
After oversampling, there is no change in the number of samples in either class.
