
Imbalanced Classes, Resampling Techniques, Filling Null Values, Date/Time and Alphanumeric Data


connectkishan1/feature-engineering


1. Resampling Imbalanced Data

In practice, you will encounter imbalanced data more often than not. This does not necessarily have to be a problem if your target only has a slight imbalance.

You could then address it by using appropriate validation metrics for the data, such as balanced accuracy, precision-recall curves, or the F1-score.

Unfortunately, this is not always the case and your target variable might be highly imbalanced (e.g., 10:1).

In that case, consider the SMOTE algorithm, or one of the other approaches among these 6 effective ways to handle imbalanced classes:

1. Up-sample Minority Class

  • You can read about it Here
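A minimal sketch of up-sampling with sklearn.utils.resample, using a hypothetical toy DataFrame (the feature and label columns are assumptions for illustration, not from this repo):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 90 majority-class rows (0) vs 10 minority-class rows (1).
df = pd.DataFrame({'feature': range(100),
                   'label': [0] * 90 + [1] * 10})

df_majority = df[df['label'] == 0]
df_minority = df[df['label'] == 1]

# Sample the minority class WITH replacement until it matches the majority size.
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled['label'].value_counts())  # 0: 90, 1: 90
```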

2. Down-sample Majority Class

  • You can read about it Here
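The same toy setup, down-sampling instead (again a sketch, not code from this repo):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'feature': range(100),
                   'label': [0] * 90 + [1] * 10})

df_majority = df[df['label'] == 0]
df_minority = df[df['label'] == 1]

# Sample the majority class WITHOUT replacement down to the minority size.
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print(df_downsampled['label'].value_counts())  # 0: 10, 1: 10
```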

3. Change Your Performance Metric

  • We recommend the Area Under the ROC Curve (AUROC). You can read more about it Here
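A minimal sketch computing AUROC, alongside the balanced accuracy and F1-score mentioned earlier, on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic ~10:1 imbalanced binary problem.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# AUROC is computed from predicted probabilities, not hard labels.
y_proba = model.predict_proba(X_test)[:, 1]
print('AUROC:            ', roc_auc_score(y_test, y_proba))
print('Balanced accuracy:', balanced_accuracy_score(y_test, y_pred))
print('F1-score:         ', f1_score(y_test, y_pred))
```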

4. Penalize Algorithms (Cost-Sensitive Training)

  • This increases the cost of classification mistakes on the minority class. A popular algorithm for this technique is the penalized SVM.
  • model = SVC(kernel='linear', class_weight='balanced', probability=True) # penalized / balanced
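A runnable version of the snippet above, assuming synthetic data from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# class_weight='balanced' weights each class inversely to its frequency,
# so misclassifying a minority-class sample costs more during training.
model = SVC(kernel='linear', class_weight='balanced', probability=True)
model.fit(X, y)
```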

5. Use Tree-Based Algorithms

  • Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.
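A possible sketch with a random forest on synthetic imbalanced data (the class_weight setting is an optional extra, not part of the original text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# A random forest; class_weight='balanced' additionally reweights the classes.
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                             random_state=42)
print(cross_val_score(clf, X, y, scoring='roc_auc', cv=5).mean())
```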

6. SMOTE Method

  • SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for addressing class imbalance; it increases the number of samples in the minority class.
  • It generates new samples by looking at the feature space of the target class and detecting nearest neighbors. It then selects similar samples and randomly changes one feature at a time within the feature space of the neighboring samples.
  • The module that implements SMOTE can be found in the imbalanced-learn package. You can simply import the package and apply fit_resample:

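The original code sample here was an image; below is a minimal reconstruction on synthetic data, assuming the standard imbalanced-learn API:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))   # roughly 9:1

# fit_resample returns a new dataset with the minority class oversampled.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print('After: ', Counter(y_resampled))  # balanced
```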

  • As you can see, SMOTE successfully oversampled the minority class. There are several strategies (values of the sampling_strategy parameter) that you can use when oversampling with SMOTE:
  1. 'minority': resample only the minority class;
  2. 'not minority': resample all classes but the minority class;
  3. 'not majority': resample all classes but the majority class;
  4. 'all': resample all classes;
  • When a dict is passed, the keys correspond to the targeted classes and the values correspond to the desired number of samples for each targeted class.

  • I chose to use a dictionary to specify the extent to which I wanted to oversample my data.
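A sketch of the dictionary form, assuming we want class 1 oversampled to 400 samples (the number is arbitrary, for illustration):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))

# Oversample class 1 up to exactly 400 samples rather than to full balance.
smote = SMOTE(sampling_strategy={1: 400}, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print('After: ', Counter(y_res))
```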

  • Additional tip 1: If you have categorical variables in your dataset, SMOTE is likely to create values for those variables that cannot occur. For example, if you have a variable called isMale, which can only take the values 0 or 1, SMOTE might create 0.365 as a value.

  • Instead, you can use SMOTENC, which takes the nature of categorical variables into account. This version is also available in the imbalanced-learn package.
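A minimal SMOTENC sketch on hypothetical toy data with one continuous and one binary column:

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy data: column 0 is continuous, column 1 is a binary category (isMale).
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=100), rng.integers(0, 2, size=100)])
y = np.array([0] * 90 + [1] * 10)

# categorical_features marks the columns SMOTENC must not interpolate;
# synthetic rows keep valid category values (0 or 1), never e.g. 0.365.
smote_nc = SMOTENC(categorical_features=[1], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print(set(X_res[:, 1]))  # {0.0, 1.0}
```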

  • Additional tip 2: Make sure to oversample after creating the train/test split so that you only oversample the train data. You typically do not want to test your model on synthetic data.
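A sketch of the recommended order (split first, then oversample only the training data):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split first, then oversample ONLY the training fold; the test set stays
# free of synthetic samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```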
