
Imbalanced Classes, Resampling Techniques, Filling Null Values, Date/Time and Alphanumeric Data


connectkishan1/feature-engineering


1. Resampling Imbalanced Data

In practice, you will encounter imbalanced data more often than not. This does not necessarily have to be a problem if your target only has a slight imbalance.

You could then address it by using appropriate validation metrics for the data, such as balanced accuracy, precision-recall curves, or the F1-score.

Unfortunately, this is not always the case and your target variable might be highly imbalanced (e.g., 10:1).

In that case, consider the SMOTE algorithm, or one of the other approaches among these 6 effective ways to handle imbalanced classes:

1. Up-sample Minority Class

  • You can read about it Here
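A minimal sketch of up-sampling with sklearn.utils.resample, using a hypothetical toy DataFrame (the feature and label columns are assumptions for illustration, not from this repo):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced data: 90 majority-class rows (0) vs 10 minority-class rows (1).
df = pd.DataFrame({'feature': range(100),
                   'label': [0] * 90 + [1] * 10})

df_majority = df[df['label'] == 0]
df_minority = df[df['label'] == 1]

# Sample the minority class WITH replacement until it matches the majority size.
df_minority_upsampled = resample(df_minority,
                                 replace=True,
                                 n_samples=len(df_majority),
                                 random_state=42)

df_upsampled = pd.concat([df_majority, df_minority_upsampled])
print(df_upsampled['label'].value_counts())  # 0: 90, 1: 90
```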

2. Down-sample Majority Class

  • You can read about it Here
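The same toy setup, down-sampling instead (again a sketch, not code from this repo):

```python
import pandas as pd
from sklearn.utils import resample

df = pd.DataFrame({'feature': range(100),
                   'label': [0] * 90 + [1] * 10})

df_majority = df[df['label'] == 0]
df_minority = df[df['label'] == 1]

# Sample the majority class WITHOUT replacement down to the minority size.
df_majority_downsampled = resample(df_majority,
                                   replace=False,
                                   n_samples=len(df_minority),
                                   random_state=42)

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
print(df_downsampled['label'].value_counts())  # 0: 10, 1: 10
```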

3. Change Your Performance Metric

  • We recommend the Area Under the ROC Curve (AUROC). You can read more about it Here
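A minimal sketch computing AUROC, alongside the balanced accuracy and F1-score mentioned earlier, on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic ~10:1 imbalanced binary problem.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# AUROC is computed from predicted probabilities, not hard labels.
y_proba = model.predict_proba(X_test)[:, 1]
print('AUROC:            ', roc_auc_score(y_test, y_proba))
print('Balanced accuracy:', balanced_accuracy_score(y_test, y_pred))
print('F1-score:         ', f1_score(y_test, y_pred))
```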

4. Penalize Algorithms (Cost-Sensitive Training)

  • This increases the cost of classification mistakes on the minority class. A popular algorithm for this technique is the penalized SVM.
  • model = SVC(kernel='linear', class_weight='balanced', probability=True) # penalized / balanced
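A runnable version of the snippet above, assuming synthetic data from make_classification:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)

# class_weight='balanced' weights each class inversely to its frequency,
# so misclassifying a minority-class sample costs more during training.
model = SVC(kernel='linear', class_weight='balanced', probability=True)
model.fit(X, y)
```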

5. Use Tree-Based Algorithms

  • Decision trees often perform well on imbalanced datasets because their hierarchical structure allows them to learn signals from both classes.
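A possible sketch with a random forest on synthetic imbalanced data (the class_weight setting is an optional extra, not part of the original text):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# A random forest; class_weight='balanced' additionally reweights the classes.
clf = RandomForestClassifier(n_estimators=200, class_weight='balanced',
                             random_state=42)
print(cross_val_score(clf, X, y, scoring='roc_auc', cv=5).mean())
```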

6. SMOTE Method

  • SMOTE (Synthetic Minority Oversampling Technique) is one of the most commonly used oversampling methods for addressing class imbalance; it increases the number of samples in the minority class.
  • It generates new samples by looking at the feature space of the target class and detecting nearest neighbors. It then selects similar samples and randomly changes one feature at a time within the feature space of the neighboring samples.
  • The module that implements SMOTE can be found in the imbalanced-learn package. You can simply import the package and apply fit_resample:

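The original code sample here was an image; below is a minimal reconstruction on synthetic data, assuming the standard imbalanced-learn API:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))   # roughly 9:1

# fit_resample returns a new dataset with the minority class oversampled.
X_resampled, y_resampled = SMOTE(random_state=42).fit_resample(X, y)
print('After: ', Counter(y_resampled))  # balanced
```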

  • As you can see, SMOTE successfully oversampled the minority class. There are several strategies (values of the sampling_strategy parameter) that you can use when oversampling with SMOTE:
  1. 'minority': resample only the minority class;
  2. 'not minority': resample all classes but the minority class;
  3. 'not majority': resample all classes but the majority class;
  4. 'all': resample all classes;
  • When a dict is passed, the keys correspond to the targeted classes and the values correspond to the desired number of samples for each targeted class.

  • I chose to use a dictionary to specify the extent to which I wanted to oversample my data.
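A sketch of the dictionary form, assuming we want class 1 oversampled to 400 samples (the number is arbitrary, for illustration):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))

# Oversample class 1 up to exactly 400 samples rather than to full balance.
smote = SMOTE(sampling_strategy={1: 400}, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print('After: ', Counter(y_res))
```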

  • Additional tip 1: If you have categorical variables in your dataset, SMOTE is likely to create values for those variables that cannot occur. For example, if you have a variable called isMale, which can only take the values 0 or 1, SMOTE might create 0.365 as a value.

  • Instead, you can use SMOTENC, which takes the nature of categorical variables into account. This version is also available in the imbalanced-learn package.
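A minimal SMOTENC sketch on hypothetical toy data with one continuous and one binary column:

```python
import numpy as np
from imblearn.over_sampling import SMOTENC

# Toy data: column 0 is continuous, column 1 is a binary category (isMale).
rng = np.random.default_rng(42)
X = np.column_stack([rng.normal(size=100), rng.integers(0, 2, size=100)])
y = np.array([0] * 90 + [1] * 10)

# categorical_features marks the columns SMOTENC must not interpolate;
# synthetic rows keep valid category values (0 or 1), never e.g. 0.365.
smote_nc = SMOTENC(categorical_features=[1], random_state=42)
X_res, y_res = smote_nc.fit_resample(X, y)
print(set(X_res[:, 1]))  # {0.0, 1.0}
```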

  • Additional tip 2: Make sure to oversample after creating the train/test split so that you only oversample the train data. You typically do not want to test your model on synthetic data.
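A sketch of the recommended order (split first, then oversample only the training data):

```python
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Split first, then oversample ONLY the training fold; the test set stays
# free of synthetic samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=42)
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
```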
