
Integration with imbalanced-learn #14

Open
glemaitre opened this issue Nov 11, 2019 · 10 comments

@glemaitre

@gykovacs I was wondering if you would be interested in integrating some of the algorithms into imbalanced-learn. It would be really nice to have more variants in imbalanced-learn and to actually use your benchmark to get a better idea of what to include.

I was also wondering whether it would make sense to compare other methods (e.g. under-sampling) to get a big picture of what actually works globally.
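
A minimal sketch of the kind of comparison meant here, using imbalanced-learn's documented SMOTE and RandomUnderSampler inside a pipeline; the synthetic dataset and the logistic regression classifier are placeholders chosen only for illustration:

```python
# Hedged sketch: compare one over-sampler against one under-sampler on a
# synthetic imbalanced dataset. SMOTE, RandomUnderSampler and make_pipeline
# are documented imbalanced-learn APIs; the dataset and classifier are
# placeholders for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

for sampler in (SMOTE(random_state=0), RandomUnderSampler(random_state=0)):
    pipe = make_pipeline(sampler, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5)
    print(type(sampler).__name__, scores.mean())
```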

@gykovacs
Member

Hi @glemaitre, absolutely, I had been planning to contact you about this for a long time, but you were faster.

I think imbalanced-learn is a fairly mature package, so we definitely shouldn't make smote-variants a dependency of imbalanced-learn; rather, we should select some techniques and translate the code or reimplement them following the super high-quality standards of imbalanced-learn. In my benchmarking, I have arrived at 6 methods which finish in the top 3 places on various types of datasets, and I think these 6 should prove useful in various applications: polynom-fit-SMOTE, ProWSyn, SMOTE-IPF, Lee, SMOBD, G-SMOTE. Alternatively, shooting for the top 3, we could go for polynom-fit-SMOTE, ProWSyn and SMOTE-IPF.
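
For reference, trying the shortlisted techniques through smote-variants could look roughly like the sketch below; the `sample(X, y)` call follows the package's documented usage, but the exact class spellings should be checked against the installed version:

```python
# Rough sketch of exercising the shortlisted techniques via smote-variants.
# The sample(X, y) call follows the package's documented usage; the exact
# class spellings (polynom_fit_SMOTE, ProWSyn, SMOTE_IPF) should be checked
# against the installed version.
import smote_variants as sv
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for oversampler in (sv.polynom_fit_SMOTE(), sv.ProWSyn(), sv.SMOTE_IPF()):
    X_samp, y_samp = oversampler.sample(X, y)
    print(type(oversampler).__name__, X_samp.shape)
```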

I absolutely agree with benchmarking the other techniques, too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick something like this off very quickly.

I was also thinking about creating some sort of a "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques, providing a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.
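
A purely hypothetical sketch of what such a standardized "super-wrapper" interface might look like; every name below is invented for illustration and does not refer to an existing package:

```python
# Purely hypothetical sketch of a "super-wrapper" interface; every name below
# is invented for illustration and does not refer to an existing package.
from abc import ABC, abstractmethod


class ImbalanceStrategy(ABC):
    """Common interface over oversampling, ensemble and cost-sensitive approaches."""

    @abstractmethod
    def fit_resample(self, X, y):
        """Return a (possibly resampled) training set; identity for cost-sensitive methods."""

    @abstractmethod
    def wrap_classifier(self, estimator):
        """Return an estimator adjusted for the strategy (class weights, ensembling, ...)."""


class OversamplingStrategy(ImbalanceStrategy):
    """Wraps any sampler exposing the imbalanced-learn fit_resample convention."""

    def __init__(self, sampler):
        self.sampler = sampler

    def fit_resample(self, X, y):
        return self.sampler.fit_resample(X, y)

    def wrap_classifier(self, estimator):
        return estimator  # data-level change only; the classifier is untouched
```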

Any comments are welcome!

@glemaitre
Author

We are absolutely on the same page.

I have arrived at 6 methods which finish in the top 3 places on various types of datasets

I think that this is the way to go.

On our side, I think that we can become more conservative about including new SMOTE variants. We can first implement them in smote_variants, if not already present, and use the benchmark to decide on inclusion. It will help us a lot on the documentation side, justifying the included models and the way they work. We can always refer to smote_variants for people who want to try more exotic versions.

I absolutely agree with benchmarking the other techniques, too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick something like this off very quickly.

It has always been an objective of @chkoar and myself, but we have lacked the bandwidth lately. Reusing some infrastructure would be really useful.

I was also thinking about creating some sort of a "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques, providing a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.

This would need to be discussed in more detail, but it could be one way to go.

Regarding cost-sensitive methods, we were thinking about including some. In a way, we thought of using imbalanced-learn 1.0.0 as a trigger to reorganise the modules to take the different approaches into account.

@gykovacs
Member

Great! In order to improve the benchmarking, I will try to set up some sort of fully reproducible auto-benchmarking system as a CI/CD job. I feel this would be the right way to keep the evaluation transparent and fully reproducible. I also think that, this way, smote-variants can do a good job as an experimentation sandbox behind imblearn.
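
A hedged sketch of the shape such a reproducible benchmark entry point could take when driven by a CI/CD job; the dataset names, `run_benchmark` and its arguments are placeholders, and the samplers are assumed to follow the imbalanced-learn `fit_resample` convention - the only real point is the pinned seeds and fixed CV splits:

```python
# Illustrative sketch of a reproducible benchmark entry point that a CI/CD job
# could call. The point is determinism: pinned datasets, fixed seeds and fixed
# CV splits. run_benchmark, load_dataset and the dataset names are placeholders;
# the samplers are assumed to follow the imbalanced-learn fit_resample convention.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

RANDOM_STATE = 42
DATASETS = ["ecoli1", "glass0", "yeast3"]  # pinned, small example names


def run_benchmark(load_dataset, samplers, classifier_factory):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=RANDOM_STATE)
    results = {}
    for name in DATASETS:
        X, y = load_dataset(name)
        for sampler in samplers:
            scores = []
            for train, test in cv.split(X, y):
                X_res, y_res = sampler.fit_resample(X[train], y[train])
                clf = classifier_factory().fit(X_res, y_res)
                scores.append(clf.score(X[test], y[test]))
            results[(name, type(sampler).__name__)] = float(np.mean(scores))
    return results
```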

@glemaitre
Author

Regarding a continuous benchmark, it is really what I had in mind: scikit-learn-contrib/imbalanced-learn#646 (comment)
@chkoar is more interested in implementing all possible methods and letting the user choose. I would at first prefer to reduce the number of samplers. I would consider the first option valid only if we have a good continuous benchmark running and strong documentation referring to it.

How many resources does your benchmark require? How long does it take to run the experiment?

@gykovacs
Member

gykovacs commented Nov 18, 2019

Well, the experiment I ran and describe in the paper took something like 3 weeks on a 32-core AWS instance, involving 85 methods with 35 different parameter settings, 4 classifiers on top of that with 6 different parameter settings for each, and a repeated k-fold cross-validation with 5 splits and 3 repeats, all of that over 104 datasets.

EDIT:
Training the classifiers on top of the various oversampling strategies takes 80% of the time.

That's clearly too much computational work, but the majority of it was caused by 5-10 "large" datasets and 3-5 very slow, evolutionary oversampling techniques. I think that

  1. reducing the 35 parameter settings to, say, 15,
  2. reducing the classifier parameter combinations to about 3-4,
  3. reducing the datasets to 60-70 small ones,
  4. reducing the number of repeats in the repeated k-fold cross-validation,
  5. and setting a reasonable timeout for each method

could reduce the work to a couple of hours on a 32-64 core instance.
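
A back-of-the-envelope estimate of that reduction, treating every configuration as equally expensive (which, as noted above, it is not) and assuming the repeats drop from 3 to 2:

```python
# Back-of-the-envelope estimate of the reduction suggested above, treating
# every configuration as equally expensive (a simplification: a few large
# datasets and slow evolutionary methods actually dominated the 3 weeks).
full = 85 * 35 * 4 * 6 * (5 * 3) * 104    # methods x params x clfs x clf params x CV folds x datasets
reduced = 85 * 15 * 4 * 3 * (5 * 2) * 65  # 15 params, 3 clf params, 2 repeats (assumed), 65 datasets
print(full / reduced)                     # ~11x fewer configurations, before any per-method timeout
```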

@chkoar

chkoar commented Nov 18, 2019

@glemaitre @gykovacs IMHO, the methods that we have to implement or include in imblearn and which method the user will pick are completely unrelated things. We already know that plain SMOTE will do the job. But, since we have the no free lunch theorem, I believe that we should not care about which is the best to include. We could prioritize by the number of citations (I do not want to set a threshold) or something else. For me, we need a benchmark just for the timings, and we should commit to that. imblearn should have the fastest and most accurate (as described in the papers) implementations. That's my two cents.
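
A minimal sketch of such a timing-only benchmark; SMOTE and BorderlineSMOTE are documented imbalanced-learn samplers, and in practice the loop would cover every candidate implementation and several dataset sizes:

```python
# Minimal timing-only benchmark sketch: measure how long each sampler takes to
# resample one dataset. SMOTE and BorderlineSMOTE are documented imbalanced-learn
# samplers; in practice the loop would cover every candidate implementation.
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

for sampler in (SMOTE(random_state=0), BorderlineSMOTE(random_state=0)):
    start = timer()
    sampler.fit_resample(X, y)
    print(type(sampler).__name__, f"{timer() - start:.3f}s")
```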

@gykovacs
Member

@chkoar If we target well-described and established methods (which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

It seems the question is whether we believe the outcome of a reasonable benchmark. I think it might make sense, as the methods users look for should perform well on the "smooth" problems related to real classification datasets, and this might be captured by a benchmark dataset.

One more remark from my experience: usually the less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, very short conference papers.

@chkoar

chkoar commented Nov 18, 2019

@chkoar If we target well-described and established methods (which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

As I said, I wasn't talking about inclusion but about prioritization, so we will not have a bunch of methods initially, which is @glemaitre's concern if I understood correctly.

One more remark from my experience: usually the less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, very short conference papers.

I totally agree. That's why I see no reason to exclude a method from imblearn and keep only the top (most cited, best performing across classifiers, etc.) ones. As you said, there will always be a case where a specific over-sampler performs well.

If that were the case, the main scikit-learn package would have only 5 methods. That's my other two cents.

@gykovacs
Member

I did some experimentation with CircleCI; it doesn't seem to be suitable for automated benchmarking on the community subscription plan: it is too much of a workload even if only one relatively small dataset is used.

I have also become concerned about my previous idea of using CI/CD for benchmarking. Instead, I can imagine a standalone benchmarking solution which can be installed on any machine, checks out packages and datasets that provide some quasi-standard interfaces for benchmarking, runs experiments where code has changed, and publishes the results on a local web server.

Maintaining the solution and linking something like this from any documentation page doesn't seem to be a burden, yet the solution is flexible and can easily be moved between cloud providers when needed.

I think my company could even finance an instance like this. The main difference compared to CI/CD is that it would run the benchmarking regularly, not on pull requests or any other hooks.
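
An illustrative sketch of the "run experiments where code has changed" part; every name here is a placeholder rather than an existing tool, and comparing installed package versions against a cached manifest is just one possible change-detection strategy:

```python
# Illustrative sketch of the "run experiments where code has changed" part:
# re-run a package's benchmarks only when its installed version differs from a
# cached manifest. The package list and manifest path are placeholders, and
# version comparison is just one possible change-detection strategy.
import json
from importlib.metadata import version
from pathlib import Path

PACKAGES = ["imbalanced-learn", "smote-variants"]
MANIFEST = Path("benchmark_manifest.json")


def packages_to_rerun():
    current = {pkg: version(pkg) for pkg in PACKAGES}
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = [pkg for pkg, ver in current.items() if previous.get(pkg) != ver]
    MANIFEST.write_text(json.dumps(current, indent=2))
    return changed
```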

Any comments are welcome! Do you have experience with, or anything particular in mind regarding, a proper benchmarking solution?

@zoj613

zoj613 commented Aug 4, 2021

@gykovacs Would you be interested in testing your benchmarks on the newer LoRAS and ProWRAS implementations I wrote here: https://github.com/zoj613/pyloras ? I do not think they are implemented in either of the 2 packages.

They do seem promising, at least to my untrained eye.
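
A hedged sketch of how such a test could be wired up, assuming pyloras exposes LORAS and ProWRAS classes with an imbalanced-learn-style fit_resample interface, as its README suggests; the import names should be checked against the installed version:

```python
# Hedged sketch, assuming pyloras exposes LORAS and ProWRAS classes with an
# imbalanced-learn-style fit_resample interface, as its README suggests; the
# import names should be checked against the installed version.
from pyloras import LORAS, ProWRAS
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for sampler in (LORAS(), ProWRAS()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, X_res.shape)
```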
