
Integration with imbalanced-learn #14

Open
glemaitre opened this issue Nov 11, 2019 · 10 comments

@glemaitre

@gykovacs I was wondering if you would be interested in integrating some of the algorithms into imbalanced-learn. It would be really nice to have more variants in imbalanced-learn and to actually use your benchmark to get a better idea of what to include.

I was also wondering whether it would make sense to compare other methods (e.g. under-sampling) to get a big picture of what actually works globally.
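
A minimal sketch of the kind of comparison meant here, using imbalanced-learn's documented SMOTE and RandomUnderSampler inside a pipeline; the synthetic dataset and the logistic regression classifier are placeholders chosen only for illustration:

```python
# Hedged sketch: compare one over-sampler against one under-sampler on a
# synthetic imbalanced dataset. SMOTE, RandomUnderSampler and make_pipeline
# are documented imbalanced-learn APIs; the dataset and classifier are
# placeholders for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import make_pipeline

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

for sampler in (SMOTE(random_state=0), RandomUnderSampler(random_state=0)):
    pipe = make_pipeline(sampler, LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5)
    print(type(sampler).__name__, scores.mean())
```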

@gykovacs
Member

Hi @glemaitre, absolutely, I had been planning to contact you about this for a long time, but you were faster.

I think imbalanced-learn is a fairly mature package, so we definitely shouldn't make smote-variants a dependency of imbalanced-learn; rather, we should select some techniques and translate the code or reimplement them following the super high-quality standards of imbalanced-learn. In my benchmarking, I have arrived at 6 methods which finish in the top 3 places on various types of datasets, and I think these 6 should prove useful in various applications: polynom-fit-SMOTE, ProWSyn, SMOTE-IPF, Lee, SMOBD, G-SMOTE. Alternatively, shooting for the top 3, we could go for polynom-fit-SMOTE, ProWSyn and SMOTE-IPF.
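
For reference, trying the shortlisted techniques through smote-variants could look roughly like the sketch below; the `sample(X, y)` call follows the package's documented usage, but the exact class spellings should be checked against the installed version:

```python
# Rough sketch of exercising the shortlisted techniques via smote-variants.
# The sample(X, y) call follows the package's documented usage; the exact
# class spellings (polynom_fit_SMOTE, ProWSyn, SMOTE_IPF) should be checked
# against the installed version.
import smote_variants as sv
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for oversampler in (sv.polynom_fit_SMOTE(), sv.ProWSyn(), sv.SMOTE_IPF()):
    X_samp, y_samp = oversampler.sample(X, y)
    print(type(oversampler).__name__, X_samp.shape)
```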

I absolutely agree with benchmarking the other techniques, too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick something like this off very quickly.

I was also thinking about creating some sort of a "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques, providing a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.
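
A purely hypothetical sketch of what such a standardized "super-wrapper" interface might look like; every name below is invented for illustration and does not refer to an existing package:

```python
# Purely hypothetical sketch of a "super-wrapper" interface; every name below
# is invented for illustration and does not refer to an existing package.
from abc import ABC, abstractmethod


class ImbalanceStrategy(ABC):
    """Common interface over oversampling, ensemble and cost-sensitive approaches."""

    @abstractmethod
    def fit_resample(self, X, y):
        """Return a (possibly resampled) training set; identity for cost-sensitive methods."""

    @abstractmethod
    def wrap_classifier(self, estimator):
        """Return an estimator adjusted for the strategy (class weights, ensembling, ...)."""


class OversamplingStrategy(ImbalanceStrategy):
    """Wraps any sampler exposing the imbalanced-learn fit_resample convention."""

    def __init__(self, sampler):
        self.sampler = sampler

    def fit_resample(self, X, y):
        return self.sampler.fit_resample(X, y)

    def wrap_classifier(self, estimator):
        return estimator  # data-level change only; the classifier is untouched
```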

Any comments are welcome!

@glemaitre
Author

We are absolutely on the same page.

I have arrived at 6 methods which finish in the top 3 places on various types of datasets

I think that this is the way to go.

On our side, I think that we can become more conservative about including new SMOTE variants. We can first implement them in smote_variants, if not already present, and use the benchmark to decide on inclusion. It will help us a lot on the documentation side, justifying the included models and the way they work. We can always refer to smote_variants for people who want to try more exotic versions.

I absolutely agree with benchmarking the other techniques, too; honestly, this would have been my next project on this topic. I can refine and generalize the evaluation framework quickly. I think we should select the scope (the methods of interest) properly, and we could kick something like this off very quickly.

It has always been an objective of @chkoar and myself, but we have lacked the bandwidth lately. Reusing some infrastructure would be really useful.

I was also thinking about creating some sort of a "super-wrapper" package, which would wrap oversampling, ensemble, and cost-sensitive learning techniques, providing a somewhat standardized interface, exactly for the ease of benchmarking and experimentation. The benchmarking framework would fit this super-wrapper package pretty well.

This would need to be discussed in more detail, but it could be one way to go.

Regarding cost-sensitive methods, we were thinking about including some. In a way, we thought of using imbalanced-learn 1.0.0 as a trigger to reorganise the modules to take the different approaches into account.

@gykovacs
Member

Great! In order to improve the benchmarking, I will try to set up some sort of fully reproducible auto-benchmarking system as a CI/CD job. I feel this would be the right way to keep the evaluation transparent and fully reproducible. I also think that, this way, smote-variants can do a good job as an experimentation sandbox behind imblearn.
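
A hedged sketch of the shape such a reproducible benchmark entry point could take when driven by a CI/CD job; the dataset names, `run_benchmark` and its arguments are placeholders, and the samplers are assumed to follow the imbalanced-learn `fit_resample` convention - the only real point is the pinned seeds and fixed CV splits:

```python
# Illustrative sketch of a reproducible benchmark entry point that a CI/CD job
# could call. The point is determinism: pinned datasets, fixed seeds and fixed
# CV splits. run_benchmark, load_dataset and the dataset names are placeholders;
# the samplers are assumed to follow the imbalanced-learn fit_resample convention.
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

RANDOM_STATE = 42
DATASETS = ["ecoli1", "glass0", "yeast3"]  # pinned, small example names


def run_benchmark(load_dataset, samplers, classifier_factory):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=RANDOM_STATE)
    results = {}
    for name in DATASETS:
        X, y = load_dataset(name)
        for sampler in samplers:
            scores = []
            for train, test in cv.split(X, y):
                X_res, y_res = sampler.fit_resample(X[train], y[train])
                clf = classifier_factory().fit(X_res, y_res)
                scores.append(clf.score(X[test], y[test]))
            results[(name, type(sampler).__name__)] = float(np.mean(scores))
    return results
```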

@glemaitre
Author

Regarding a continuous benchmark, it is really what I had in mind: scikit-learn-contrib/imbalanced-learn#646 (comment)
@chkoar is more interested in implementing all possible methods and letting the user choose. I would at first prefer to reduce the number of samplers. I would consider the first option valid only if we have a good continuous benchmark running and strong documentation referring to it.

How many resources does your benchmark require? How long does it take to run the experiment?

@gykovacs
Member

gykovacs commented Nov 18, 2019

Well, the experiment I ran and describe in the paper took something like 3 weeks on a 32-core AWS instance, involving 85 methods with 35 different parameter settings, 4 classifiers on top of that with 6 different parameter settings for each, and a repeated k-fold cross-validation with 5 splits and 3 repeats, all of that over 104 datasets.

EDIT:
Training the classifiers on top of the various oversampling strategies takes 80% of the time.

That's clearly too much computational work, but the majority of it was caused by 5-10 "large" datasets and 3-5 very slow, evolutionary oversampling techniques. I think that

  1. reducing the 35 parameter settings to, say, 15,
  2. reducing the classifier parameter combinations to about 3-4,
  3. reducing the datasets to 60-70 small ones,
  4. reducing the number of repeats in the repeated k-fold cross-validation,
  5. and setting a reasonable timeout for each method

could reduce the work to a couple of hours on a 32-64 core instance.
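
A back-of-the-envelope estimate of that reduction, treating every configuration as equally expensive (which, as noted above, it is not) and assuming the repeats drop from 3 to 2:

```python
# Back-of-the-envelope estimate of the reduction suggested above, treating
# every configuration as equally expensive (a simplification: a few large
# datasets and slow evolutionary methods actually dominated the 3 weeks).
full = 85 * 35 * 4 * 6 * (5 * 3) * 104    # methods x params x clfs x clf params x CV folds x datasets
reduced = 85 * 15 * 4 * 3 * (5 * 2) * 65  # 15 params, 3 clf params, 2 repeats (assumed), 65 datasets
print(full / reduced)                     # ~11x fewer configurations, before any per-method timeout
```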

@chkoar

chkoar commented Nov 18, 2019

@glemaitre @gykovacs IMHO, the methods that we have to implement or include in imblearn and which method the user will pick are completely unrelated things. We already know that plain SMOTE will do the job. But, since we have the no free lunch theorem, I believe that we should not care about which is the best to include. We could prioritize by the number of citations (I do not want to set a threshold) or something else. For me, we need a benchmark just for the timings, and we should commit to that. imblearn should have the fastest and most accurate (as described in the papers) implementations. That's my two cents.
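
A minimal sketch of such a timing-only benchmark; SMOTE and BorderlineSMOTE are documented imbalanced-learn samplers, and in practice the loop would cover every candidate implementation and several dataset sizes:

```python
# Minimal timing-only benchmark sketch: measure how long each sampler takes to
# resample one dataset. SMOTE and BorderlineSMOTE are documented imbalanced-learn
# samplers; in practice the loop would cover every candidate implementation.
from timeit import default_timer as timer

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)

for sampler in (SMOTE(random_state=0), BorderlineSMOTE(random_state=0)):
    start = timer()
    sampler.fit_resample(X, y)
    print(type(sampler).__name__, f"{timer() - start:.3f}s")
```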

@gykovacs
Member

@chkoar If we target well-described and established methods (which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

It seems the question is whether we believe the outcome of a reasonable benchmark. I think it might make sense, as the methods users look for should perform well on the "smooth" problems related to real classification datasets, and this might be captured by a benchmark dataset.

One more remark from my experience: usually the less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, very short conference papers.

@chkoar

chkoar commented Nov 18, 2019

@chkoar If we target well-described and established methods (which appeared in highly cited journals), the number of potential techniques to include will drop to about 20-30. On the other hand, in my experience, these are typically not the best performers on average - but at the same time, "average performance" is always questionable due to the no free lunch theorem.

As I said, I wasn't talking about inclusion but about prioritization, so we will not have a bunch of methods initially, which is @glemaitre's concern if I understood correctly.

One more remark from my experience: usually the less-established, simple methods were found to be robust enough to provide acceptable performance on all datasets. These are usually described in hard-to-access, very short conference papers.

I totally agree. That's why I see no reason to exclude a method from imblearn and keep only the top (most cited, best performing across classifiers, etc.) ones. As you said, there will always be a case where a specific over-sampler performs well.

If that were the case, the main scikit-learn package would have only 5 methods. That's my other two cents.

@gykovacs
Member

I did some experimentation with CircleCI; it doesn't seem to be suitable for automated benchmarking on the community subscription plan: it is too much of a workload even if only one relatively small dataset is used.

I have also become concerned about my previous idea of using CI/CD for benchmarking. Instead, I can imagine a standalone benchmarking solution which can be installed on any machine, checks out packages and datasets that provide some quasi-standard interfaces for benchmarking, runs experiments where code has changed, and publishes the results on a local web server.

Maintaining the solution and linking something like this from any documentation page doesn't seem to be a burden, yet the solution is flexible and can easily be moved between cloud providers when needed.

I think my company could even finance an instance like this. The main difference compared to CI/CD is that it would run the benchmarking regularly, not on pull requests or any other hooks.
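
An illustrative sketch of the "run experiments where code has changed" part; every name here is a placeholder rather than an existing tool, and comparing installed package versions against a cached manifest is just one possible change-detection strategy:

```python
# Illustrative sketch of the "run experiments where code has changed" part:
# re-run a package's benchmarks only when its installed version differs from a
# cached manifest. The package list and manifest path are placeholders, and
# version comparison is just one possible change-detection strategy.
import json
from importlib.metadata import version
from pathlib import Path

PACKAGES = ["imbalanced-learn", "smote-variants"]
MANIFEST = Path("benchmark_manifest.json")


def packages_to_rerun():
    current = {pkg: version(pkg) for pkg in PACKAGES}
    previous = json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}
    changed = [pkg for pkg, ver in current.items() if previous.get(pkg) != ver]
    MANIFEST.write_text(json.dumps(current, indent=2))
    return changed
```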

Any comments are welcome! Do you have experience with, or anything particular in mind regarding, a proper benchmarking solution?

@zoj613

zoj613 commented Aug 4, 2021

@gykovacs Would you be interested in testing your benchmarks on the newer LoRAS and ProWRAS implementations I wrote here: https://github.com/zoj613/pyloras ? I do not think they are implemented in either of the 2 packages.

They do seem promising, at least to my untrained eye.
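
A hedged sketch of how such a test could be wired up, assuming pyloras exposes LORAS and ProWRAS classes with an imbalanced-learn-style fit_resample interface, as its README suggests; the import names should be checked against the installed version:

```python
# Hedged sketch, assuming pyloras exposes LORAS and ProWRAS classes with an
# imbalanced-learn-style fit_resample interface, as its README suggests; the
# import names should be checked against the installed version.
from pyloras import LORAS, ProWRAS
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

for sampler in (LORAS(), ProWRAS()):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, X_res.shape)
```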
