
automatic detection of multioutput datasets #1001

Open
jhmenke wants to merge 7 commits into development
Conversation

jhmenke
Contributor

@jhmenke jhmenke commented Jan 14, 2020

What does this PR do?

Slight variation of #903 depending on the shape of the target array.
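A minimal sketch of the shape-based detection this PR describes (the helper name is mine; TPOT's internal check may differ):

```python
import numpy as np

def is_multi_output(y):
    """Treat a 2-D target array with more than one column as multi-output."""
    y = np.asarray(y)
    return y.ndim == 2 and y.shape[1] > 1
```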

What are the relevant issues?

#971

@coveralls

Coverage Status

Coverage remained the same at 96.747% when pulling 3dd6475 on jhmenke:development into 410d88c on EpistasisLab:development.

@weixuanfu
Contributor

Neat! Thank you for this PR. I think it should work for detecting multi-output datasets. But I checked this link, and I think not all regressors in our default config support multi-output regression. I think we need a workaround for this issue.

@jhmenke
Contributor Author

jhmenke commented Jan 15, 2020

There are several ways to go about this:

  • Emit a warning if the standard regression/classification config is chosen for multi-output data, which would require the user to create their own config for multi-output datasets.
  • Maintain a list of regressors and classifiers that don't support multi-output, and remove them automatically when such a dataset is detected.
  • Wrap these regressors and classifiers in MultiOutputRegressor/MultiOutputClassifier in the default config. I'm not sure about the performance losses there.
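The second option could look roughly like this (the set contents below are placeholders for illustration, not a vetted list of unsupported estimators):

```python
# Hypothetical, hand-curated set of estimators lacking multi-output support.
SINGLE_OUTPUT_ONLY = {
    'sklearn.svm.LinearSVC',
    'sklearn.linear_model.LogisticRegression',
}

def filter_config(config_dict, multi_output_target):
    """Drop estimators known not to handle multi-output targets."""
    if not multi_output_target:
        return config_dict
    return {op: params for op, params in config_dict.items()
            if op not in SINGLE_OUTPUT_ONLY}
```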

@weixuanfu
Contributor

Thank you for your ideas. I prefer the third one, but I think a better solution based on it is to automatically wrap pipelines generated from the current default config in MultiOutputRegressor/MultiOutputClassifier when TPOT detects that y is a multi-output target. Please let me know if you have any ideas.

@jhmenke
Contributor Author

jhmenke commented Jan 20, 2020

While working on it, I think I spotted a bug:

import_hash[import_str].append(dep_op_str)

This should be dep_import_str as the key, not import_str, right? It seems this also changes some tests.
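A toy reproduction of the suspected mix-up (identifiers borrowed from the snippet above; the example values are invented):

```python
from collections import defaultdict

# Maps an import path to the operator names imported from it.
import_hash = defaultdict(list)

import_str = 'tpot.builtins'          # parent operator's import path (example)
dep_import_str = 'sklearn.ensemble'   # dependency's import path (example)
dep_op_str = 'ExtraTreesClassifier'   # dependency operator name (example)

# The original line used import_str as the key, filing the dependency under
# the wrong module; keying by dep_import_str groups it correctly.
import_hash[dep_import_str].append(dep_op_str)
```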

@jhmenke
Contributor Author

jhmenke commented Jan 21, 2020

If I run the failing tests manually, they pass. It must be some unsuitable combination due to the random seed; I'm not sure how to debug this.

@weixuanfu
Contributor

While working on it, I think I spotted a bug:

import_hash[import_str].append(dep_op_str)

This should be dep_import_str as the key, not import_str, right? It seems this also changes some tests.

Yes, that is a bug. Hmm, I thought I fixed it a while ago, but the fix may not have been merged into the master/dev branch.

@@ -501,6 +500,35 @@ def _fit_init(self):
self._last_optimized_pareto_front_n_gens = 0
self._setup_config(self.config_dict)

if multi_output_target:
Contributor

Modifying _config_dict may not work when the user supplies a customized configuration instead of the default one. So I think a practical way is to modify the _compile_to_sklearn function (here). If multi_output_target is True, then
sklearn_pipeline = MultiOutputClassifier(estimator=sklearn_pipeline) or sklearn_pipeline = MultiOutputRegressor(estimator=sklearn_pipeline). I think this may be a more general solution for multi-output datasets.

Contributor Author

This seems better. To be honest, I didn't find a good place to put my code, which is why I settled on the _fit_init function.

Contributor Author

Okay, I looked into it, but the code would be a mess. Several functions would have to take multi_output_target as a new argument (most of them in export_utils.py), since they don't have access to the data or the TPOT object.

In my opinion, _fit_init seems to be the least intrusive point to include the checks.

Contributor

Thank you for looking into it. You are right. I think TPOT's exported code should also include MultiOutputRegressor/MultiOutputClassifier, which would require a lot of changes in TPOT. I will look into it when I get some time next week.

Contributor Author

Any updates?

Contributor

@jhmenke I am sorry for overlooking this. I did not get a chance to look into this issue due to my busy schedule. I agree that TPOT needs some major changes to include MultiOutputRegressor/MultiOutputClassifier. I may get some time in March to add those changes. You are welcome to push any changes in the meantime.

Contributor Author

Could we use my PR as a temporary fix until you have time to thoroughly refactor the code? I can prepare an update with the current development branch.

Contributor

Sorry for the delay. I think we can use it as a temporary solution in a minor release.

Contributor Author

Okay, I merged the current master/development into this. It should be good to go as an interim solution then.

jhmenke and others added 3 commits May 9, 2020 14:00
Contributor

@weixuanfu weixuanfu left a comment

I think we cannot add this support for now, since it needs more changes.

if model in single_output_classifiers:
    if 'sklearn.multioutput.MultiOutputClassifier' not in self._config_dict.keys():
        self._config_dict['sklearn.multioutput.MultiOutputClassifier'] = {"estimator": {}}
    self._config_dict['sklearn.multioutput.MultiOutputClassifier']['estimator'][model] = self._config_dict[model]
Contributor

There is only one sklearn.multioutput.MultiOutputClassifier entry in self._config_dict, and its estimator dict keeps being updated up to the last model in single_output_classifiers, so the remaining single-output models should also be removed from the top level of the config.
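One reading of the review comment above, sketched as a standalone helper (the function name and structure are mine, not TPOT code): each single-output classifier is moved under the MultiOutputClassifier's estimator dict and removed from the top level, so it cannot be sampled on its own.

```python
MOC = 'sklearn.multioutput.MultiOutputClassifier'

def wrap_single_output(config_dict, single_output_classifiers):
    """Move single-output classifiers under MultiOutputClassifier's estimator dict."""
    config = dict(config_dict)
    for model in list(config):
        if model in single_output_classifiers:
            config.setdefault(MOC, {'estimator': {}})
            # Transfer the model's hyperparameter space and drop the top-level entry.
            config[MOC]['estimator'][model] = config.pop(model)
    return config
```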

@jessegmeyerlab

Is this the closest pull request to multi-output support? Any plans to finish it?

@jhmenke
Contributor Author

jhmenke commented Sep 9, 2020

Is this the closest pull to multi-output? any plans to finish it?

Sorry, no time for this right now. Feel free to work with my branch. I think the last TODO is cloning the MultiOutputClassifier for every estimator that only supports single outputs, as per weixuanfu's review.

@windowshopr

Would love to see multioutput regression/classification working with tpot! Keep going! Haha
