Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sklearn compatibility #101

Closed
wants to merge 13 commits into from
Closed

sklearn compatibility #101

wants to merge 13 commits into from

Conversation

petrovicboban
Copy link
Contributor

No description provided.

@petrovicboban petrovicboban self-assigned this Apr 21, 2023
@petrovicboban petrovicboban added enhancement New feature or request Python Python changes labels Apr 21, 2023
@petrovicboban
Copy link
Contributor Author

@edwardwliu can you figure out why test is failing here? I've tried to do it, but it goes deep into logic (categorical feature mapping).

@petrovicboban petrovicboban force-pushed the bp/sklearn branch 2 times, most recently from 70b6587 to a1b0f64 Compare April 21, 2023 19:24
@edwardwliu
Copy link
Collaborator

edwardwliu commented Apr 22, 2023

@petrovicboban The issue was find_match() wasn't identifying matching values for different np.int and np.float64 types. I added a cast, but it's possible this could be handled in a cleaner manner. The specific test appears case to be passing now.

@petrovicboban
Copy link
Contributor Author

@edwardwliu thanks! I thought the issue was in calling function, because the test was failing when arr_b was numpy array. In all other tests, it was list.

@petrovicboban
Copy link
Contributor Author

petrovicboban commented Apr 22, 2023

@edwardwliu
it seems that fit() should not change parameters which are set during __init__(), and that's exactly what we do.
Here is test failure: https://github.com/forestry-labs/Rforestry/actions/runs/4774656551/jobs/8488479086

I guess we should skip setting those parameters in __init__() and set them in fit() only, but I'm not sure about implications of that.

@edwardwliu
Copy link
Collaborator

@petrovicboban I see, let's only use the following parameters in fit() and remove them from __init__() if overlapping:

  • feature_weights
  • deep_feature_weights
  • observation_weights
  • lin_feats
  • monotonic_constraints
  • groups

@petrovicboban
Copy link
Contributor Author

@edwardwliu it looks like we need to move these too:

        if self.max_depth is None:
            self.max_depth = round(nrow / 2) + 1

        if self.interaction_depth is None:
            self.interaction_depth = self.max_depth

        if self.max_obs is None:
            self.max_obs = y.size

because they depend on argument of fit()

@edwardwliu
Copy link
Collaborator

I see, yes let's move those parameters to fit() for now as well. Conceptually, the params max_depth and interaction_depth could be agnostic to the data provided in fit(), but the implementation in this package requires this change.

@petrovicboban
Copy link
Contributor Author

petrovicboban commented Apr 25, 2023

@edwardwliu can you check current test failures? One is related to you recent change (groups) and one is from check_estimator()

@edwardwliu
Copy link
Collaborator

@petrovicboban Both of these test cases should be passing be now. The estimator test was because we do not currently support a data type scipy.sparse.csc_matrix. In addition, it may be useful to parameterize checks while I debug.

@petrovicboban petrovicboban force-pushed the bp/sklearn branch 2 times, most recently from ae7357e to 15b4b49 Compare April 26, 2023 12:41
@petrovicboban
Copy link
Contributor Author

@edwardwliu this looks good for now. It fails only on estimator's pickle check. Probably because saved_forest_ attribute is expected during the process but is added only after call to translate_tree. I've tried several solutions, but failed.

After we solve that, I need to improve __init__ parameters validation.

@github-actions
Copy link

github-actions bot commented May 22, 2023

☂️ Python Coverage

current status: ✅

Overall Coverage

Statements Covered Coverage Threshold Status
814 646 79% 60% 🟢

New Files

No new covered files...

Modified Files

File Coverage Status
Python/random_forestry/forestry.py 79% 🟢
Python/random_forestry/preprocessing.py 72% 🟢
Python/random_forestry/validators/init.py 100% 🟢
Python/random_forestry/validators/base_validator.py 91% 🟢
Python/random_forestry/validators/fit_validator.py 87% 🟢
Python/random_forestry/validators/predict_validator.py 89% 🟢
TOTAL 86% 🟢

**updated for commit: dc2031a

@petrovicboban
Copy link
Contributor Author

@petrovicboban Both of these test cases should be passing be now. The estimator test was because we do not currently support a data type scipy.sparse.csc_matrix. In addition, it may be useful to parameterize checks while I debug.

It's passing with:

    def _more_tags(self):
        return {
            "_xfail_checks": {
                "check_estimators_pickle": "To be fixed later",
                "check_n_features_in": "To be fixed later",
                "check_estimators_nan_inf": "To be fixed later",
                "check_dtype_object": "To be fixed later",
            },
        }

@petrovicboban petrovicboban marked this pull request as ready for review May 23, 2023 20:56
@petrovicboban petrovicboban force-pushed the bp/sklearn branch 5 times, most recently from d39b195 to a849644 Compare May 24, 2023 18:09
@Ilia-Shutov
Copy link
Collaborator

New PR based on this one was created #146

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Python Python changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

has_nas method defined, but not used Compatibility with sklearn estimators
3 participants