Add Random Forest Step #46

Gitiauxx · 2018-09-24T19:39:20Z

This PR adds a template for RandomForestStep.

The template creates automatically a random forest model using the sklearn libary with a fit and run The code can be found in regression.RandomForestStep.

For an example with real data, see machine_learning_steps_rental_prices

The model can be saved in the configs directory as a yaml file and a pickled file.

Usage
A random forest is created as follows:
m = RandomForestRegressionStep()

RandomForestRegressionStep takes thew following inputs:

a model expression: formula with lhs and rhs sides separated by '~'

m.model_expression = '... ~ ...'

a table registered in orca

m.tables = '...'

a model name

m.name = '....'

The main difference with previous steps is that the model, once built, cannot be saved completely in a yaml file and has to be pickled:

To save a random forest model, use the register method from modelmanager.py:
modelmanager.register(m)
which will add in orca a step with name m.name and pickle the model in a file saved in the configs directory.
To load a model that has been saved in the configs directory, use the initialize methods from the modelmanager.py:

modelmanager.initialize()

which will unpickle the model and turn it back to an object RandomForestStep.

Fitting and cross-validation
The random forest step includes some validation metrics:

feature importance to help identify main explanatory variables via the command

m.importance

cross validation: using k-fold techniques, hold out one fold, train the model on k-1 folds and test on the held out fold.

m.cross_validate_scorer(nsplits=k)

returns the mean of the errors across all folds.

Additional libraries
RandomForestStep is built on top of sklearn machine learninglibrary. Version 0.19.2 or more recent is required. Moreover to efficiently save and load complicated random forest models, modelmanager provides a pickle./unplickle utility based on dill (version 0.2.8.2 or more recent).

…he rhs variables

exit

…om model expression

…ation to random forest

smmaurer · 2018-10-15T21:12:45Z

Hi @Gitiauxx,

Apologies, I'm just getting a chance to review this. Generally looks great; here are some initial comments:

Looks like your text editor is mixing tabs and spaces, which makes some of the code hard to read on Github -- standard Python style is spaces only, which you can probably switch to in a setting. (PEP 8)

I believe it is fixed.

If it's not too much trouble, let's move the neural network template out of this PR and into a subsequent one, just to keep things simpler.

Done.

For _get_input_columns(): generally, we're trying to support Patsy-style model expressions, and use Patsy to parse them. Maybe there's a Patsy function to get the names of input columns? An advantage is it would automatically support inline transformations.

Yes and no:
The initial issue is that sklearn-based models (and probably other estimations libraries) do not rely on patsy formulas as inputs. This is an important difference with statsmodel besed models.

However, I do not need _get_input_columns() anymore because I created a utility that takes patsy formula and generates the corresponding inputs for sklearn models (see from_patsy_to_array in utils.py). It also creates a list of variable names.

Related to the issue of generalizing away from statsmodel based steps, see #43 with explanation for why we need a utility like convert_to_model to convert sklearn models' fit and predict methods into methods that mimic what statsmodel fit and predict methods do.

Can you add docstrings for all of your utilities, with detailed descriptions of the what the functions do, the parameters they take, and their return values?

Absolutely. I believe I went over all of them. Let me know if the format works for you.

Looks like there are duplicate modelmanager imports now in binary_logit.py, large_multinomial_logit.py, and small_multinomial_logit.py, probably from merging master into this branch or something.

Fixed. Probably my mistake when I merged master with another branch.

You can update the library version number to 0.1.dev16 in setup.py and urbansim_templates/__init__.py. (I figure we'll merge PR [0.1.dev15] Updating Large MNL template with better simulation functionality #51 before this one.)

Done.

Could you make a note in the PR writeup about the additional libraries (and their version numbers) that these templates require? We should add them to setup.py and also to our conda environments for working with the templates (eg here).

Done.

I haven't tried running your tests, but are you using pytest? It's a useful utility that makes it easier to run a whole suite of tests at once, either locally or on a continuous integration server (which we'll set up soon for this library). You can run it from the command line with pytest *.py. See my recent updates to test_large_multinomial_logit.py for an example of some of the features. You can use fixtures for data setup instead of __main__.

smmaurer · 2018-10-18T19:55:20Z

Awesome, thank you for all the fixes! This looks great.

I left some related comments in issues #43 and #50 too.

The functionality of the templates looks good, and the cross-validation is going to be really helpful.

A couple more issues with the tests:

The cross-validation and gradient boosting tests use data files that aren't committed to the repo. (I've just been using fake data for tests so far, but real data seems great.) I forget, are the files small enough that we can just add them to the repo? Under 1 MB would be great. If not, would the tests still work with smaller excerpts of the data?

The file paths in those tests are also Windows-specific; you can use os.path.join() to make them cross-platform.

And you can move the neural network tests to the branch with that model.

Once the tests run on my machine, let's merge this!

smmaurer · 2018-10-18T20:01:53Z

Just took a look at the merge conflicts (easy to fix in the web interface)..

setup.py: we should keep your version line
test_large_multinomial_logit.py: keep the length assertion line from master, otherwise keep additions from both branches
utils.py: keep additions from both branches

Gitiauxx · 2018-10-19T13:19:11Z

Thanks @smmaurer

I will go over and develop a comprehensive battery of tests for this PR using pytest.

…into xavier_testing :q :wq

Gitiauxx · 2018-10-24T13:15:40Z

@smmaurer

I start running a battery of tests (see most commits) for the code I wrote. So far they have passed on my machine (windows). You may want to try them on a mac.

smmaurer · 2018-10-25T19:36:25Z

@Gitiauxx Looking good, most of them are passing for me too.

Cross validation is failing with AttributeError: 'RandomForestRegressionStep' object has no attribute '_get_input_columns'

Looking at the yaml for Random Forest, i noticed a couple of things:

let's use consistent naming for the keys, so cross validation metric -> cross_validation_metric, features importance -> feature_importance
it's saving the model step name twice, in the name key (autogenerated by modelmanager) and also in the model key (generated in the template definition); the latter might not be needed?

Gitiauxx · 2018-10-25T23:42:10Z

@smmaurer

No problems.... I believe I fixed everything. cross_validate was failing because I forgot to push my last edit of shared.py...

xyzjayne · 2018-11-21T02:40:49Z

Thanks for creating this! Will there be a RF classifier on the way?

Gitiauxx and others added 5 commits September 24, 2018 12:20

Use decorator to register imported steps when initializing

4ae8385

add get_input_columns as a step attribute to avoid ugly code to get t…

7f6f916

…he rhs variables

Add random forest as as step

360e3d8

Add test for random forest

fdde25d

Merge branch 'master' into xavier_testing

e20edf6

Gitiauxx self-assigned this Sep 24, 2018

Add cross validation and feature importance (for random forest)

7f33e8f

exit

Gitiauxx requested a review from smmaurer September 26, 2018 22:06

Gitiauxx added the enhancement New feature or request label Sep 26, 2018

Gitiauxx added 8 commits October 2, 2018 12:23

add cross validation

577e4cc

add a parsing capability for random forest to eval numpy functions fr…

fb70a95

…om model expression

Add the possibility to parse output date and possible numpy transform…

c495729

…ation to random forest

Remove useless inputs in random forest steps

6f9b8fa

Fix __init__ for random forest

102fb60

add inheritance from OLSRegression in RF

066be5f

Fix bug to import/export model constructed from pickle files

5cb5251

add demo of different methods and some comparison across them

e91193d

Gitiauxx and others added 9 commits October 16, 2018 11:38

Add Doc String for utils

df023d8

Add Doc String for utils

4057264

Add examples in doc string

b161cd1

Add examples in doc string for regression

5dc5ea5

Add doc string for splits and cross_validate_score in shared.py

4754ccc

Add new required libraries in setup.py

96cffbd

Add new required libraries in environment.yml

5d151e7

remove duplicates import of modelmanager

637c2b2

Delete neuro_network.py

fc9a301

Gitiauxx added 3 commits October 19, 2018 09:39

Add a tag role to pickle file

e7553b1

branch 'xavier_testing' of https://github.com/udst/urbansim_templates …

ae7d465

…into xavier_testing :q :wq

Add tests for utils, random forest and gradient boosting

990e85a

Gitiauxx added 2 commits October 25, 2018 19:29

Remove get_inputs_columns

fcbc256

Follow naming convention

c6e40f5

smmaurer force-pushed the master branch from 71049ca to 11abed3 Compare November 19, 2018 17:45

smmaurer marked this pull request as draft February 16, 2021 23:49

smmaurer changed the base branch from master to main February 16, 2021 23:49

smmaurer changed the base branch from main to dev February 16, 2021 23:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Random Forest Step #46

Add Random Forest Step #46

Gitiauxx commented Sep 24, 2018 •

edited

Loading

smmaurer commented Oct 15, 2018 •

edited by Gitiauxx

Loading

smmaurer commented Oct 18, 2018

smmaurer commented Oct 18, 2018

Gitiauxx commented Oct 19, 2018

Gitiauxx commented Oct 24, 2018

smmaurer commented Oct 25, 2018

Gitiauxx commented Oct 25, 2018

xyzjayne commented Nov 21, 2018

Add Random Forest Step #46

Are you sure you want to change the base?

Add Random Forest Step #46

Conversation

Gitiauxx commented Sep 24, 2018 • edited Loading

smmaurer commented Oct 15, 2018 • edited by Gitiauxx Loading

smmaurer commented Oct 18, 2018

smmaurer commented Oct 18, 2018

Gitiauxx commented Oct 19, 2018

Gitiauxx commented Oct 24, 2018

smmaurer commented Oct 25, 2018

Gitiauxx commented Oct 25, 2018

xyzjayne commented Nov 21, 2018

Gitiauxx commented Sep 24, 2018 •

edited

Loading

smmaurer commented Oct 15, 2018 •

edited by Gitiauxx

Loading