rf

This is a Go implementation of the random forest algorithm for classification and regression. Both the random forest and the decision tree are usable as standalone Go packages. The CLI can fit a model from a CSV file and make predictions with a previously fitted model. The CSV parser is rather limited: only numeric feature values are accepted.


Install

go get github.com/wlattner/rf

Usage

Fit

A model can be fitted from a CSV file; the label or target value should be in the first column and the remaining columns should be numeric features. The file may contain a header row. If a header row is present, the column names are used as variable names in the variable importance report (see below). For example, the iris data would appear as:

"Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"
"setosa",5.1,3.5,1.4,0.2
"setosa",4.9,3,1.4,0.2
"setosa",4.7,3.2,1.3,0.2
"setosa",4.6,3.1,1.5,0.2
"setosa",5,3.6,1.4,0.2
"setosa",5.4,3.9,1.7,0.4
"setosa",4.6,3.4,1.4,0.3
"setosa",5,3.4,1.5,0.2
"setosa",4.4,2.9,1.4,0.2
...

Assuming these data are in a file named iris.csv, a model would be fitted with the following command:

rf -d iris.csv -f iris.model

Args

-d, --data arg example data

-f, --final_model arg (=rf.model) file to output fitted model

--var_importance arg file to output variable importance estimates

--trees arg (=10) number of trees to include in forest

--min_split arg (=2) minimum number of samples required to split an internal node

--min_leaf arg (=1) minimum number of samples in newly created leaves

--max_features arg (=-1) number of features to consider when looking for the best split, -1 will default to √(# features)

--impurity arg (=gini) the measure to use for evaluating candidate splits, must be gini or entropy

--workers arg (=1) number of workers for fitting trees

-c, --classification force parser to use integer/numeric labels for classification
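For reference, the gini and entropy options named by --impurity are the standard impurity measures for classification splits. A minimal sketch of both, computed from class counts (standard formulas, not this repository's actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// gini returns the Gini impurity, 1 - sum(p_i^2), for a slice of class counts.
func gini(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	if total == 0 {
		return 0
	}
	imp := 1.0
	for _, c := range counts {
		p := float64(c) / float64(total)
		imp -= p * p
	}
	return imp
}

// entropy returns -sum(p_i * log2(p_i)) for a slice of class counts.
func entropy(counts []int) float64 {
	total := 0
	for _, c := range counts {
		total += c
	}
	var h float64
	for _, c := range counts {
		if c == 0 {
			continue
		}
		p := float64(c) / float64(total)
		h -= p * math.Log2(p)
	}
	return h
}

func main() {
	pure := []int{10, 0}  // all samples in one class: impurity 0
	mixed := []int{5, 5}  // evenly split: maximum impurity for 2 classes
	fmt.Printf("gini(pure)=%.2f gini(mixed)=%.2f\n", gini(pure), gini(mixed))
	fmt.Printf("entropy(pure)=%.2f entropy(mixed)=%.2f\n", entropy(pure), entropy(mixed))
}
```

Both measures are zero for a pure node and maximal for an even class split; in practice they usually pick very similar splits.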

Regression is also supported; the CSV parser detects whether the first column is numeric or categorical. If the class labels look like numbers:

"Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"
"1",5.1,3.5,1.4,0.2
"1",4.9,3,1.4,0.2
...
"2",5.9,3,5.1,1.8

the parser will get confused and fit a regression model (it can't tell the difference between "1" and 1). If this happens, run with the --classification flag.
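A plausible sketch of this kind of detection (hypothetical, not the repository's actual parser): treat the label column as numeric if every value parses as a float, which is exactly why quoted labels like "1" and "2" would trigger regression.

```go
package main

import (
	"fmt"
	"strconv"
)

// looksNumeric reports whether every label parses as a float.
// Note this cannot distinguish numeric class labels from regression
// targets, hence the need for an explicit --classification flag.
func looksNumeric(labels []string) bool {
	for _, l := range labels {
		if _, err := strconv.ParseFloat(l, 64); err != nil {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(looksNumeric([]string{"setosa", "versicolor"})) // false: classification
	fmt.Println(looksNumeric([]string{"1", "1", "2"}))          // true: looks like regression
}
```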

Output

After the input data are parsed and the forest is fitted, rf writes a diagnostic report to stderr.

Fit 10 trees using 150 examples in 0.00 seconds

Variable Importance
-------------------
Petal.Length   : 0.55
Petal.Width    : 0.40
Sepal.Width    : 0.03
Sepal.Length   : 0.02

Confusion Matrix
----------------
               setosa         versicolor     virginica
setosa         50             1              0
versicolor     0              46             5
virginica      0              3              45

Overall Accuracy: 94.00%

The confusion matrix and overall accuracy are estimated from the out-of-bag samples for each tree in the forest. The report shows up to 20 variables in the variable importance section, in decreasing order of importance. If your data have more predictors, the importance estimates for all variables can be written to a CSV file with the --var_importance flag.

For a regression model:

Fit 10 trees using 506 examples in 0.01 seconds

Variable Importance
-------------------
rm             : 0.38
lstat          : 0.30
nox            : 0.12
crim           : 0.05
dis            : 0.03
ptratio        : 0.03
tax            : 0.02
age            : 0.02
black          : 0.02
rad            : 0.01
indus          : 0.01
zn             : 0.01
chas           : 0.00


Mean Squared Error: 15.677
R-Squared: 81.487%

The mean squared error is computed from the out-of-bag samples for each tree in the forest. Variable importance is reported in the same manner as for classification.
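Both regression metrics follow their usual definitions: MSE is the mean of the squared residuals, and R-squared is 1 - SS_res/SS_tot. A self-contained sketch with made-up example values (standard formulas, not the tool's code):

```go
package main

import "fmt"

// mse returns the mean squared error between actual and predicted values.
func mse(actual, predicted []float64) float64 {
	var ss float64
	for i := range actual {
		d := actual[i] - predicted[i]
		ss += d * d
	}
	return ss / float64(len(actual))
}

// r2 returns the coefficient of determination, 1 - SS_res/SS_tot.
func r2(actual, predicted []float64) float64 {
	var mean float64
	for _, y := range actual {
		mean += y
	}
	mean /= float64(len(actual))

	var ssRes, ssTot float64
	for i, y := range actual {
		d := y - predicted[i]
		ssRes += d * d
		ssTot += (y - mean) * (y - mean)
	}
	return 1 - ssRes/ssTot
}

func main() {
	// Hypothetical out-of-bag predictions for four examples.
	actual := []float64{24.0, 21.6, 34.7, 33.4}
	predicted := []float64{23.1, 22.0, 33.9, 34.2}
	fmt.Printf("Mean Squared Error: %.3f\n", mse(actual, predicted))
	fmt.Printf("R-Squared: %.3f%%\n", r2(actual, predicted)*100)
}
```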

Predict

Predictions can be made from a previously fitted model. The data for making predictions should be in a CSV file with a format similar to the data used to fit the model; however, the first column is ignored.

"Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"
"",5.1,3.5,1.4,0.2
"",4.9,3,1.4,0.2
"",4.7,3.2,1.3,0.2
"",4.6,3.1,1.5,0.2
"",5,3.6,1.4,0.2
"",5.4,3.9,1.7,0.4
"",4.6,3.4,1.4,0.3
"",5,3.4,1.5,0.2
"",4.4,2.9,1.4,0.2
...
Assuming these rows are in iris.csv, predictions are written to iris_predictions.csv with:

rf -d iris.csv -p iris_predictions.csv -f iris.model

Args

-d, --data arg example data

-p, --predictions arg file to output predictions

-f, --final_model arg (=rf.model) file with previously fitted model

Docs

Documentation for the two packages, forest and tree, can be found on GoDoc. tree implements decision trees, while forest implements random forests on top of tree. See rf.go in this repository for an example of using the forest package.

forest: http://godoc.org/github.com/wlattner/rf/forest

tree: http://godoc.org/github.com/wlattner/rf/tree

References

[1] Louppe, G. (2014). "Understanding Random Forests: From Theory to Practice". PhD thesis.

[2] Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5-32.
