The R package collinear
combines four different methods to offer a
comprehensive tool for multicollinearity management:
- Pairwise correlation for numeric and categorical predictors: identification of pairwise correlation via Pearson or Spearman methods for numeric predictors, and Cramer’s V for categorical predictors.
- Variance Inflation Factor analysis (VIF): to identify multicollinearity resulting from linear combinations of other predictors.
- Target encoding of categorical predictors: transforms categorical predictors to numeric using a numeric variable as a response (usually a response variable) and handle them as numerics during the multicollinearity filtering.
- Variable prioritization: method to prioritize predictors during variable selection using expert knowledge or quantitative criteria.
These methods are integrated in the collinear()
function, which
returns a vector of selected predictors with a user-defined level of
multicollinearity.
selected_variables <- collinear(
df, #your data frame
response, #name of your response variable
predictors, #names of your predictors,
preference_order, #your predictors in order of interest
max_cor, #maximum bivariate correlation
max_vif, #maximum variance inflation factor
encoding_method, #method to convert categorical predictors into numerics
)
The package contains other functions that may be useful during multicollinearity management:
cor_select()
: likecollinear()
, but only using pairwise correlations.vif_select()
: likecollinear()
, but only using variance inflation factors.preference_order()
: to compute preference order based on univariate models.target_encoding_lab()
: to convert categorical predictors into numeric using several methods.cor_df()
: to generate a data frame with all pairwise correlation scores.cor_matrix()
: to convert a correlation data frame into matrix, or obtain a correlation matrix.vif_df()
: to obtain a data frame with all variance inflation factors.
If you found this package useful during your research work, please cite it as:
Blas M. Benito (2023). collinear: R Package for Seamless Multicollinearity Management. Version 1.0.1. doi: 10.5281/zenodo.10039489
The package collinear
can be installed from CRAN.
install.packages("collinear")
library(collinear)
The development version can be installed from GitHub.
remotes::install_github(
repo = "blasbenito/collinear",
ref = "development"
)
This section shows the basic usage of the package and offers a brief explanation on the methods used within.
The libraries below are required to run the examples in this section.
library(collinear)
library(dplyr)
library(tictoc)
The package collinear
is shipped with a data frame named vi
, with
30.000 rows and 67 columns with a mixture of numeric and categorical
variables.
dplyr::glimpse(vi)
#> Rows: 30,000
#> Columns: 68
#> $ longitude <dbl> -114.254306, 114.845693, -122.145972, 108.3…
#> $ latitude <dbl> 45.0540272, 26.2706940, 56.3790272, 29.9456…
#> $ vi_mean <dbl> 0.38, 0.53, 0.45, 0.69, 0.42, 0.68, 0.70, 0…
#> $ vi_max <dbl> 0.57, 0.67, 0.65, 0.85, 0.64, 0.78, 0.77, 0…
#> $ vi_min <dbl> 0.12, 0.41, 0.25, 0.50, 0.25, 0.48, 0.60, 0…
#> $ vi_range <dbl> 0.45, 0.26, 0.40, 0.34, 0.39, 0.31, 0.17, 0…
#> $ vi_binary <dbl> 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0…
#> $ koppen_zone <chr> "BSk", "Cfa", "Dfc", "Cfb", "Aw", "Cfa", "A…
#> $ koppen_group <chr> "Arid", "Temperate", "Cold", "Temperate", "…
#> $ koppen_description <chr> "steppe, cold", "no dry season, hot summer"…
#> $ soil_type <chr> "Cambisols", "Acrisols", "Luvisols", "Aliso…
#> $ topo_slope <int> 6, 2, 0, 10, 0, 10, 6, 0, 2, 0, 0, 1, 0, 1,…
#> $ topo_diversity <int> 29, 24, 21, 25, 19, 30, 26, 20, 26, 22, 25,…
#> $ topo_elevation <int> 1821, 143, 765, 1474, 378, 485, 604, 1159, …
#> $ swi_mean <dbl> 27.5, 56.1, 41.4, 59.3, 37.4, 56.3, 52.3, 2…
#> $ swi_max <dbl> 62.9, 74.4, 81.9, 81.1, 83.2, 73.8, 55.8, 3…
#> $ swi_min <dbl> 24.5, 33.3, 42.2, 31.3, 8.3, 28.8, 25.3, 11…
#> $ swi_range <dbl> 38.4, 41.2, 39.7, 49.8, 74.9, 45.0, 30.5, 2…
#> $ soil_temperature_mean <dbl> 4.8, 19.9, 1.2, 13.0, 28.2, 18.1, 21.5, 23.…
#> $ soil_temperature_max <dbl> 29.9, 32.6, 20.4, 24.6, 41.6, 29.1, 26.4, 4…
#> $ soil_temperature_min <dbl> -12.4, 3.9, -16.0, -0.4, 16.8, 4.1, 17.3, 5…
#> $ soil_temperature_range <dbl> 42.3, 28.8, 36.4, 25.0, 24.8, 24.9, 9.1, 38…
#> $ soil_sand <int> 41, 39, 27, 29, 48, 33, 30, 78, 23, 64, 54,…
#> $ soil_clay <int> 20, 24, 28, 31, 27, 29, 40, 15, 26, 22, 23,…
#> $ soil_silt <int> 38, 35, 43, 38, 23, 36, 29, 6, 49, 13, 22, …
#> $ soil_ph <dbl> 6.5, 5.9, 5.6, 5.5, 6.5, 5.8, 5.2, 7.1, 7.3…
#> $ soil_soc <dbl> 43.1, 14.6, 36.4, 34.9, 8.1, 20.8, 44.5, 4.…
#> $ soil_nitrogen <dbl> 2.8, 1.3, 2.9, 3.6, 1.2, 1.9, 2.8, 0.6, 3.1…
#> $ solar_rad_mean <dbl> 17.634, 19.198, 13.257, 14.163, 24.512, 17.…
#> $ solar_rad_max <dbl> 31.317, 24.498, 25.283, 17.237, 28.038, 22.…
#> $ solar_rad_min <dbl> 5.209, 13.311, 1.587, 9.642, 19.102, 12.196…
#> $ solar_rad_range <dbl> 26.108, 11.187, 23.696, 7.595, 8.936, 10.20…
#> $ growing_season_length <dbl> 139, 365, 164, 333, 228, 365, 365, 60, 365,…
#> $ growing_season_temperature <dbl> 12.65, 19.35, 11.55, 12.45, 26.45, 17.75, 2…
#> $ growing_season_rainfall <dbl> 224.5, 1493.4, 345.4, 1765.5, 984.4, 1860.5…
#> $ growing_degree_days <dbl> 2140.5, 7080.9, 2053.2, 4162.9, 10036.7, 64…
#> $ temperature_mean <dbl> 3.65, 19.35, 1.45, 11.35, 27.55, 17.65, 22.…
#> $ temperature_max <dbl> 24.65, 33.35, 21.15, 23.75, 38.35, 30.55, 2…
#> $ temperature_min <dbl> -14.05, 3.05, -18.25, -3.55, 19.15, 2.45, 1…
#> $ temperature_range <dbl> 38.7, 30.3, 39.4, 27.3, 19.2, 28.1, 7.0, 29…
#> $ temperature_seasonality <dbl> 882.6, 786.6, 1070.9, 724.7, 219.3, 747.2, …
#> $ rainfall_mean <int> 446, 1493, 560, 1794, 990, 1860, 3150, 356,…
#> $ rainfall_min <int> 25, 37, 24, 29, 0, 60, 122, 1, 10, 12, 0, 0…
#> $ rainfall_max <int> 62, 209, 87, 293, 226, 275, 425, 62, 256, 3…
#> $ rainfall_range <int> 37, 172, 63, 264, 226, 215, 303, 61, 245, 2…
#> $ evapotranspiration_mean <dbl> 78.32, 105.88, 50.03, 64.65, 156.60, 108.50…
#> $ evapotranspiration_max <dbl> 164.70, 190.86, 117.53, 115.79, 187.71, 191…
#> $ evapotranspiration_min <dbl> 13.67, 50.44, 3.53, 28.01, 128.59, 51.39, 8…
#> $ evapotranspiration_range <dbl> 151.03, 140.42, 113.99, 87.79, 59.13, 139.9…
#> $ cloud_cover_mean <int> 31, 48, 42, 64, 38, 52, 60, 13, 53, 20, 11,…
#> $ cloud_cover_max <int> 39, 61, 49, 71, 58, 67, 77, 18, 60, 27, 23,…
#> $ cloud_cover_min <int> 16, 34, 33, 54, 19, 39, 45, 6, 45, 14, 2, 1…
#> $ cloud_cover_range <int> 23, 27, 15, 17, 38, 27, 32, 11, 15, 12, 21,…
#> $ aridity_index <dbl> 0.54, 1.27, 0.90, 2.08, 0.55, 1.67, 2.88, 0…
#> $ humidity_mean <dbl> 55.56, 62.14, 59.87, 69.32, 51.60, 62.76, 7…
#> $ humidity_max <dbl> 63.98, 65.00, 68.19, 71.90, 67.07, 65.68, 7…
#> $ humidity_min <dbl> 48.41, 58.97, 53.75, 67.21, 33.89, 59.92, 7…
#> $ humidity_range <dbl> 15.57, 6.03, 14.44, 4.69, 33.18, 5.76, 3.99…
#> $ biogeo_ecoregion <chr> "South Central Rockies forests", "Jian Nan …
#> $ biogeo_biome <chr> "Temperate Conifer Forests", "Tropical & Su…
#> $ biogeo_realm <chr> "Nearctic", "Indomalayan", "Nearctic", "Pal…
#> $ country_name <chr> "United States of America", "China", "Canad…
#> $ country_population <dbl> 313973000, 1338612970, 33487208, 1338612970…
#> $ country_gdp <dbl> 15094000, 7973000, 1300000, 7973000, 15860,…
#> $ country_income <chr> "1. High income: OECD", "3. Upper middle in…
#> $ continent <chr> "North America", "Asia", "North America", "…
#> $ region <chr> "Americas", "Asia", "Americas", "Asia", "Af…
#> $ subregion <chr> "Northern America", "Eastern Asia", "Northe…
The response variables are “vi_mean”, “vi_max”, “vi_min”, and
“vi_range”, with statistics of a vegetation index named NDVI. The
predictors are stored in the character vector vi_predictors
.
vi_predictors
#> [1] "koppen_zone" "koppen_group"
#> [3] "koppen_description" "soil_type"
#> [5] "topo_slope" "topo_diversity"
#> [7] "topo_elevation" "swi_mean"
#> [9] "swi_max" "swi_min"
#> [11] "swi_range" "soil_temperature_mean"
#> [13] "soil_temperature_max" "soil_temperature_min"
#> [15] "soil_temperature_range" "soil_sand"
#> [17] "soil_clay" "soil_silt"
#> [19] "soil_ph" "soil_soc"
#> [21] "soil_nitrogen" "solar_rad_mean"
#> [23] "solar_rad_max" "solar_rad_min"
#> [25] "solar_rad_range" "growing_season_length"
#> [27] "growing_season_temperature" "growing_season_rainfall"
#> [29] "growing_degree_days" "temperature_mean"
#> [31] "temperature_max" "temperature_min"
#> [33] "temperature_range" "temperature_seasonality"
#> [35] "rainfall_mean" "rainfall_min"
#> [37] "rainfall_max" "rainfall_range"
#> [39] "evapotranspiration_mean" "evapotranspiration_max"
#> [41] "evapotranspiration_min" "evapotranspiration_range"
#> [43] "cloud_cover_mean" "cloud_cover_max"
#> [45] "cloud_cover_min" "cloud_cover_range"
#> [47] "aridity_index" "humidity_mean"
#> [49] "humidity_max" "humidity_min"
#> [51] "humidity_range" "biogeo_ecoregion"
#> [53] "biogeo_biome" "biogeo_realm"
#> [55] "country_name" "country_population"
#> [57] "country_gdp" "country_income"
#> [59] "continent" "region"
#> [61] "subregion"
The collinear()
function applies a multicollinearity filtering to
numeric and categorical variables via pairwise correlations (with
cor_select()
) and variance inflation factors (with vif_select()
).
Categorical variables are converted into numeric via target-encoding
(with target_encoding_lab()
) using a response
variable as reference.
If the response variable is not provided, categorical variables are
ignored.
The function takes these inputs:
df
: a data frame with predictors, and preferably, a response (more about this later).response
: the name of the response variable, only relevant and highly recommended if there are categorical variables within the predictors.predictors
: names of predictors involved in the multicollinearity analysis.preference_order
: names of the predictors in the user’s order of preference. Does not need to name all predictors inpredictors
!cor_method
: usually “pearson”, but also “spearman” is accepted.max_cor
: maximum correlation allowed between two predictors.max_vif
: maximum VIF allowed in a predictor.encoding_method
: method used to convert categorical variables into numeric. Only relevant when aresponse
is provided. By default, each group of the categorical variable is encoded with the mean of theresponse
across the group.
The code below shows a quick example. Notice that the argument
preference_order
was left as NULL, but will be explained later.
selected_predictors <- collinear(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = NULL,
max_cor = 0.75,
max_vif = 5,
encoding_method = "mean"
)
selected_predictors
#> [1] "country_income" "topo_diversity"
#> [3] "topo_slope" "country_population"
#> [5] "country_gdp" "humidity_range"
#> [7] "soil_soc" "region"
#> [9] "soil_clay" "soil_type"
#> [11] "subregion" "biogeo_realm"
#> [13] "soil_sand" "topo_elevation"
#> [15] "soil_nitrogen" "swi_range"
#> [17] "koppen_group" "swi_min"
#> [19] "solar_rad_max" "rainfall_min"
#> [21] "growing_season_temperature" "rainfall_range"
#> [23] "solar_rad_min" "cloud_cover_range"
The function has returned a list of predictors that should have a correlation lower than 0.75 with each other, and a VIF lower than 5. Let’s see if that’s true.
The function cor_df()
returns a data frame with pairwise correlations,
arranged by the absolute value of the correlation.
selected_predictors_cor <- cor_df(
df = vi,
response = "vi_mean",
predictors = selected_predictors
)
head(selected_predictors_cor)
#> # A tibble: 6 × 3
#> x y correlation
#> <chr> <chr> <dbl>
#> 1 solar_rad_min growing_season_temperature 0.744
#> 2 koppen_group soil_type 0.732
#> 3 soil_nitrogen soil_soc 0.729
#> 4 swi_min soil_nitrogen 0.673
#> 5 soil_sand soil_clay -0.666
#> 6 koppen_group swi_range 0.659
The data frame above shows that the maximum correlation between two of
the selected predictors is below 0.75, so here collinear()
worked as
expected.
The function vif_df()
returns a data frame with the VIF scores of all
predictors.
selected_predictors_vif <- vif_df(
df = vi,
response = "vi_mean",
predictors = selected_predictors
)
selected_predictors_vif
#> variable vif
#> 1 country_income 1.215
#> 2 topo_diversity 1.662
#> 3 topo_slope 1.929
#> 4 humidity_range 2.043
#> 5 topo_elevation 2.101
#> 6 country_gdp 2.158
#> 7 country_population 2.171
#> 8 rainfall_min 2.269
#> 9 cloud_cover_range 2.418
#> 10 soil_soc 2.744
#> 11 region 2.849
#> 12 rainfall_range 2.876
#> 13 subregion 2.900
#> 14 soil_clay 2.966
#> 15 soil_type 2.991
#> 16 solar_rad_max 3.145
#> 17 biogeo_realm 3.150
#> 18 soil_sand 3.175
#> 19 soil_nitrogen 3.376
#> 20 swi_min 3.450
#> 21 swi_range 3.781
#> 22 koppen_group 4.151
#> 23 growing_season_temperature 4.314
#> 24 solar_rad_min 4.432
The output shows that the maximum VIF is 4.2, so here collinear()
did
its work as expected.
The arguments max_cor
and max_vif
control the intensity of the
multicollinearity filtering.
#restrictive setup
selected_predictors_restrictive <- collinear(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
max_cor = 0.5,
max_vif = 2.5
)
#permissive setup
selected_predictors_permissive <- collinear(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
max_cor = 0.9,
max_vif = 10
)
These are the variables selected under a restrictive setup:
selected_predictors_restrictive
#> [1] "country_income" "soil_clay"
#> [3] "country_population" "topo_slope"
#> [5] "humidity_range" "topo_elevation"
#> [7] "soil_soc" "soil_silt"
#> [9] "cloud_cover_range" "region"
#> [11] "solar_rad_max" "growing_season_temperature"
#> [13] "biogeo_realm"
These are the variables selected under a more permissive setup:
selected_predictors_permissive
#> [1] "country_income" "topo_diversity"
#> [3] "topo_slope" "country_population"
#> [5] "country_gdp" "soil_soc"
#> [7] "region" "soil_type"
#> [9] "soil_nitrogen" "subregion"
#> [11] "biogeo_realm" "topo_elevation"
#> [13] "koppen_group" "biogeo_biome"
#> [15] "country_name" "soil_ph"
#> [17] "aridity_index" "growing_season_temperature"
#> [19] "rainfall_min" "rainfall_range"
#> [21] "swi_mean" "soil_temperature_max"
#> [23] "solar_rad_mean" "temperature_seasonality"
#> [25] "soil_clay" "soil_silt"
#> [27] "cloud_cover_min" "swi_range"
#> [29] "humidity_range"
As expected, the restrictive setup resulted in a smaller set of selected
predictors. There are no hard rules for max_cor
and max_vif
, and
their selection will depend on the objective of the analysis and the
nature of the predictors.
The response argument is used to encode categorical variables as
numeric. When omitted, the collinear()
function ignores categorical
variables. However, the function cor_select()
can help when there is
not a suitable response
variable in a data frame. This option is
discussed at the end of this section.
selected_predictors_response <- collinear(
df = vi,
response = "vi_mean",
predictors = vi_predictors
)
selected_predictors_no_response <- collinear(
df = vi,
predictors = vi_predictors
)
When the argument response
is used, the output may contain categorical
predictors (tagged with <chr>
, from “character” below).
dplyr::glimpse(vi[, selected_predictors_response])
#> Rows: 30,000
#> Columns: 24
#> $ country_income <chr> "1. High income: OECD", "3. Upper middle in…
#> $ topo_diversity <int> 29, 24, 21, 25, 19, 30, 26, 20, 26, 22, 25,…
#> $ topo_slope <int> 6, 2, 0, 10, 0, 10, 6, 0, 2, 0, 0, 1, 0, 1,…
#> $ country_population <dbl> 313973000, 1338612970, 33487208, 1338612970…
#> $ country_gdp <dbl> 15094000, 7973000, 1300000, 7973000, 15860,…
#> $ humidity_range <dbl> 15.57, 6.03, 14.44, 4.69, 33.18, 5.76, 3.99…
#> $ soil_soc <dbl> 43.1, 14.6, 36.4, 34.9, 8.1, 20.8, 44.5, 4.…
#> $ region <chr> "Americas", "Asia", "Americas", "Asia", "Af…
#> $ soil_clay <int> 20, 24, 28, 31, 27, 29, 40, 15, 26, 22, 23,…
#> $ soil_type <chr> "Cambisols", "Acrisols", "Luvisols", "Aliso…
#> $ subregion <chr> "Northern America", "Eastern Asia", "Northe…
#> $ biogeo_realm <chr> "Nearctic", "Indomalayan", "Nearctic", "Pal…
#> $ soil_sand <int> 41, 39, 27, 29, 48, 33, 30, 78, 23, 64, 54,…
#> $ topo_elevation <int> 1821, 143, 765, 1474, 378, 485, 604, 1159, …
#> $ soil_nitrogen <dbl> 2.8, 1.3, 2.9, 3.6, 1.2, 1.9, 2.8, 0.6, 3.1…
#> $ swi_range <dbl> 38.4, 41.2, 39.7, 49.8, 74.9, 45.0, 30.5, 2…
#> $ koppen_group <chr> "Arid", "Temperate", "Cold", "Temperate", "…
#> $ swi_min <dbl> 24.5, 33.3, 42.2, 31.3, 8.3, 28.8, 25.3, 11…
#> $ solar_rad_max <dbl> 31.317, 24.498, 25.283, 17.237, 28.038, 22.…
#> $ rainfall_min <int> 25, 37, 24, 29, 0, 60, 122, 1, 10, 12, 0, 0…
#> $ growing_season_temperature <dbl> 12.65, 19.35, 11.55, 12.45, 26.45, 17.75, 2…
#> $ rainfall_range <int> 37, 172, 63, 264, 226, 215, 303, 61, 245, 2…
#> $ solar_rad_min <dbl> 5.209, 13.311, 1.587, 9.642, 19.102, 12.196…
#> $ cloud_cover_range <int> 23, 27, 15, 17, 38, 27, 32, 11, 15, 12, 21,…
However, when the argument response
is ignored, all categorical
predictors are ignored.
dplyr::glimpse(vi[, selected_predictors_no_response])
#> Rows: 30,000
#> Columns: 18
#> $ topo_diversity <int> 29, 24, 21, 25, 19, 30, 26, 20, 26, 22, 25,…
#> $ country_gdp <dbl> 15094000, 7973000, 1300000, 7973000, 15860,…
#> $ country_population <dbl> 313973000, 1338612970, 33487208, 1338612970…
#> $ topo_slope <int> 6, 2, 0, 10, 0, 10, 6, 0, 2, 0, 0, 1, 0, 1,…
#> $ soil_soc <dbl> 43.1, 14.6, 36.4, 34.9, 8.1, 20.8, 44.5, 4.…
#> $ soil_clay <int> 20, 24, 28, 31, 27, 29, 40, 15, 26, 22, 23,…
#> $ soil_sand <int> 41, 39, 27, 29, 48, 33, 30, 78, 23, 64, 54,…
#> $ soil_nitrogen <dbl> 2.8, 1.3, 2.9, 3.6, 1.2, 1.9, 2.8, 0.6, 3.1…
#> $ humidity_range <dbl> 15.57, 6.03, 14.44, 4.69, 33.18, 5.76, 3.99…
#> $ topo_elevation <int> 1821, 143, 765, 1474, 378, 485, 604, 1159, …
#> $ cloud_cover_range <int> 23, 27, 15, 17, 38, 27, 32, 11, 15, 12, 21,…
#> $ solar_rad_max <dbl> 31.317, 24.498, 25.283, 17.237, 28.038, 22.…
#> $ rainfall_min <int> 25, 37, 24, 29, 0, 60, 122, 1, 10, 12, 0, 0…
#> $ growing_season_temperature <dbl> 12.65, 19.35, 11.55, 12.45, 26.45, 17.75, 2…
#> $ rainfall_range <int> 37, 172, 63, 264, 226, 215, 303, 61, 245, 2…
#> $ swi_range <dbl> 38.4, 41.2, 39.7, 49.8, 74.9, 45.0, 30.5, 2…
#> $ solar_rad_min <dbl> 5.209, 13.311, 1.587, 9.642, 19.102, 12.196…
#> $ swi_min <dbl> 24.5, 33.3, 42.2, 31.3, 8.3, 28.8, 25.3, 11…
If there are categorical variables in a data frame, but there is no
suitable response
variable, then the function cor_select()
can
handle the multicollinearity management via pairwise correlations, but
at a MUCH higher computational cost, and with different results, as
shown below.
tictoc::tic()
selected_predictors_response <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors
)
tictoc::toc()
#> 0.414 sec elapsed
tictoc::tic()
selected_predictors_no_response <- cor_select(
df = vi,
predictors = vi_predictors
)
tictoc::toc()
#> 34.603 sec elapsed
selected_predictors_response
#> [1] "country_population" "topo_elevation"
#> [3] "country_income" "country_gdp"
#> [5] "topo_slope" "humidity_range"
#> [7] "soil_clay" "topo_diversity"
#> [9] "soil_sand" "cloud_cover_range"
#> [11] "region" "growing_season_temperature"
#> [13] "solar_rad_min" "soil_soc"
#> [15] "rainfall_min" "swi_range"
#> [17] "soil_nitrogen" "rainfall_range"
#> [19] "temperature_max" "swi_min"
#> [21] "subregion" "temperature_seasonality"
#> [23] "biogeo_realm" "cloud_cover_min"
#> [25] "soil_type" "aridity_index"
#> [27] "solar_rad_max" "koppen_group"
#> [29] "cloud_cover_max"
selected_predictors_no_response
#> [1] "topo_elevation" "topo_slope"
#> [3] "country_population" "topo_diversity"
#> [5] "soil_clay" "humidity_range"
#> [7] "soil_sand" "country_gdp"
#> [9] "cloud_cover_range" "country_income"
#> [11] "rainfall_min" "soil_soc"
#> [13] "swi_range" "growing_season_temperature"
#> [15] "rainfall_range" "soil_nitrogen"
#> [17] "solar_rad_min" "aridity_index"
#> [19] "cloud_cover_min" "temperature_max"
#> [21] "region" "swi_min"
#> [23] "solar_rad_max" "evapotranspiration_range"
#> [25] "swi_mean" "humidity_max"
#> [27] "soil_type"
The variable selection results differ because the numeric
representations of the categorical variables are rather different
between the two options. When no response
is provided, the function
cor_select()
compares categorical predictors against numeric ones by
encoding each categorical after each numeric, and compares pairs of
categoricals using Cramer’s V, implemented in the function cramer_v()
.
Additionally, Cramer’s V values are not directly comparable with Pearson
or Spearman correlation scores, and having them together in the same
analysis might induce bias during the variable selection. Not using the
response
argument should always be the last option.
The argument preference_order
gives the user some control on what
predictors should be removed first and what predictors should be kept
during the multicollinearity filtering. This argument accepts a vector
of predictor names in the order of interest, or the result of the
function preference_order()
, which allows to define preference order
following a quantitative criteria.
Let’s start with the former option. Below, the argument
preference_order
names several predictors that are of importance for a
hypothetical analysis. The predictors
not in preference_order
are
ranked by the absolute sum of their correlations with other predictors
during the pairwise correlation filtering, and by their VIF during the
VIF-based filtering.
selected_predictors <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = c(
"soil_temperature_mean",
"soil_temperature_max",
"soil_type"
)
)
selected_predictors
#> [1] "soil_temperature_mean" "soil_temperature_max" "soil_type"
#> [4] "country_population" "topo_elevation" "country_income"
#> [7] "country_gdp" "topo_slope" "humidity_range"
#> [10] "soil_clay" "topo_diversity" "soil_sand"
#> [13] "cloud_cover_range" "region" "soil_soc"
#> [16] "rainfall_min" "solar_rad_range" "swi_range"
#> [19] "soil_nitrogen" "rainfall_range" "subregion"
#> [22] "biogeo_realm" "aridity_index" "solar_rad_max"
#> [25] "koppen_group" "soil_temperature_range" "cloud_cover_max"
Notice that in the output, two of the variables in preference_order
are selected (“soil_temperature_mean” and “soil_type”), but one was
removed (“soil_temperature_max”). This happens because at some point in
the selection, the VIF of “soil_temperature_mean” and
“soil_temperature_max” was higher than max_vif
, and the one with lower
preference was removed.
The function preference_order()
requires the response
argument, and
takes a function f
that returns a value of association between the
response and any predictor. This value is then located in the
“preference” column of the function’s output.
preference_rsquared <- preference_order(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
f = f_rsquared,
workers = 1 #requires package future and future.apply for more workers
)
preference_rsquared
#> predictor preference
#> 1 biogeo_ecoregion 0.8971347093
#> 2 growing_season_length 0.8076216576
#> 3 koppen_zone 0.8050280970
#> 4 koppen_description 0.7903458680
#> 5 soil_ph 0.7664428862
#> 6 swi_mean 0.7286901614
#> 7 humidity_mean 0.7141389404
#> 8 koppen_group 0.6996959734
#> 9 biogeo_biome 0.6515724588
#> 10 country_name 0.6448346803
#> 11 cloud_cover_mean 0.6338773126
#> 12 soil_type 0.6318025761
#> 13 rainfall_mean 0.6005761078
#> 14 humidity_max 0.5876622545
#> 15 soil_temperature_max 0.5827628810
#> 16 swi_max 0.5813558512
#> 17 cloud_cover_max 0.5758002449
#> 18 humidity_min 0.5705720164
#> 19 growing_season_rainfall 0.5697006759
#> 20 soil_temperature_range 0.5523074848
#> 21 biogeo_realm 0.5031101984
#> 22 solar_rad_max 0.4905225950
#> 23 evapotranspiration_max 0.4814731607
#> 24 rainfall_max 0.4783927311
#> 25 aridity_index 0.4506424015
#> 26 subregion 0.4469207404
#> 27 swi_range 0.4217411381
#> 28 cloud_cover_min 0.4135724066
#> 29 evapotranspiration_range 0.4042241481
#> 30 temperature_range 0.3753489250
#> 31 rainfall_range 0.3545446680
#> 32 temperature_seasonality 0.2499469281
#> 33 rainfall_min 0.2484813976
#> 34 swi_min 0.2406964836
#> 35 solar_rad_mean 0.2140860965
#> 36 soil_nitrogen 0.1872886789
#> 37 continent 0.1818717607
#> 38 temperature_max 0.1589418736
#> 39 region 0.1505256024
#> 40 soil_soc 0.1493958026
#> 41 evapotranspiration_mean 0.1455828419
#> 42 solar_rad_range 0.1300751363
#> 43 temperature_min 0.1222051434
#> 44 cloud_cover_range 0.1216812855
#> 45 soil_temperature_min 0.1018471531
#> 46 topo_diversity 0.0925948262
#> 47 soil_clay 0.0769366113
#> 48 humidity_range 0.0575393339
#> 49 country_income 0.0489946403
#> 50 soil_sand 0.0427943817
#> 51 topo_elevation 0.0424759731
#> 52 growing_season_temperature 0.0239161476
#> 53 topo_slope 0.0203697134
#> 54 soil_temperature_mean 0.0170527033
#> 55 temperature_mean 0.0067479780
#> 56 soil_silt 0.0059316757
#> 57 growing_degree_days 0.0047849144
#> 58 evapotranspiration_min 0.0009965488
#> 59 country_gdp 0.0008850479
#> 60 solar_rad_min 0.0005751350
#> 61 country_population 0.0002513147
The result of preference_order()
can be plugged right away into the
preference_order
argument of collinear.
selected_predictors <- collinear(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = preference_rsquared
)
selected_predictors
#> [1] "biogeo_ecoregion" "biogeo_realm"
#> [3] "solar_rad_max" "rainfall_max"
#> [5] "subregion" "swi_range"
#> [7] "rainfall_min" "soil_nitrogen"
#> [9] "continent" "soil_soc"
#> [11] "cloud_cover_range" "topo_diversity"
#> [13] "soil_clay" "humidity_range"
#> [15] "country_income" "soil_sand"
#> [17] "topo_elevation" "growing_season_temperature"
#> [19] "topo_slope" "country_gdp"
#> [21] "country_population"
This variable selection satisfies three conditions at once: maximum correlation between each predictor and the response, maximum pairwise correlation, and maximum VIF.
The f
argument used by default is the function f_rsquared()
, that
returns the R-squared between the response and any predictor.
f_rsquared(
x = "growing_season_length",
y = "vi_mean",
df = vi
)
#> [1] 0.8076217
There are several other f
functions implemented:
f_gam_deviance()
: returns the explained deviance of a univariate GAM model between the response and each predictor, fitted with the functionmgcv::gam()
. Only if the R packagemgcv
is installed in the system.f_rf_rsquared()
(also namedf_rf_deviance()
): returns the explained deviance of a univariate Random Forest model between the response and each predictor, fitted with the functionranger::ranger()
. Only if the R packageranger
is installed in the system.f_logistic_auc_balanced()
andf_logistic_auc_unbalanced()
: return the area under the ROC curve of univariate binomial GLM between a binary response of 1s and 0s and a numeric predictor. The former assumes the response is balanced, while the latter applies case weights to mitigate unbalances.f_gam_auc_balanced()
andf_gam_auc_unbalanced()
: return the area under the ROC curve of univariate binomial GAM between a binary response of 1s and 0s and a numeric predictor. The former assumes the response is balanced, while the latter applies case weights to mitigate unbalances.f_rf_auc_balanced()
andf_rf_auc_unbalanced()
: return the area under the ROC curve of univariate random forest models between a binary response of 1s and 0s and a numeric predictor. The former assumes the response is balanced, while the latter applies case weights to mitigate unbalances.
#example of preference order for a binary variable
#the binary variable
table(vi$vi_binary)
#computation of preference order with
preference_auc <- preference_order(
df = vi,
response = "vi_binary",
predictors = vi_predictors,
f = f_gam_auc_unbalanced,
workers = 1 #requires package future and future.apply for more workers
)
preference_auc
Custom functions created by the user are also accepted as input, as long
as they have the x
, y
, and df
arguments, and they return a single
numeric value.
These can be run in parallel across predictors by increasing the value
of the workers
argument if the R packages future
and future.apply
are installed in the system.
The functions cor_select()
and vif_select()
, called within
collinear()
, perform the pairwise correlation filtering, and the
VIF-based filtering. The main difference between them is that
cor_select()
can handle categorical predictors even when the
response
is omitted, while vif_select()
ignores them entirely in
such case.
selected_predictors_cor <- cor_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = preference_rsquared
)
selected_predictors_vif <- vif_select(
df = vi,
response = "vi_mean",
predictors = vi_predictors,
preference_order = preference_rsquared
)
selected_predictors_cor
#> [1] "biogeo_ecoregion" "soil_temperature_max"
#> [3] "soil_temperature_range" "biogeo_realm"
#> [5] "solar_rad_max" "rainfall_max"
#> [7] "aridity_index" "subregion"
#> [9] "swi_range" "rainfall_min"
#> [11] "solar_rad_mean" "soil_nitrogen"
#> [13] "continent" "soil_soc"
#> [15] "solar_rad_range" "cloud_cover_range"
#> [17] "topo_diversity" "soil_clay"
#> [19] "humidity_range" "country_income"
#> [21] "soil_sand" "topo_elevation"
#> [23] "growing_season_temperature" "topo_slope"
#> [25] "country_gdp" "country_population"
selected_predictors_vif
#> [1] "biogeo_ecoregion" "soil_type" "rainfall_mean"
#> [4] "biogeo_realm" "solar_rad_max" "subregion"
#> [7] "soil_nitrogen" "continent" "soil_soc"
#> [10] "topo_diversity" "soil_clay" "country_income"
#> [13] "soil_sand" "topo_slope" "country_gdp"
#> [16] "country_population"
The function target_encoding_lab()
is used within all other functions
in the package to encode categorical variables as numeric. It implements
four target encoding methods:
- “mean” (in
target_encoding_mean()
): replaces each category with the mean of the response across the category. White noise can be added to this option to increase data variability. - “rank” (in
target_encoding_rank()
): replaces each category with the rank of the mean of the response across the category. - “rnorm” (in
target_encoding_rnorm()
): replaces each value in a category with a number generated bystats::rnorm()
from a normal distribution with the mean and the standard deviation of the response over the category. - “loo” (in
target_encoding_loo()
): replaces each value in a category with the mean of the response across all the other cases within the category. White noise can be added to this option to increase data variability.
The method “mean” is used as default throughout all functions in the
package, but can be changed via the argument encoding_method
.
Below we use all methods to generate different numeric encodings for the categorical variable “koppen_zone”.
df <- target_encoding_lab(
df = vi,
response = "vi_mean",
predictors = "koppen_zone",
encoding_methods = c(
"mean",
"rank",
"rnorm",
"loo"
),
seed = 1,
rnorm_sd_multiplier = c(0, 0.01, 0.1),
white_noise = c(0, 0.01, 0.1),
verbose = TRUE
)
#>
#> Encoding the predictor: koppen_zone
#> New encoded predictor: 'koppen_zone__encoded_rank'
#> New encoded predictor: 'koppen_zone__encoded_mean'
#> New encoded predictor: 'koppen_zone__encoded_loo'
#> New encoded predictor: 'koppen_zone__encoded_rank__noise_0.01'
#> New encoded predictor: 'koppen_zone__encoded_mean__noise_0.01'
#> New encoded predictor: 'koppen_zone__encoded_loo__noise_0.01'
#> New encoded predictor: 'koppen_zone__encoded_rank__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_mean__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_loo__noise_0.1'
#> New encoded predictor: 'koppen_zone__encoded_rnorm'
#> New encoded predictor: 'koppen_zone__encoded_rnorm__sd_multiplier_0.01'
#> New encoded predictor: 'koppen_zone__encoded_rnorm__sd_multiplier_0.1'
The relationship between these encoded versions of “koppen_zone” and the response are shown below.
#get names of encoded variables
koppen_zone_encoded <- grep(
pattern = "*__encoded*",
x = colnames(df),
value = TRUE
)
#record the user's graphical parameters
user.par <- par(no.readonly = TRUE)
#modify graphical parameters for the plot
par(mfrow = c(4, 3))
#plot target encoding
x <- lapply(
X = koppen_zone_encoded,
FUN = function(x) plot(
x = df[[x]],
y = df$vi_mean,
xlab = x,
ylab = "vi_mean",
cex = 0.5
)
)
#reset the user's graphical parameters
par(user.par)
The function implementing each method can be used directly as well. The
example below shows the “mean” method with the option replace = FALSE
,
which replaces the categorical values with the numeric ones in the
output data frame.
head(vi[, c("vi_mean", "koppen_zone")], n = 10)
#> vi_mean koppen_zone
#> 1 0.38 BSk
#> 2 0.53 Cfa
#> 3 0.45 Dfc
#> 4 0.69 Cfb
#> 5 0.42 Aw
#> 6 0.68 Cfa
#> 7 0.70 Af
#> 8 0.26 BSh
#> 9 0.55 Cwa
#> 10 0.16 BWh
df <- target_encoding_mean(
df = vi,
response = "vi_mean",
predictor = "koppen_zone",
replace = TRUE
)
head(df[, c("vi_mean", "koppen_zone")], n = 10)
#> vi_mean koppen_zone
#> 1 0.38 0.2487370
#> 2 0.53 0.5661689
#> 3 0.45 0.4338492
#> 4 0.69 0.5889908
#> 5 0.42 0.5275241
#> 6 0.68 0.5661689
#> 7 0.70 0.6708994
#> 8 0.26 0.3230049
#> 9 0.55 0.5218936
#> 10 0.16 0.1330452
If you got here, thank you for your interest in collinear
. I hope you
can find it useful!
Blas M. Benito, PhD