Predicting sale prices via regression

Data source:

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

The main notebook is included, but it is very large. To read the code, please use the HTML version of the notebook instead:

https://florinandrei.github.io/predict-sale-prices-regression/main_notebook.html

The easiest way to run the notebook is to fork the version I have on Kaggle. Most of the actual compute work is disabled; you will have to re-enable the steps you want to run.

https://www.kaggle.com/code/florinandrei/sklearn-pipelines-stacking-target-encoding-pca


The dataset contains approximately 2919 observations, split almost equally between train and test. There are no target values for the test data - that is the part that needs to be predicted. There are 79 features, divided almost equally between purely numeric, ordinal (categorical, ordered), and nominal (categorical, unordered). Some features contain NaN values, and a few contain many. The target to be predicted is the sale price of each house in the test data.
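
For orientation, a minimal sketch of loading and inspecting the data; the file names and the "Id" index column follow the Kaggle competition layout:

```python
import pandas as pd

# Kaggle competition files: train.csv has the SalePrice target, test.csv does not
train = pd.read_csv("train.csv", index_col="Id")
test = pd.read_csv("test.csv", index_col="Id")
print(train.shape, test.shape)  # about (1460, 80) and (1459, 79)

# Features ranked by missing-value count
na_counts = train.isna().sum().sort_values(ascending=False)
print(na_counts[na_counts > 0])
```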

Virtually the whole workflow is built as scikit-learn pipelines. There is a substantial amount of feature engineering, primarily required by the penalized linear regression models used here. Gradient-boosted tree models have also been used.
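
As a hypothetical illustration of why the linear models drive most of the feature engineering: they need the imputation and scaling that the tree models can largely skip. A minimal pipeline (placeholder parameters, numeric features only) might look like this:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

# Penalized linear models are sensitive to feature scale and cannot
# handle NaN values, so impute and scale before fitting.
linear_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", Ridge(alpha=10.0)),
])
```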

Other techniques used to construct the model pipelines (see the sketch after this list):

  • PCA (finding outliers, clustering, dimensionality reduction)
  • k-means clustering (to flag potential clusters in the data to the models)
  • target encoding
  • various feature transformations, applied to single features or to combinations of features
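
A rough sketch of how these pieces can be combined in one pipeline. The column names, cluster count, and component count are placeholders, not the values from the notebook, and TargetEncoder here is scikit-learn's own (1.3+); the notebook may use a different implementation:

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, FunctionTransformer, TargetEncoder
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

numeric_cols = ["GrLivArea", "TotalBsmtSF", "LotArea"]  # placeholders
categorical_cols = ["Neighborhood", "MSZoning"]         # placeholders

# Numeric branch: impute and scale, then append PCA components and
# distances to the k-means cluster centers as extra features.
numeric_branch = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("expand", FeatureUnion([
        ("identity", FunctionTransformer()),  # keep the original columns
        ("pca", PCA(n_components=2)),
        ("kmeans", KMeans(n_clusters=5, n_init=10, random_state=0)),
    ])),
])

preprocess = ColumnTransformer([
    ("num", numeric_branch, numeric_cols),
    # TargetEncoder replaces each category with a smoothed mean of the target
    ("cat", TargetEncoder(), categorical_cols),
])

pipe = Pipeline([("prep", preprocess), ("model", Ridge())])
```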

Several baseline models have been trained: XGBoost, LightGBM, CatBoost, Ridge, ElasticNet.
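
A sketch of how such baselines can be compared; the competition is scored on the RMSE of log prices, so the target is log-transformed first (train is reused from the loading sketch above, and the parameters are untuned defaults):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge, ElasticNet
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# The Kaggle metric is RMSE on log(SalePrice), so train on the log target
y_log = np.log1p(train["SalePrice"])
X = train.drop(columns="SalePrice").select_dtypes("number")  # numeric only, for brevity

models = {
    "xgb": XGBRegressor(),
    "lgbm": LGBMRegressor(),
    "catboost": CatBoostRegressor(verbose=0),
    "ridge": Ridge(),
    "enet": ElasticNet(),
}
for name, model in models.items():
    pipe = make_pipeline(SimpleImputer(strategy="median"), model)
    score = -cross_val_score(pipe, X, y_log, cv=5,
                             scoring="neg_root_mean_squared_error").mean()
    print(f"{name}: RMSE(log price) = {score:.4f}")
```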

Optuna was used for:

  • selecting the best steps (feature transformations) in each pipeline
  • model tuning

The pipeline steps and the model parameters are tuned together with Optuna, in a single optimization loop (sketched below).
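
A condensed, hypothetical sketch of that joint optimization; the real search space is much larger, and X and y_log are reused from the baseline sketch above:

```python
import optuna
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, PowerTransformer
from sklearn.linear_model import Ridge

def objective(trial):
    # The pipeline structure (which transformer to use) and the model
    # hyperparameters are sampled in the same trial
    scaler_name = trial.suggest_categorical("scaler", ["standard", "yeo-johnson"])
    scaler = StandardScaler() if scaler_name == "standard" else PowerTransformer()
    alpha = trial.suggest_float("alpha", 1e-3, 1e3, log=True)

    pipe = make_pipeline(SimpleImputer(strategy="median"), scaler, Ridge(alpha=alpha))
    return cross_val_score(pipe, X, y_log, cv=5,
                           scoring="neg_root_mean_squared_error").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```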

Ensemble models (voting and stacking) have been trained, optimized, and tested at the end.

The best cross-validation performance was obtained from the voting regressor using XGBoost, LightGBM, CatBoost, and Ridge as base regressors.
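
A sketch of such a voting ensemble, with untuned placeholder parameters; in practice, each base model carries its Optuna-tuned pipeline and hyperparameters:

```python
from sklearn.ensemble import VotingRegressor
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# VotingRegressor averages the predictions of the base regressors
voter = VotingRegressor([
    ("xgb", XGBRegressor()),
    ("lgbm", LGBMRegressor()),
    ("catboost", CatBoostRegressor(verbose=0)),
    ("ridge", make_pipeline(SimpleImputer(strategy="median"),
                            StandardScaler(), Ridge())),
])
voter.fit(X, y_log)
```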


Training the base predictors in an Optuna loop was very compute-intensive. The repo contains Terraform and Ansible scripts to spin up an AWS EC2 cluster to speed up training. The final models used here were trained on a total of 160 CPU cores in EC2, which took about half a day.

By default, the code is designed to scale up the number of workers and use all available CPUs; a sketch of the idea follows.
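
The distribution across EC2 nodes is handled by the Terraform/Ansible scripts; on a single machine, the same idea can be sketched with Optuna's built-in parallelism (reusing the objective function from the earlier sketch):

```python
import os
import optuna

study = optuna.create_study(direction="maximize")
# n_jobs=-1 runs as many concurrent trials as there are CPUs
study.optimize(objective, n_trials=2000, n_jobs=-1)
print(f"optimized on {os.cpu_count()} CPUs")
```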
