Skip to content

Latest commit

History

History

scikit_learn_data_processing_and_model_evaluation

Folders and files

NameName
Last commit message
Last commit date

parent directory

..

Scikit-Learn Data Processing and Model Evaluation

This notebook shows how you can:

  • run a processing job to run a Scikit-Learn script to clean, pre-process, perform feature engineering, and split the input data into train and test sets.
  • run a training job on the pre-processed training data to train a model model
  • run a processing job on the pre-processed test data to evaluate the trained model's performance
  • use your own custom container with to run processing jobs with your own Python libraries and dependencies.

The dataset used is the Census-Income KDD Dataset. We will select features from this dataset, clean the data, and turn the data into features that our training algorithm can use to train a binary classification model, and split the data into train and test sets.

The task is to predict whether rows representing census responders have an income greater than $50K, or less than 50K. The dataset is heavily class imbalanced, with most records being labeled as earning less than $50K. After training a logistic regression model, we will evaluate the model against a hold-out test dataset, and save the classification evaluation metrics, including precision, recall, and F1 score for each label, and accuracy and ROC AUC for the model.