cafeen

This repository presents the approach used for solving the Kaggle Categorical Feature Encoding Challenge II.

Cross-validation scheme

To validate the results, I divided the train dataset (600,000 rows) into two sets of 300,000 rows each. I repeated this split 4 times with a different random seed each time and computed the CV score as the mean over the 4 iterations.

from sklearn.metrics import roc_auc_score
from cafeen import config, steps

scores = []

for seed in [0, 1, 2, 3]:
    # read data from files
    train_x, test_x, train_y, test_y, test_id = steps.make_data(
        path_to_train=config.path_to_train,
        seed=seed,
        drop_features=['bin_3'])
    # apply encoders
    train_x, test_x = steps.encode(train_x, train_y, test_x, is_val=True)
    # apply estimator
    predicted = steps.train_predict(train_x, train_y, test_x)
    # compute ROC AUC score
    scores += [roc_auc_score(test_y.values, predicted)]

Encoding pipeline

The full encoding pipeline can be seen here.

Score improvements

Baseline

As a baseline model, I used logistic regression with default parameters and the liblinear solver. All features in the dataset are one-hot encoded.
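
The baseline itself is not shown in the snippets above; a minimal sketch of this setup, assuming pandas get_dummies for the one-hot encoding and the train_x/test_x frames produced by steps.make_data:

import pandas as pd
from sklearn.linear_model import LogisticRegression

# one-hot encode train and test together so both end up with the same columns
encoded = pd.get_dummies(
    pd.concat([train_x, test_x]), columns=list(train_x.columns))
train_enc, test_enc = encoded[:len(train_x)], encoded[len(train_x):]

baseline = LogisticRegression(solver='liblinear')
baseline.fit(train_enc, train_y)
predicted = baseline.predict_proba(test_enc)[:, 1]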

CV: 0.78130, private score: 0.78527

Tuning hyperparameters

After hyperparameter optimization, I found that the following configuration yields the highest CV score.

from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(
    C=0.049,
    class_weight={0: 1, 1: 1.42},
    solver='liblinear',
    fit_intercept=True,
    penalty='l2')
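
How the search itself was run is not shown in the repository; a rough sketch, assuming a plain scikit-learn GridSearchCV over C and the positive class weight, applied to the encoded training data from the validation code above:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {
    'C': [0.01, 0.049, 0.1, 1.0],
    'class_weight': [{0: 1, 1: w} for w in (1.0, 1.42, 2.0)],
}

search = GridSearchCV(
    LogisticRegression(solver='liblinear', penalty='l2', fit_intercept=True),
    param_grid,
    scoring='roc_auc',
    cv=3)
search.fit(train_x, train_y)
print(search.best_params_, search.best_score_)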

CV: 0.78519, private score: 0.78704

Drop bin_3

I dropped the bin_3 feature, as it does not appear to be important and keeping it in the dataset doesn't improve the score.

CV: 0.78520, private score: 0.78704

Ordinal encoding

I used ordinal encoding for ord_0, ord_1, ord_4, and ord_5, approximating the per-category target mean with a linear function. For ord_4 and ord_5 I removed outliers (categories with a small number of observations) before fitting the linear regression.
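
A sketch of what such an encoder might look like (a hypothetical helper, not the exact implementation from the repository): drop rare categories from the fit, regress the per-category target mean on the category rank, and map every remaining category to the fitted value.

import numpy as np
from sklearn.linear_model import LinearRegression

def encode_ordinal(x, y, feature, min_count=1):
    # per-category target mean and count, categories kept in their natural order
    stats = y.groupby(x[feature]).agg(['mean', 'count'])
    fit_on = stats[stats['count'] >= min_count]

    # approximate the target mean as a linear function of the category rank
    ranks = np.arange(len(fit_on)).reshape(-1, 1)
    fitted = LinearRegression().fit(ranks, fit_on['mean'].values)

    # rare categories are absent from the mapping and become NaN
    mapping = dict(zip(fit_on.index, fitted.predict(ranks)))
    return x[feature].map(mapping)

# usage with a hypothetical outlier threshold
x['ord_5'] = encode_ordinal(x, y, 'ord_5', min_count=50)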

CV: 0.78582, private score: 0.78727

Grouping

For the nom_6 feature I replaced all categories with fewer than 90 observations with NaN. Then, using K-Fold target encoding, I converted it to numeric and binned it into three groups with qcut, as shown below.
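
The filtering and encoding steps preceding the qcut call are not shown; a minimal sketch, assuming the training features x and target y are aligned pandas objects and using an out-of-fold target mean as the K-Fold target encoding:

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# replace rare categories (fewer than 90 observations) with NaN
counts = x['nom_6'].value_counts()
x.loc[x['nom_6'].isin(counts[counts < 90].index), 'nom_6'] = np.nan

# out-of-fold target mean, so a row is never encoded with its own target
encoded = pd.Series(np.nan, index=x.index)
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(x):
    means = y.iloc[fit_idx].groupby(x['nom_6'].iloc[fit_idx]).mean()
    encoded.iloc[enc_idx] = x['nom_6'].iloc[enc_idx].map(means).values

x['nom_6'] = encoded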

import pandas as pd

x['nom_6'] = pd.qcut(x['nom_6'], 3, labels=False, duplicates='drop')

CV: 0.78691, private score: 0.78796

Filtering

For the nom_9 feature I replaced all categories with fewer than 60 observations with NaN and combined categories that have an equal target average.
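
A sketch of one way to implement this (hypothetical, not the repository's exact code): mapping each surviving category to its rounded target mean makes equal-mean categories collapse onto the same value.

import numpy as np

# replace rare nom_9 categories (fewer than 60 observations) with NaN
counts = x['nom_9'].value_counts()
x.loc[x['nom_9'].isin(counts[counts < 60].index), 'nom_9'] = np.nan

# categories sharing the same (rounded) target average end up with one value
target_mean = y.groupby(x['nom_9']).mean().round(5)
x['nom_9'] = x['nom_9'].map(target_mean)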

CV: 0.78691, private score: 0.78797

Missing values

For the one-hot encoded features (all features except ord_0, ord_1, ord_4, ord_5), I replaced missing values with -1. For the ordinal encoded features, I replaced them with the target probability, 0.18721.
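
A minimal sketch of this imputation, assuming all features sit in a single pandas frame x:

# ordinal encoded features get the overall target probability, the rest get -1
ordinal = ['ord_0', 'ord_1', 'ord_4', 'ord_5']
x[ordinal] = x[ordinal].fillna(0.18721)
x = x.fillna(-1)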

Results

That's it, though I didn't choose the best submission for the final score, so the official results are a bit worse.

Private score: 0.78795 (110th place), public score: 0.78669 (22nd place).
