target statistic encoding

Install

from pypi

pip install target-statistic-encoding

from source

python -m pip install git+https://github.com/CircArgs/target_statistic_encoding.git

What?

There are many means to convert categorical features to numeric ones from one-hot to embeddings. Then there are target statistic methods. These methods take statistics based on the target feature.

Why?

Even within this simple technique there is variation in implementations. Some implement a time-mimicking approach such as Catboost to gain robustness over target leakage. However, one issue with this approach is that while it introduces some variation to the encoding, for a some samples the statistic is possibly excessively biased. This small package takes a different approach for this reason. Instead, it uses stratified folds of the training set and aggregates target statistics on each fold independently.

Benefits of this implementation

stratified split target statistic helps prevent target leakage thus making your models more robust
credibility factor allows categories with low support to be ignored additionally making your models more robust
clean api
variety of target statistic functions in addition to allowing custom implemented ones

easy productionalization - everything is 100% serializable with pickle ex.

#save for prod/test time environment
pd.to_pickle(cat2num, "cat2num_for_production.pkl")

#read into prod env
cat2num=pd.read_pickle("cat2num_for_production.pkl")
...
model.predict(cat2num.transform(prod_data))

How?

This is just a simple utility library that performs the following sample operation: See this example notebook

keep in mind this is simply an example. The example target is random here so no real signal is expected

API

Instantiate

Init signature:
Cat2Num(
    cat_vars: List[str],
    target_var: str,
    stat_func: target_statistic_encoding.stat_funcs.stat_funcs._StatFunc = <function mean.<locals>.stat_func at 0x7fea58a85950>,
)
Args:
    cat_vars (List[str]): a list of strings representing the categorical features to be encoded
    target_var (str): string of the name of the target feature in `data`
    stat_func (optional Function(*args, **kwargs) -> Function({pd.Series, pd.DataFrameGroupBy}) -> {float, pd.Series})): function which returns a closure to aggregate statistics over target - default stat_funcs.mean()

fit

prefer`.fit_transform` on your training set

Note: running .fit followed by .transform on your training set is not equivalent to simply running .fit_transform. There wil be no differentiation amongst category statistics as they will all be mapped to the mean.

cat2num.fit_transform(
    data: pandas.core.frame.DataFrame,
    split: str = None,
    n_splits: int = 5,
    credibility: Union[float, int] = 0,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    split (str): name of a column to use in the data for folding the data.
        - if this is use then n_splits is ignored
    n_splits (int): number of splits to use for target statistic
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`

cat2num.fit(
    data: pandas.core.frame.DataFrame,
    credibility: Union[float, int] = 0,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to fit numeric target statistic from
    credibility (float or int):
        - if float must be in [0, 1] as % of fitting data considered credible to fit statistic to
        - if int must be >=0 as number of records in fitting data level must exist within to be credible
        - levels not above this threshold will be given the overall target mean

Returns:
    fit Cat2Num instance

use `.transform` on your non-training set

cat2num.transform(
    data: pandas.core.frame.DataFrame,
    drop: bool = False,
    suffix: str = '_Cat2Num',
    inplace: bool = False,
)

Args:
    data (pd.DataFrame): pandas dataframe with categorical features to convert to numeric target statistic
    drop (bool): drop the original columns
    suffix (str): a string to append to the end of an encoded column, default `'_Cat2Num'`
    inplace (bool): whether the transformation should be done inplace or return the transformed data, default `False`

Returns:
    the passed dataframe with encoded columns added if inplace is `False` else `None`

Custom target statistic functions

You may optionally opt for a target statistic based on a statistic other than the mean although this is usually unwanted/unnecessary.

Several are included and you can implement your own with a few considerations.

Given:

mean (target_statistic_encoding.stat_funcs.Mean()) - the default
median (target_statistic_encoding.stat_funcs.Median())
std (target_statistic_encoding.stat_funcs.Std())
var (target_statistic_encoding.stat_funcs.Var())
quantile (target_statistic_encoding.stat_funcs.Quantile(quantile=0.5))

Implement your own:

You may optionally implement your own target statistic function. It must be a callable that operates on the pandas.core.groupby.DataFrameGroupby type i.e. the result of a pandas.DataFrame.groupby e.g.: something akin to

	target
X1
a	0.287356
b	0.298795
c	0.336879
d	0.287037

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
.github		.github
assets		assets
examples		examples
target_statistic_encoding		target_statistic_encoding
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
README.rst		README.rst
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

target statistic encoding

Table of contents:

Install

from pypi

from source

What?

Why?

Benefits of this implementation

How?

API

Instantiate

fit

prefer`.fit_transform` on your training set

use `.transform` on your non-training set

Custom target statistic functions

Given:

Implement your own:

About

Releases

Packages

Languages

License

CircArgs/target_statistic_encoding

Folders and files

Latest commit

History

Repository files navigation

target statistic encoding

Table of contents:

Install

from pypi

from source

What?

Why?

Benefits of this implementation

How?

API

Instantiate

fit

prefer.fit_transform on your training set

use .transform on your non-training set

Custom target statistic functions

Given:

Implement your own:

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

prefer`.fit_transform` on your training set

use `.transform` on your non-training set

Packages