Improve binning in binarize() #44

Open
cthoyt opened this issue Jan 9, 2022 · 0 comments

cthoyt commented Jan 9, 2022

The current binarize function uses a cutoff of 0.5 for binarization:

rexmex/rexmex/utils.py

Lines 28 to 34 in 3e26652

def metric_wrapper(*args, **kwargs):
    # TODO: Move to optimal binning. Youden’s J statistic.
    y_score = args[1]
    y_score[y_score < 0.5] = 0
    y_score[y_score >= 0.5] = 1
    score = metric(*args, **kwargs)
    return score

This is an issue for PyKEEN, where the scores that come from a model could all lie in the range [-5, -2], in which case a cutoff of 0.5 maps every score to 0. The current TODO text says to use Youden's J statistic (https://en.wikipedia.org/wiki/Youden%27s_J_statistic), but it's not clear how that would be used.
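One possible reading of the TODO, offered as a sketch and not as rexmex's planned implementation: Youden's J at a cutoff is the true positive rate minus the false positive rate, so a data-driven threshold can be chosen by maximizing J over the observed scores. The function names `youden_threshold` and `binarize_scores` below are hypothetical, and the code assumes binary labels containing at least one positive and one negative.

```python
import numpy as np


def youden_threshold(y_true, y_score):
    """Return the score cutoff that maximizes Youden's J = TPR - FPR.

    Hypothetical helper; assumes binary labels with both classes present.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = y_true.sum()
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need both positive and negative labels")

    best_j, best_t = -np.inf, None
    # Every distinct score is a candidate cutoff (predict 1 when score >= t).
    for t in np.unique(y_score):
        pred = y_score >= t
        tpr = (pred & y_true).sum() / n_pos
        fpr = (pred & ~y_true).sum() / n_neg
        j = tpr - fpr
        if j > best_j:
            best_j, best_t = j, t
    return best_t


def binarize_scores(y_true, y_score):
    """Binarize scores at the Youden-optimal cutoff, returning a new array."""
    threshold = youden_threshold(y_true, y_score)
    return (np.asarray(y_score, dtype=float) >= threshold).astype(int)


# Scores entirely below 0.5, as in the PyKEEN case described above.
labels = np.array([0, 0, 1, 1])
scores = np.array([-5.0, -4.0, -3.0, -2.0])
print(youden_threshold(labels, scores))
print(binarize_scores(labels, scores))
```

Unlike the fixed 0.5 cutoff, this requires the true labels to pick the threshold, and it returns a new array instead of overwriting `y_score` in place.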

As an alternative, the NetMF package implements the following code for constructing an indicator that might be more applicable (though I don't personally recognize what method this is, and unfortunately it's not documented):

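import numpy as np

def construct_indicator(y_score, y):
    # rank the labels by the scores directly
    # number of true labels per row (builtin int replaces the removed np.int alias)
    num_label = np.sum(y, axis=1, dtype=int)
    # column indices of each row, ordered from highest to lowest score
    y_sort = np.fliplr(np.argsort(y_score, axis=1))
    y_pred = np.zeros_like(y, dtype=int)
    for i in range(y.shape[0]):
        # mark the num_label[i] highest-scoring columns of row i as positive
        for j in range(num_label[i]):
            y_pred[i, y_sort[i, j]] = 1
    return y_pred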
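
Reading the loop above, the indicator appears to mark, for each row, as many top-scoring columns as that row has true labels, so no absolute cutoff is involved and scores in any range can be handled. The sketch below is a vectorized version of that reading on a small example; `top_k_indicator` is a hypothetical name, and its behavior on tied scores may differ from the loop-based version.

```python
import numpy as np


def top_k_indicator(y_score, y_true):
    """Per row, set the k highest-scoring columns to 1, where k is the
    number of true labels in that row. Hypothetical helper."""
    y_score = np.asarray(y_score, dtype=float)
    y_true = np.asarray(y_true)
    k = y_true.sum(axis=1).astype(int)        # positives per row
    order = np.argsort(-y_score, axis=1)      # columns, best score first
    # rank[i, j] is the position of column j in row i's ordering
    rank = np.empty_like(order)
    rows = np.arange(y_score.shape[0])[:, None]
    rank[rows, order] = np.arange(y_score.shape[1])[None, :]
    return (rank < k[:, None]).astype(int)


# Scores far from [0, 1]; a fixed 0.5 cutoff would map all of them to 0.
y_score = np.array([[-2.0, -5.0, -3.0],
                    [-4.0, -2.5, -4.5]])
y_true = np.array([[1, 1, 0],
                   [0, 0, 1]])
print(top_k_indicator(y_score, y_true))
```

Because k is taken from the true labels, the predicted and true label counts always match per row, which is worth keeping in mind when interpreting metrics computed on the result.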