
Fix incorrect test set values in leave_k_out splits with sparse user rows #640

Open · wants to merge 3 commits into main
Conversation

chrisjkuch

Closes #639

This PR fixes a bug in the evaluation of leave_k_out_split in which the produced test matrix would contain values that were many multiples of their original values. Tests are also added on static (non-random) matrices that fail with the uncorrected implementation.

This bug resulted from a calculation that required an input array with consecutive values; providing non-consecutive values led to incorrect results.

Specifically, the arr argument in _take_tails

----------
arr: ndarray
    The input array. This should be an array of integers in the range 0->n, where
    the ordered unique set of integers in said array should produce an array of
    consecutive integers. Concretely, the array [1, 0, 1, 1, 0, 3] would be invalid,
    but the array [1, 0, 1, 1, 0, 2] would not be.

was being provided as candidate_users, from which user indices falling below the threshold had been removed. The result was an array whose ordered set of unique integers was not consecutive, making it invalid input.
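
For illustration, here is a minimal sketch of the kind of remapping that restores this precondition (not the PR's exact change; everything other than _take_tails and candidate_users is a placeholder), using the docstring's own example values:

    import numpy as np

    # Sketch only: remap the filtered, non-consecutive candidate_users values onto
    # the consecutive 0..n range that _take_tails expects, then translate back.
    candidate_users = np.array([1, 0, 1, 1, 0, 3])   # invalid per the docstring above
    unique_ids, consecutive = np.unique(candidate_users, return_inverse=True)
    print(consecutive)                               # [1 0 1 1 0 2] -- now valid input
    # ...pass `consecutive` to _take_tails, and map any returned user ids back
    # with unique_ids[...] if needed.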

@chrisjkuch
Author

It looks like a single test failed in one of the builds:

=================================== FAILURES ===================================
________________ test_leave_k_out_gets_correct_train_only_shape ________________

    def test_leave_k_out_gets_correct_train_only_shape():
        """Test that the correct number of users appear *only* in the train set."""
    
        mat = _get_matrix()
        train, test = leave_k_out_split(mat, K=1, train_only_size=0.8)
        train_only = ~np.isin(np.unique(train.tocoo().row), test.tocoo().row)
    
>       assert train_only.sum() == int(train.shape[0] * 0.8)
E       assert 81 == 80

Given that the test is failing in only a single build and passing in all others, my guess is that the randomly generated matrix contained a completely empty row, which caused that user to be included in the training set in addition to all the other randomly chosen users. I think the best solution is to add a check to _get_matrix() that ensures there aren't any completely zero rows.
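
A minimal sketch of that check (illustrative only; the real _get_matrix helper in the test suite may construct the matrix differently):

    import numpy as np
    from scipy.sparse import random as sparse_random

    # Sketch: rejection-sample until the random matrix has no completely zero rows,
    # so empty users cannot skew the train_only accounting in the test.
    def _get_matrix(n=100, density=0.05, seed=23):
        rng = np.random.RandomState(seed)
        while True:
            mat = sparse_random(n, n, density=density, format="csr", random_state=rng)
            if np.all(np.diff(mat.indptr) > 0):      # every row has >= 1 nonzero
                return mat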

@chrisjkuch
Author

chrisjkuch commented Jan 12, 2023

Hmmm, even after making the fix to ensure always-populated rows, the test is still failing intermittently, both for the random 100x100 sparse matrix and for the newly added fixed matrix. My guess is that this comes from the combination of the candidate_mask with the train_only mask: we assume all users are eligible to be included in the test set when this is not the case, so more end up in the train set than we plan for?

Usually, the number of train-only users is only slightly off from the chosen value. @benfred, would an almostEquals check here with a delta of ~5 or so suffice?
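
For concreteness, a sketch of what that relaxed assertion could look like (the delta of 5 is the rough slack mentioned above, not a value from the library; this reuses the _get_matrix sketch from the earlier comment and assumes leave_k_out_split is imported from implicit.evaluation):

    import numpy as np
    from implicit.evaluation import leave_k_out_split

    def test_leave_k_out_gets_approximately_correct_train_only_shape():
        """Allow a small tolerance on the number of train-only users."""
        mat = _get_matrix()                           # helper sketched above
        train, test = leave_k_out_split(mat, K=1, train_only_size=0.8)
        train_only = ~np.isin(np.unique(train.tocoo().row), test.tocoo().row)
        expected = int(train.shape[0] * 0.8)
        assert abs(train_only.sum() - expected) <= 5  # hypothetical delta of ~5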

@chrisjkuch
Author

@benfred Looks like this last failing test is addressed by #652, and some of the fixes in this PR are in effect duplicated by #653. Let me know if / how you'd like to proceed in fixing the functionality of leave_k_out_split, happy to help in any way I can.

Successfully merging this pull request may close these issues.

leave_k_out_split produces incorrect values in test set