
Fix incorrect test set values in leave_k_out splits with sparse user rows #640

Open · wants to merge 3 commits into main
Conversation

chrisjkuch

Closes #639

This PR fixes a bug in the evaluation of leave_k_out_split in which the produced test matrix would contain values that were many multiples of their original values. Tests are also added on static (non-random) matrices that fail with the uncorrected implementation.

This bug resulted from a calculation that required an input array with consecutive values; providing non-consecutive values led to incorrect results.

Specifically, the arr argument in _take_tails

----------
arr: ndarray
    The input array. This should be an array of integers in the range 0->n, where
    the ordered unique set of integers in said array should produce an array of
    consecutive integers. Concretely, the array [1, 0, 1, 1, 0, 3] would be invalid,
    but the array [1, 0, 1, 1, 0, 2] would not be.

was being provided as candidate_users, from which user indices falling below the threshold had been removed. The result was an array whose ordered set of unique integers was not consecutive, making it invalid input.
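
For illustration, here is a minimal sketch of the kind of remapping that restores this precondition (not the PR's exact change; everything other than _take_tails and candidate_users is a placeholder), using the docstring's own example values:

    import numpy as np

    # Sketch only: remap the filtered, non-consecutive candidate_users values onto
    # the consecutive 0..n range that _take_tails expects, then translate back.
    candidate_users = np.array([1, 0, 1, 1, 0, 3])   # invalid per the docstring above
    unique_ids, consecutive = np.unique(candidate_users, return_inverse=True)
    print(consecutive)                               # [1 0 1 1 0 2] -- now valid input
    # ...pass `consecutive` to _take_tails, and map any returned user ids back
    # with unique_ids[...] if needed.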

@chrisjkuch
Author

It looks like a single test failed in one of the builds:

=================================== FAILURES ===================================
________________ test_leave_k_out_gets_correct_train_only_shape ________________

    def test_leave_k_out_gets_correct_train_only_shape():
        """Test that the correct number of users appear *only* in the train set."""
    
        mat = _get_matrix()
        train, test = leave_k_out_split(mat, K=1, train_only_size=0.8)
        train_only = ~np.isin(np.unique(train.tocoo().row), test.tocoo().row)
    
>       assert train_only.sum() == int(train.shape[0] * 0.8)
E       assert 81 == 80

Given that the test is failing in only a single build and passing in all others, my guess is that the randomly generated matrix contained a completely empty row, which caused that user to be included in the training set in addition to all the other randomly chosen users. I think the best solution is to add a check to _get_matrix() that ensures there aren't any completely zero rows.
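
A minimal sketch of that check (illustrative only; the real _get_matrix helper in the test suite may construct the matrix differently):

    import numpy as np
    from scipy.sparse import random as sparse_random

    # Sketch: rejection-sample until the random matrix has no completely zero rows,
    # so empty users cannot skew the train_only accounting in the test.
    def _get_matrix(n=100, density=0.05, seed=23):
        rng = np.random.RandomState(seed)
        while True:
            mat = sparse_random(n, n, density=density, format="csr", random_state=rng)
            if np.all(np.diff(mat.indptr) > 0):      # every row has >= 1 nonzero
                return mat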

@chrisjkuch
Author

chrisjkuch commented Jan 12, 2023

Hmmm, even after making the fix to ensure always-populated rows, the test is still failing intermittently, both for the random 100x100 sparse matrix and for the newly added fixed matrix. My guess is that this comes from the combination of the candidate_mask with the train_only mask: we assume all users are eligible to be included in the test set when this is not the case, so more end up in the train set than we plan for?

Usually, the number of train-only users is only slightly off from the chosen value. @benfred, would an almostEquals check here with a delta of ~5 or so suffice?
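
For concreteness, a sketch of what that relaxed assertion could look like (the delta of 5 is the rough slack mentioned above, not a value from the library; this reuses the _get_matrix sketch from the earlier comment and assumes leave_k_out_split is imported from implicit.evaluation):

    import numpy as np
    from implicit.evaluation import leave_k_out_split

    def test_leave_k_out_gets_approximately_correct_train_only_shape():
        """Allow a small tolerance on the number of train-only users."""
        mat = _get_matrix()                           # helper sketched above
        train, test = leave_k_out_split(mat, K=1, train_only_size=0.8)
        train_only = ~np.isin(np.unique(train.tocoo().row), test.tocoo().row)
        expected = int(train.shape[0] * 0.8)
        assert abs(train_only.sum() - expected) <= 5  # hypothetical delta of ~5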

@chrisjkuch
Author

@benfred Looks like this last failing test is addressed by #652, and some of the fixes in this PR are in effect duplicated by #653. Let me know if / how you'd like to proceed in fixing the functionality of leave_k_out_split, happy to help in any way I can.

Successfully merging this pull request may close these issues.

leave_k_out_split produces incorrect values in test set