k-fold CV #1574

mmosc · 2022-12-06T13:50:18Z

Hi!

For my own research, I implemented k-fold CV in your library. Have a look and let me know if it looks good for you! Maybe someone else will also need it.

Cheers,
Marta

AoiDragon · 2022-12-12T02:15:03Z

@mmosc Hello, thanks for your attention to RecBole.

I tested your code and found some questions. K-fold CV means seperate the dataset into K groups and chooses one group as the test set in turn. However, I found that your code only divides the dataset and fixes the test set. So actually the code works like a RS split which have a different split ratio.

mmosc · 2022-12-15T15:02:31Z

Hi @AoiDragon !
Thanks for having a look at this.

Actually when calling data_preparation in utils.py, the data gets split in folds, and depending on the value of k = config["eval_args"]["fold"] in the overall.yaml, those get rolled in a round robin fashion. After rolling, valid_dataset becomes the second to last fold, and test_dataset becomes the last fold.

So overall, depending on config["eval_args"]["fold"], you are choosing different folds for val and test.

Do you agree on this?

AoiDragon · 2022-12-18T03:36:17Z

Hello @mmosc ,

In my understanding, your method need to call data_preparation mutiple times. However, in default quick_start the data_preparation will only be called once, so under default configuration it just divides the dataset once and fixes train, valid and test set. This may mislead users and I didn't see code about this part, so I'm afraid this version of your commit is unsuitable for merging.

mmosc · 2022-12-18T11:55:36Z

Hi @AoiDragon ,
When calling data_preparation with KF as parameter, the following code gets executed:

        if list(config["eval_args"]["split"].keys())[0] == 'KF':
            print("==KF")
            folds = dataset.build()# data_preparation(config, dataset)
            n_folds = len(folds)
            print(n_folds)
            k = config["eval_args"]["fold"]
            folds = folds[k:] + folds[:k]

The k folds are generated when calling dataset.build(), in which the following happens:

        elif split_mode == "KF":
            """
            Will return n_folds datasets
            """
            if not isinstance(split_args["KF"], list):
                raise ValueError(f'The value of "KF" [{split_args}] should be a list.')
            if group_by is None or group_by.lower() == "none":
                datasets = self.split_by_ratio(split_args["KF"], group_by=None)
            elif group_by == "user":
                datasets = self.split_by_ratio(
                    split_args["KF"], group_by=self.uid_field
                )
            else:
                raise NotImplementedError(
                    f"The grouping method [{group_by}] has not been implemented."
                )

This is where split_by_ratio is called (only once) with the percentages of the k-folds as argument. This is the part that will return as many folds as passed by the list with percentages. Calling it only once suffices.

mmosc added 5 commits December 2, 2022 10:46

feat: add k-fold cv.

ae0b846

edit: Add instructions for K-fold CV in the README.

5878883

edit: Edit README.

a5bf822

fix: bug in k-fold.

6fe4f96

edit: Change fold

732fa1f

AoiDragon self-assigned this Dec 12, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

k-fold CV #1574

k-fold CV #1574

mmosc commented Dec 6, 2022

AoiDragon commented Dec 12, 2022

mmosc commented Dec 15, 2022

AoiDragon commented Dec 18, 2022

mmosc commented Dec 18, 2022 •

edited

k-fold CV #1574

Are you sure you want to change the base?

k-fold CV #1574

Conversation

mmosc commented Dec 6, 2022

AoiDragon commented Dec 12, 2022

mmosc commented Dec 15, 2022

AoiDragon commented Dec 18, 2022

mmosc commented Dec 18, 2022 • edited

mmosc commented Dec 18, 2022 •

edited