
Support multiple validation datasets when dataloader_persistent_workers=True #30627

Open · wants to merge 1 commit into base: main

Conversation

bastienlc

What does this PR do?

Fixes #30527

This PR adds support for multiple validation datasets in transformers.Trainer when dataloader_persistent_workers=True. Currently, the cached evaluation dataloader is reused regardless of which dataset is requested, so every evaluation runs on the first validation dataset.

I had to pass the validation_dataset_name to the evaluate and get_eval_dataloader methods so that the right validation dataloader could be retrieved.
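To make the fix concrete, here is a minimal, self-contained sketch of the idea (not the actual Trainer code; the class, method, and attribute names are illustrative, and a plain list stands in for a torch DataLoader): cache one persistent eval dataloader per validation-dataset name, instead of a single cached loader that always points at the first dataset.

```python
class TrainerSketch:
    """Illustrative stand-in for Trainer's eval-dataloader caching."""

    def __init__(self, eval_datasets, persistent_workers=True):
        # eval_datasets: dict mapping a name to a dataset
        self.eval_datasets = eval_datasets
        self.persistent_workers = persistent_workers
        # Cache keyed by dataset name, so each named validation set
        # gets (and keeps) its own dataloader.
        self._eval_dataloaders = {}

    def get_eval_dataloader(self, dataset_name):
        # Reuse the cached loader for *this* name; the buggy behavior
        # was a single cached loader returned for every name.
        if self.persistent_workers and dataset_name in self._eval_dataloaders:
            return self._eval_dataloaders[dataset_name]
        # Stand-in for constructing a torch DataLoader with
        # persistent_workers=True.
        dataloader = list(self.eval_datasets[dataset_name])
        if self.persistent_workers:
            self._eval_dataloaders[dataset_name] = dataloader
        return dataloader
```

With this shape, evaluating on "val_a" then "val_b" returns the right data for each, and repeated calls for the same name reuse the same (persistent) loader.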

Who can review?

@muellerzr

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


github-actions bot commented Jun 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Contributor

@muellerzr muellerzr left a comment


Hi! Terribly sorry for missing this.

From an API perspective, I think we can be even simpler than this, however.

What if we made eval_dataset an Optional[Union[Dataset, str]] and specified how a str dataset key is used/passed? I don't think we need to go as far as a new argument. WDYT?

cc @amyeroberts also for thoughts.

And terribly sorry this went under the radar for so long; I appreciate the ping in the issue.
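A rough sketch of that suggested API, assuming the Trainer was constructed with a dict of named eval datasets: a str argument selects a named dataset, an explicit dataset is used as-is, and None falls back to the default. The class name and the plain-list stand-in for a Dataset are illustrative, not the real Trainer signature.

```python
from typing import Optional, Union


class TrainerApiSketch:
    """Illustrative stand-in for the suggested eval_dataset API."""

    def __init__(self, eval_dataset):
        # eval_dataset: a single dataset, or a dict mapping name -> dataset
        self.eval_dataset = eval_dataset

    def get_eval_dataloader(
        self, eval_dataset: Optional[Union[list, str]] = None
    ):
        if isinstance(eval_dataset, str):
            # A str is a key into the dict passed at init.
            dataset = self.eval_dataset[eval_dataset]
        elif eval_dataset is not None:
            # An explicit dataset is used directly.
            dataset = eval_dataset
        else:
            # None falls back to the dataset passed at init.
            dataset = self.eval_dataset
        # Stand-in for building a torch DataLoader.
        return list(dataset)
```

This keeps a single eval_dataset parameter on evaluate/get_eval_dataloader rather than adding a new name argument.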

@amyeroberts
Collaborator

@muellerzr I like the sound of this solution :)

@bastienlc
Author

Hi, sure, that would be better. Something like what I just pushed?

Or maybe we'd be better off hashing the evaluation dataset, the way datasets does for caching? That would avoid having to check multiple cases.
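The hashing idea might look roughly like this: a hypothetical helper that derives a stable cache key from the dataset contents, so the dataloader cache can be keyed without special-casing names, dicts, or bare datasets. This sketch uses pickle + SHA-256 for simplicity; the datasets library uses its own fingerprinting machinery.

```python
import hashlib
import pickle


def dataset_cache_key(dataset) -> str:
    # Serialize the dataset and hash the bytes. Equal contents yield
    # equal keys, so the same dataset always hits the same cached
    # dataloader; different datasets get distinct keys.
    payload = pickle.dumps(dataset)
    return hashlib.sha256(payload).hexdigest()
```

One caveat with this approach: serializing a large dataset just to compute a key can be expensive, which is part of why real fingerprinting schemes hash transformations rather than raw contents.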

@amyeroberts
Collaborator

@bastienlc The current state LGTM. Let's get @muellerzr's opinion though, as he knows the trainer better than me :)

Contributor

@muellerzr muellerzr left a comment


Thanks for the refactor! The change indeed looks good to me here. Can you rebase from main so we can check that the CI is ✅?

@muellerzr
Contributor

Great! cc @amyeroberts for final review :)

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding!

The only thing to add is a test to make sure evaluate behaves as expected for the different types of eval_dataset.

Successfully merging this pull request may close these issues.

Multiple validation datasets unsupported with dataloader_persistent_workers=True