
Support multiple validation datasets when dataloader_persistent_workers=True #30627

Open · wants to merge 1 commit into base: main

Conversation

bastienlc

What does this PR do?

Fixes #30527

This PR adds support for multiple validation datasets in transformers.Trainer when dataloader_persistent_workers=True. Currently, the cached evaluation dataloader is reused regardless of which dataset is requested, so every evaluation runs on the first validation dataset.

I had to pass the validation_dataset_name to the evaluate and get_eval_dataloader methods so that the right validation dataloader could be retrieved.
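To make the fix concrete, here is a minimal, self-contained sketch of the idea (not the actual Trainer code; the class, method, and attribute names are illustrative, and a plain list stands in for a torch DataLoader): cache one persistent eval dataloader per validation-dataset name, instead of a single cached loader that always points at the first dataset.

```python
class TrainerSketch:
    """Illustrative stand-in for Trainer's eval-dataloader caching."""

    def __init__(self, eval_datasets, persistent_workers=True):
        # eval_datasets: dict mapping a name to a dataset
        self.eval_datasets = eval_datasets
        self.persistent_workers = persistent_workers
        # Cache keyed by dataset name, so each named validation set
        # gets (and keeps) its own dataloader.
        self._eval_dataloaders = {}

    def get_eval_dataloader(self, dataset_name):
        # Reuse the cached loader for *this* name; the buggy behavior
        # was a single cached loader returned for every name.
        if self.persistent_workers and dataset_name in self._eval_dataloaders:
            return self._eval_dataloaders[dataset_name]
        # Stand-in for constructing a torch DataLoader with
        # persistent_workers=True.
        dataloader = list(self.eval_datasets[dataset_name])
        if self.persistent_workers:
            self._eval_dataloaders[dataset_name] = dataloader
        return dataloader
```

With this shape, evaluating on "val_a" then "val_b" returns the right data for each, and repeated calls for the same name reuse the same (persistent) loader.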

Who can review?

@muellerzr

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


github-actions bot commented Jun 2, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Contributor

@muellerzr muellerzr left a comment


Hi! Terribly sorry for missing this.

From an API perspective, I think we can be even simpler than this, however.

What if we made eval_dataset an Optional[Union[Dataset, str]] and specified how a str dataset key is used/passed? I don't think we need to go as far as a new argument. WDYT?

cc @amyeroberts also for thoughts.

And terribly sorry this went under the radar for so long; I appreciate the ping in the issue.
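A rough sketch of that suggested API, assuming the Trainer was constructed with a dict of named eval datasets: a str argument selects a named dataset, an explicit dataset is used as-is, and None falls back to the default. The class name and the plain-list stand-in for a Dataset are illustrative, not the real Trainer signature.

```python
from typing import Optional, Union


class TrainerApiSketch:
    """Illustrative stand-in for the suggested eval_dataset API."""

    def __init__(self, eval_dataset):
        # eval_dataset: a single dataset, or a dict mapping name -> dataset
        self.eval_dataset = eval_dataset

    def get_eval_dataloader(
        self, eval_dataset: Optional[Union[list, str]] = None
    ):
        if isinstance(eval_dataset, str):
            # A str is a key into the dict passed at init.
            dataset = self.eval_dataset[eval_dataset]
        elif eval_dataset is not None:
            # An explicit dataset is used directly.
            dataset = eval_dataset
        else:
            # None falls back to the dataset passed at init.
            dataset = self.eval_dataset
        # Stand-in for building a torch DataLoader.
        return list(dataset)
```

This keeps a single eval_dataset parameter on evaluate/get_eval_dataloader rather than adding a new name argument.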

@amyeroberts
Collaborator

@muellerzr I like the sound of this solution :)

@bastienlc
Author

Hi, sure, that would be better. Something like what I just pushed?

Or maybe we'd be better off hashing the evaluation dataset, the way datasets does for caching? That would avoid having to check multiple cases.
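The hashing idea might look roughly like this: a hypothetical helper that derives a stable cache key from the dataset contents, so the dataloader cache can be keyed without special-casing names, dicts, or bare datasets. This sketch uses pickle + SHA-256 for simplicity; the datasets library uses its own fingerprinting machinery.

```python
import hashlib
import pickle


def dataset_cache_key(dataset) -> str:
    # Serialize the dataset and hash the bytes. Equal contents yield
    # equal keys, so the same dataset always hits the same cached
    # dataloader; different datasets get distinct keys.
    payload = pickle.dumps(dataset)
    return hashlib.sha256(payload).hexdigest()
```

One caveat with this approach: serializing a large dataset just to compute a key can be expensive, which is part of why real fingerprinting schemes hash transformations rather than raw contents.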

@amyeroberts
Collaborator

@bastienlc The current state LGTM. Let's get @muellerzr's opinion though, as he knows the trainer better than me :)

Contributor

@muellerzr muellerzr left a comment


Thanks for the refactor! The change indeed looks good to me here. Can you rebase from main so we can check that the CI is ✅?

@muellerzr
Contributor

Great! cc @amyeroberts for final review :)

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for adding!

The only thing to add is a test to make sure evaluate behaves as expected for the different types of eval_dataset.

Successfully merging this pull request may close these issues.

Multiple validation datasets unsupported with dataloader_persistent_workers=True