
Unable to use records_per_epoch for scheduling? [question] #6869

Open
charles-viss opened this issue May 17, 2023 · 9 comments

@charles-viss

Describe your question

According to the docs, records_per_epoch can be used to schedule validation and checkpoint frequencies in conjunction with the epoch scheduling unit in the config file: https://docs.determined.ai/latest/reference/reference-training/experiment-config-reference.html?highlight=gc%20policy#config-records-per-epoch. However, after upgrading to a newer version of Determined, experiments seem to ignore the records_per_epoch field and instead define an epoch by the length of the dataloader. Is there still a way to use records_per_epoch to define the epoch length instead?
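
For context, the kind of config this refers to looks roughly like the sketch below (illustrative values; field names as in the linked reference):

```yaml
records_per_epoch: 50000    # intended to define what one "epoch" means for scheduling
min_validation_period:
  epochs: 1                 # validate once per epoch
min_checkpoint_period:
  epochs: 2                 # checkpoint every two epochs
searcher:
  name: single
  metric: val_loss
  max_length:
    epochs: 10              # train for ten epochs in total
```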

@azhou-determined (Contributor)

This was deprecated for PyTorch trials in 0.21.0. The doc you linked needs to be updated to reflect this, and I will make a ticket to do that.

Previously, this value was only necessary because our code made it impossible to determine the length of the dataset at the time the trial was initialized. After some refactoring, we no longer have this lifecycle issue, so we opted not to require users to provide this value. You are correct that we now use the chief worker's dataset length to determine the epoch length.

records_per_epoch is no longer supported for PyTorch trials. If you need the epoch length to be different from the length of the dataloader, you can configure the schedule in batches, or modify the length of the actual DataLoader.
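
For example, an epoch-style schedule can be expressed purely in batches. A rough sketch, assuming a global batch size of 64 and about 50,000 records per intended epoch:

```yaml
# 50000 records / 64 records per batch ≈ 782 batches per "epoch"
min_validation_period:
  batches: 782
min_checkpoint_period:
  batches: 1564
searcher:
  name: single
  metric: val_loss
  max_length:
    batches: 7820    # roughly ten "epochs"
```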

@charles-viss (Author)

This still seems undesirable, though, because as you mention in the docs, the length of a dataset/dataloader can vary depending on your augmentation or sampling strategy.

@garrett361 (Member)

Hi @charles-viss, can you expand upon your use case a bit more? Is it that you have a fixed dataset, but then run some augmentation so that the size of the dataset is effectively expanded? If so, is it safe to assume that the size of the expanded dataset is the same for every epoch, or will that also vary?

Thinking about how we might best address your scenario, but would like to make sure we understand the situation precisely, first.

@garrett361 (Member)

Also, which version of Determined are you currently running?

@charles-viss (Author)

For example, one use case is training over a fixed dataset with or without category-weighted sampling. Because custom data samplers change the size of the dataloader, I've used the records_per_epoch parameter to set a universal epoch length and ensure consistent training times across experiments. The alternative, as you've said, is to do all scheduling in batches, which is doable but would require rewrites in several experiments where I've used the epoch unit to control things like lr scheduling or augmentation strategies.

@charles-viss (Author)


One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden if a records_per_epoch parameter is provided. This would at least enable backward compatibility. We are currently unable to fork experiments or use config files from before the version update because of this scheduling change.

@charles-viss (Author)

We are currently using Determined version 0.21.2.

@garrett361 (Member)

In 0.21.2, I think converting everything to batch-based scheduling is the most robust solution, though I understand this will necessitate some unfortunate refactoring of your current code. Our apologies!

> One helpful feature could be to have the length of an epoch defined by the length of the dataloader by default, but allow it to be overridden if a records_per_epoch parameter is provided.

Thank you for the suggestion, we will consider adding such a feature.

One more question about your use case: are you using WeightedRandomSampler in the scenario described above? If so, can you use its num_samples argument to keep a fixed epoch size, rather than Determined's records_per_epoch? I understand that this does not address the forking issue, in any case.
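
A minimal sketch of the num_samples idea in plain PyTorch (the RECORDS_PER_EPOCH constant and the train_dataset name are placeholders for your own code):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

RECORDS_PER_EPOCH = 50_000  # the fixed epoch size previously set via records_per_epoch

# per-sample weights, e.g. inverse class frequency (uniform here for brevity)
weights = torch.ones(len(train_dataset))

sampler = WeightedRandomSampler(
    weights=weights,
    num_samples=RECORDS_PER_EPOCH,  # fixes len(sampler), and hence the dataloader length
    replacement=True,
)
train_loader = DataLoader(train_dataset, batch_size=64, sampler=sampler)
# len(train_loader) == ceil(RECORDS_PER_EPOCH / 64), independent of len(train_dataset)
```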

@charles-viss (Author)

In one situation we use a WeightedRandomSampler, where that could work. In other cases, though, such as when using a custom data sampler for triplet loss, the samplers have essentially infinite length, so a switch to batch-based scheduling will be necessary.
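
For instance, a sampler along these lines (a hypothetical sketch, not our actual implementation) has no meaningful length to schedule against:

```python
import random
from torch.utils.data import Sampler

class InfiniteTripletSampler(Sampler):
    """Yields indices forever, so the dataloader has no natural epoch length."""

    def __init__(self, labels):
        self.labels = labels

    def __iter__(self):
        while True:
            # hypothetical anchor selection; the real triplet mining logic is omitted
            yield random.randrange(len(self.labels))

    # deliberately no __len__: the sampler is effectively unbounded
```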
