
Remote training recovery from interruptions #505

Open
dberenbaum opened this issue Mar 22, 2023 · 3 comments
Labels
A: checkpoints Area: `live.make_checkpoint` p2-medium

Comments

@dberenbaum
Contributor

If you are training remotely and the machine shuts down, there's often no way to recover the last saved checkpoint on the new remote machine.

We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:

  1. Each time the model is saved, DVCLive pushes the model to the remote and sends its metadata to Studio as part of the live metrics updates. If training is interrupted, all of this info has already been saved.
  2. When resuming training with `Live(resume=True)`, DVCLive can fetch the model using the info saved in step 1 if there is no model in the workspace.

We need some mechanism to tie the resumed experiment to the interrupted experiment. Is the experiment revision consistent between them? Should we require an experiment name be passed to tie them together?
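
As a rough sketch, step 2 could look something like the following, assuming the model was pushed to the DVC remote under a known experiment revision in step 1. The revision lookup is exactly the open question above, so here it is simply passed in; `restore_model` is a hypothetical helper, not existing DVCLive code:

```python
import os

import dvc.api


def restore_model(model_path: str, exp_rev: str, repo: str = ".") -> None:
    """Fetch the last saved model from the DVC remote if it is missing locally."""
    if os.path.exists(model_path):
        return  # the workspace already has a model; nothing to restore
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    # dvc.api.open streams a tracked file straight from the remote at `rev`.
    with dvc.api.open(model_path, repo=repo, rev=exp_rev, mode="rb") as fobj:
        with open(model_path, "wb") as out:
            out.write(fobj.read())
```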

@dberenbaum
Contributor Author

See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.

@daavoo
Contributor

daavoo commented Mar 23, 2023

> We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:

To clarify, you mean that we have all the pieces to implement it, right?

> Should we require an experiment name be passed to tie them together?
>
> See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.

I think we could just:

  • `resume=True` == Try to resume from the workspace.
  • `resume="{exp_name}"` == Try to resume from the remote model / Studio info.
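
In user-facing terms (the string form of `resume` is the proposal here, not something DVCLive supports today):

```python
from dvclive import Live

# Resume from the workspace, as resume=True works today:
live = Live(resume=True)

# Proposed: resume an interrupted experiment by name, fetching
# the model/progress from the remote and the Studio info:
live = Live(resume="my-exp-name")
```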

@mikolajpabiszczak

Since this is still open, let me share how I've handled this problem for the moment, as an idea/comparison.

Training is done on AWS EC2 instances using Keras/TensorFlow.

Backup and restoration of the models is handled by keras.callbacks.BackupAndRestore. The backups are not tracked by DVC; instead, they are saved on an EFS attached to the instance, since EFS persists when the instance gets terminated (BTW, another EFS is used as the DVC cache). One also needs a way to back up and restore the DVCLive progress. This requires some minor hacking, as DVCLive does not communicate with the BackupAndRestore callback. How this is done:

  1. Assuming a DVCLive progress backup exists (on EFS!), copy that backup to the DVCLive location in the training repo, and only then (this order is important in the current implementation of DVCLiveCallback) instantiate a DVCLiveCallback (for Keras) and append it to the list of callbacks. DVCLiveCallback looks for existing progress upon instantiation, so the backup needs to be restored beforehand.
  2. Write an additional callback (say DVCLiveCheckpoint) that copies the DVCLive files to the EFS as a backup on_epoch_end (the order of the callbacks in the list ensures this happens after the model backup). See the sketch at the end of this comment.

Now the final issue: assuming the training is triggered by a GitHub action which uses CML to deploy the EC2 instance, ... - what is the way to find the correct backup on EFS? Easy: use the commit SHA, so, for instance, store the backups on EFS inside a {COMMIT_SHA} dir.

And a final polish: BackupAndRestore deletes its backups after training finishes successfully, so give DVCLiveCheckpoint an on_train_end method that also deletes the DVCLive backup.
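
A minimal sketch of the whole setup, assuming the EFS is mounted at /mnt/efs and the commit SHA is exposed via the GITHUB_SHA environment variable; the paths and the DVCLiveCheckpoint name are illustrative, not the exact implementation:

```python
import os
import shutil

import tensorflow as tf
from dvclive.keras import DVCLiveCallback

# Illustrative layout: per-commit backup dirs on the persistent EFS mount.
sha = os.environ.get("GITHUB_SHA", "dev")
MODEL_BACKUP_DIR = f"/mnt/efs/backups/{sha}/model"
DVCLIVE_BACKUP_DIR = f"/mnt/efs/backups/{sha}/dvclive"
DVCLIVE_DIR = "dvclive"  # DVCLive's default output directory


class DVCLiveCheckpoint(tf.keras.callbacks.Callback):
    """Back up the DVCLive directory to EFS after every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        shutil.copytree(DVCLIVE_DIR, DVCLIVE_BACKUP_DIR, dirs_exist_ok=True)

    def on_train_end(self, logs=None):
        # Mirror BackupAndRestore: drop the backup once training succeeds.
        shutil.rmtree(DVCLIVE_BACKUP_DIR, ignore_errors=True)


# Step 1: restore the DVCLive backup *before* creating DVCLiveCallback,
# since it looks for existing progress when instantiated.
if os.path.isdir(DVCLIVE_BACKUP_DIR):
    shutil.copytree(DVCLIVE_BACKUP_DIR, DVCLIVE_DIR, dirs_exist_ok=True)

callbacks = [
    tf.keras.callbacks.BackupAndRestore(backup_dir=MODEL_BACKUP_DIR),
    DVCLiveCallback(),
    DVCLiveCheckpoint(),  # placed after the model backup in the list
]
# ... then pass `callbacks` to model.fit(...)
```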
