
Remote training recovery from interruptions #505

Open
dberenbaum opened this issue Mar 22, 2023 · 3 comments
Labels
A: checkpoints Area: `live.make_checkpoint` p2-medium

Comments

@dberenbaum
Contributor

If you are training remotely and the machine shuts down, there's often no way to recover the last saved checkpoint on the new remote machine.

We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:

  1. Each time the model is saved, DVCLive pushes the model to the remote and sends its metadata to Studio as part of the live metrics updates. If training is interrupted, all of this info has already been saved.
  2. When resuming training with `Live(resume=True)`, DVCLive can fetch the model using the info saved in step 1 if there is no model in the workspace.

We need some mechanism to tie the resumed experiment to the interrupted experiment. Is the experiment revision consistent between them? Should we require an experiment name be passed to tie them together?
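
As a rough sketch, step 2 could look something like the following, assuming the model was pushed to the DVC remote under a known experiment revision in step 1. The revision lookup is exactly the open question above, so here it is simply passed in; `restore_model` is a hypothetical helper, not existing DVCLive code:

```python
import os

import dvc.api


def restore_model(model_path: str, exp_rev: str, repo: str = ".") -> None:
    """Fetch the last saved model from the DVC remote if it is missing locally."""
    if os.path.exists(model_path):
        return  # the workspace already has a model; nothing to restore
    os.makedirs(os.path.dirname(model_path) or ".", exist_ok=True)
    # dvc.api.open streams a tracked file straight from the remote at `rev`.
    with dvc.api.open(model_path, repo=repo, rev=exp_rev, mode="rb") as fobj:
        with open(model_path, "wb") as out:
            out.write(fobj.read())
```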

@dberenbaum
Contributor Author

See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.

@daavoo
Contributor

daavoo commented Mar 23, 2023

> We have the tools to make it possible to recover in that scenario (without using DVC checkpoints) if we do something like:

To clarify, you mean that we have all the pieces to implement it, right?

> Should we require an experiment name be passed to tie them together?
>
> See also https://docs.wandb.ai/guides/runs/resuming for ideas/comparison.

I think we could just:

  • `resume=True` == Try to resume from the workspace.
  • `resume="{exp_name}"` == Try to resume from the remote model / Studio info.
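
In user-facing terms (the string form of `resume` is the proposal here, not something DVCLive supports today):

```python
from dvclive import Live

# Resume from the workspace, as resume=True works today:
live = Live(resume=True)

# Proposed: resume an interrupted experiment by name, fetching
# the model/progress from the remote and the Studio info:
live = Live(resume="my-exp-name")
```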

@mikolajpabiszczak

Since this is still open, let me share how I've handled this problem for the moment, as an idea/comparison.

Training is done on AWS EC2 instances using Keras/TensorFlow.

Backup and restoration of the models is handled by keras.callbacks.BackupAndRestore. The backups are not tracked by DVC; instead, they are saved on an EFS attached to the instance, since EFS persists when the instance gets terminated (BTW, another EFS is used as the DVC cache). One also needs a way to back up and restore the DVCLive progress. This requires some minor hacking, as DVCLive does not communicate with the BackupAndRestore callback. How this is done:

  1. Assuming a DVCLive progress backup exists (on EFS!), copy that backup to the DVCLive location in the training repo, and only then (this order is important in the current implementation of DVCLiveCallback) instantiate a DVCLiveCallback (for Keras) and append it to the list of callbacks. DVCLiveCallback looks for existing progress upon instantiation, so the backup needs to be restored beforehand.
  2. Write an additional callback (say DVCLiveCheckpoint) that copies the DVCLive files to the EFS as a backup on_epoch_end (the order of the callbacks in the list ensures this happens after the model backup). See the sketch at the end of this comment.

Now the final issue: assuming the training is triggered by a GitHub action which uses CML to deploy the EC2 instance, ... - what is the way to find the correct backup on EFS? Easy: use the commit SHA, so, for instance, store the backups on EFS inside a {COMMIT_SHA} dir.

And a final polish: BackupAndRestore deletes its backups after training finishes successfully, so give DVCLiveCheckpoint an on_train_end method that also deletes the DVCLive backup.
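
A minimal sketch of the whole setup, assuming the EFS is mounted at /mnt/efs and the commit SHA is exposed via the GITHUB_SHA environment variable; the paths and the DVCLiveCheckpoint name are illustrative, not the exact implementation:

```python
import os
import shutil

import tensorflow as tf
from dvclive.keras import DVCLiveCallback

# Illustrative layout: per-commit backup dirs on the persistent EFS mount.
sha = os.environ.get("GITHUB_SHA", "dev")
MODEL_BACKUP_DIR = f"/mnt/efs/backups/{sha}/model"
DVCLIVE_BACKUP_DIR = f"/mnt/efs/backups/{sha}/dvclive"
DVCLIVE_DIR = "dvclive"  # DVCLive's default output directory


class DVCLiveCheckpoint(tf.keras.callbacks.Callback):
    """Back up the DVCLive directory to EFS after every epoch."""

    def on_epoch_end(self, epoch, logs=None):
        shutil.copytree(DVCLIVE_DIR, DVCLIVE_BACKUP_DIR, dirs_exist_ok=True)

    def on_train_end(self, logs=None):
        # Mirror BackupAndRestore: drop the backup once training succeeds.
        shutil.rmtree(DVCLIVE_BACKUP_DIR, ignore_errors=True)


# Step 1: restore the DVCLive backup *before* creating DVCLiveCallback,
# since it looks for existing progress when instantiated.
if os.path.isdir(DVCLIVE_BACKUP_DIR):
    shutil.copytree(DVCLIVE_BACKUP_DIR, DVCLIVE_DIR, dirs_exist_ok=True)

callbacks = [
    tf.keras.callbacks.BackupAndRestore(backup_dir=MODEL_BACKUP_DIR),
    DVCLiveCallback(),
    DVCLiveCheckpoint(),  # placed after the model backup in the list
]
# ... then pass `callbacks` to model.fit(...)
```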
