Support for TorchSnapshot for efficient checkpoint saving and loading #2752
Comments
@ananthsub thanks for suggesting this feature! Let us get familiar with TorchSnapshot and see how it can be integrated into Ignite. A question I have about the usage: in DDP, user should call |
Yes, Snapshot.take should always be called on all ranks in a distributed setting. It acts as a collective.
The path specified should be a directory, and it should be the same across all ranks. In a multi-node setting, this assumes you have a storage system visible to all nodes (e.g. a cloud object store).
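To make the two points above concrete, here is a minimal sketch of how a training script might call TorchSnapshot under DDP. The helper names (`snapshot_dir`, `save_on_all_ranks`, `load_on_all_ranks`) and the `epoch_N` directory layout are made up for illustration and are not part of TorchSnapshot or Ignite. `app_state` is assumed to be a dict of objects exposing `state_dict()`/`load_state_dict()`, such as the model and optimizer.

```python
import os


def snapshot_dir(root: str, epoch: int) -> str:
    """Hypothetical helper: build the snapshot directory for an epoch.

    The result deliberately does not depend on the rank, so every
    process ends up passing the identical path.
    """
    return os.path.join(root, f"epoch_{epoch}")


def save_on_all_ranks(app_state, root: str, epoch: int):
    """Hypothetical helper: take a snapshot from every rank."""
    # Imported lazily so snapshot_dir() works without torchsnapshot installed.
    from torchsnapshot import Snapshot

    # No `if rank == 0:` guard here: Snapshot.take acts as a collective,
    # so each rank must reach this call with the same path.
    return Snapshot.take(path=snapshot_dir(root, epoch), app_state=app_state)


def load_on_all_ranks(app_state, root: str, epoch: int) -> None:
    """Hypothetical helper: restore app_state in place from a snapshot."""
    from torchsnapshot import Snapshot

    # Assumption of this sketch: restore is also invoked on every rank.
    Snapshot(path=snapshot_dir(root, epoch)).restore(app_state=app_state)
```

In a multi-node job, `root` would need to point at storage that every node can see, as noted above; a node-local directory would only work for single-node runs.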
🚀 Feature
TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Compared with torch.save/torch.load, it includes many optimizations that control memory usage and speed up checkpoint writing for DDP-style workloads. For more information, please check out the readme: https://github.com/pytorch/torchsnapshot#why-torchsnapshot
This could be a nice addition to Ignite, similar to the existing Checkpoint handler.
cc @yifuwang