Support for TorchSnapshot for efficient checkpoint saving and loading #2752

ananthsub · 2022-10-24T18:40:01Z

🚀 Feature

TorchSnapshot is a performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind. Compared with torch.save/torch.load, it includes many optimizations that limit memory usage and optimize checkpoint writing for DDP-style workloads. For more information, please check out the README: https://github.com/pytorch/torchsnapshot#why-torchsnapshot

This could be a nice addition to Ignite, similar to the existing Checkpoint handler.

cc @yifuwang

vfdev-5 · 2022-10-27T13:23:06Z

@ananthsub thanks for suggesting this feature! Let us get a bit familiar with TorchSnapshot and see how it can be integrated into Ignite.

A question about usage: in DDP, should Snapshot.take be called by all ranks? And where should the path given in the argument live: on node 0, rank 0?

ananthsub · 2022-10-27T16:59:07Z

> In DDP, should Snapshot.take be called by all ranks?

Yes, Snapshot.take should always be called on all ranks in a distributed setting. It acts as a collective.

> And where should the path given in the argument live: on node 0, rank 0?

The path should be a directory, and it should be the same across all ranks. In a multi-node setting, this assumes you have a storage system visible to all nodes (e.g. a cloud object store).
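
A minimal sketch of the usage described above, not a definitive integration. The helper names `snapshot_path` and `take_snapshot`, the `epoch_{n}` directory layout, and the shared `root` directory are assumptions made for illustration; only `Snapshot.take(path=..., app_state=...)` comes from TorchSnapshot itself.

```python
import os


def snapshot_path(root: str, epoch: int) -> str:
    """Build the snapshot directory for a given epoch (hypothetical layout).

    The result deliberately does not depend on the process rank, so every
    rank passes the identical directory to Snapshot.take.
    """
    return os.path.join(root, f"epoch_{epoch}")


def take_snapshot(app_state, root: str, epoch: int):
    """Call this on every rank, not only on rank 0.

    Snapshot.take acts as a collective, so guarding it with a
    `if rank == 0:` check would leave the other ranks out of the call.
    `root` is assumed to be on storage visible to all nodes.
    """
    # Imported lazily so the path helper works without torchsnapshot installed.
    from torchsnapshot import Snapshot

    # app_state maps names to stateful objects, e.g. {"model": model}.
    return Snapshot.take(path=snapshot_path(root, epoch), app_state=app_state)
```

In a DDP job each process would call `take_snapshot({"model": model, "optimizer": optimizer}, root, epoch)` at the same point in training, with `root` pointing at the shared location.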

ananthsub changed the title ~~Support TorchSnapshot for efficient checkpoint saving and loading~~ Support for TorchSnapshot for efficient checkpoint saving and loading Oct 24, 2022

vfdev-5 added enhancement help wanted labels Nov 14, 2022
