Fix DDP prediction and checkpoint Issues #884
Conversation
Some minor notes to be addressed.
chemprop/cli/train.py
Outdated
```python
torch.distributed.destroy_process_group()

best_ckpt_path = trainer.checkpoint_callback.best_model_path
trainer = pl.Trainer(
    logger=trainer_logger,
    enable_progress_bar=True,
    accelerator=args.accelerator,
    devices=1,
)
model = build_model(args, train_loader.dataset, output_transform, input_transforms)
model = model.load_from_checkpoint(best_ckpt_path)
predss = trainer.predict(model, dataloaders=test_loader)
```
Please check my understanding - we can train and validate in DDP, but we will always test in single-GPU mode? This seems dubious for large datasets/models. Why not have each process reload the best model and continue in DDP?
Your understanding is correct. DDP uses a distributed sampler, which drops samples to ensure the number of batches divides evenly across the GPUs. This is acceptable during validation, where we only need to measure model performance, but for testing we should evaluate every data point. Even if the data points were evenly distributed across processes, we would need to save the predictions from each process, merge them, and also recover the indices from the sampler so the results can be written to a file with their SMILES ordered correctly. IMO, inference with the D-MPNN model is not expensive, so I think using a single GPU here is fine.
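For reference, a minimal sketch of what gathering DDP test predictions would involve (my own illustration, not code from this PR; it assumes `torch.distributed` is already initialized and each rank tracks the dataset indices its sampler produced):

```python
import torch.distributed as dist

def gather_ordered_predictions(local_preds, local_indices, world_size):
    """Hypothetical helper: collect per-rank predictions and restore dataset order."""
    gathered = [None] * world_size
    # all_gather_object handles arbitrary picklable objects, e.g. lists of tensors.
    dist.all_gather_object(gathered, (local_indices, local_preds))
    # Flatten and sort by original dataset index so rows line up with the SMILES file.
    # Note: any indices duplicated by the sampler's padding would still need de-duplication.
    pairs = [(i, p) for idxs, preds in gathered for i, p in zip(idxs, preds)]
    pairs.sort(key=lambda ip: ip[0])
    return [pred for _, pred in pairs]
```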
I think Lightning does the opposite of what you are describing - look at the note at the bottom of this section of their docs: https://lightning.ai/docs/pytorch/stable/common/lightning_module.html#testing

It says that samples will actually be duplicated when batches don't evenly divide across GPUs. It even suggests running `validate` (and I think it means to suggest `test`, too) on only a single device to avoid this.
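To make the duplication concrete, here is a small standalone sketch (my own illustration, not from the docs) showing the padding with `DistributedSampler` directly:

```python
from torch.utils.data.distributed import DistributedSampler

# 10 samples across 4 replicas: each rank gets ceil(10 / 4) = 3 indices,
# so 12 indices are drawn in total and two samples are evaluated twice.
per_rank = [
    list(DistributedSampler(range(10), num_replicas=4, rank=r, shuffle=False))
    for r in range(4)
]
print([len(ix) for ix in per_rank])              # [3, 3, 3, 3]
print(sorted(i for ix in per_rank for i in ix))  # 0..9, with two duplicates
```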
Regardless of this, I think what we have here is sound. My above comment might be a good thing to think about in the future, but I don't think it's a big deal.
👍
@shihchengli please update branch, @KnathanM please take a quick look at this, and then I think it should be ready to merge.
Force-pushed from fc5f7eb to 43ee8e2
chemprop/cli/train.py
Outdated
```python
model = build_model(args, train_loader.dataset, output_transform, input_transforms)
model = model.load_from_checkpoint(best_ckpt_path)
```
Suggested change:

```diff
-model = build_model(args, train_loader.dataset, output_transform, input_transforms)
-model = model.load_from_checkpoint(best_ckpt_path)
+model = model.load_from_checkpoint(best_ckpt_path)
```
I don't think you need to use `build_model` again. `load_from_checkpoint` is a class method, so you could even do `MPNN.load_from_checkpoint()` without a model (though then you'd have to check whether it is multicomponent). If you think `model = model.load_from_checkpoint(best_ckpt_path)` is unclear, you could consider `model = model.__class__.load_from_checkpoint(best_ckpt_path)`.
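For illustration, the options mentioned above side by side (a sketch; the `chemprop.models.MPNN` import path is my assumption):

```python
from chemprop.models import MPNN

# Classmethod call: no instance needed, but you'd have to pick the right class
# yourself (e.g., multicomponent models would need their own class).
model = MPNN.load_from_checkpoint(best_ckpt_path)

# Instance-based variants: both resolve to the same classmethod, and
# __class__ makes the dispatch explicit for any subclass.
model = model.load_from_checkpoint(best_ckpt_path)
model = model.__class__.load_from_checkpoint(best_ckpt_path)
```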
Good catch! I noticed that we don't actually need to reload the model weights: the model in each DDP process should already have the same weights, so we can just use it directly.
I think you do need to reload the model weights, because we want to use the best model for testing, not the last one. When we do `predss = trainer.predict(dataloaders=test_loader)`, the trainer remembers which model checkpoint is the best and uses that, but if we make a new trainer, I don't think it has that information and it will use the most recent model weights.
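For what it's worth, Lightning exposes this explicitly: `Trainer.predict` accepts `ckpt_path="best"`, which restores the best checkpoint tracked by the `ModelCheckpoint` callback from `fit` (a sketch of that option, not necessarily what this PR should do):

```python
# Assumes a ModelCheckpoint callback was active during trainer.fit(...);
# "best" tells the same trainer to reload its best checkpoint before predicting.
predss = trainer.predict(model, dataloaders=test_loader, ckpt_path="best")
```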
@KnathanM You are right! Just changed it back.
Description
Addresses issue #874. Please refer to the issue for details.
Relevant issues
#853 #874
Checklist