Notes
Akshat found that one can end up with predictions of 0 for all inputs when loading a model from a best.pt file.
@hwpang found that this is related to using --devices "1," vs --devices "1". If the former is used during training and the latter during prediction, the mean and scale matrices in the unscale transform end up on two different GPUs, as the output below shows.
model loaded onto GPU 1
tensor([[-92.4739]], device='cuda:1')
tensor([[77.8905]], device='cuda:1')
GPU 0 is used by predicting
tensor([[0.]], device='cuda:0')
tensor([[0.]], device='cuda:0')
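The difference between the two spellings can be sketched in plain Python. This is a simplified illustration of the parsing rule as I understand it, not Lightning's actual code: the function name `resolve_device_indices` and the `available` parameter are hypothetical. A string containing a comma is read as a list of GPU indices, while a bare integer is read as a device count starting from index 0.

```python
def resolve_device_indices(devices: str, available: int) -> list[int]:
    """Hypothetical, simplified mapping from a `devices` string to GPU indices."""
    text = devices.strip()
    if text == "-1":
        # "-1" conventionally means "use every available device".
        return list(range(available))
    if "," in text:
        # "1," or "0,2": an explicit list of device indices.
        return [int(part) for part in text.split(",") if part.strip()]
    # "1" or "2": a device count, so the first N indices are used.
    return list(range(int(text)))


train_gpus = resolve_device_indices("1,", available=2)   # index list
predict_gpus = resolve_device_indices("1", available=2)  # device count
print(train_gpus, predict_gpus)
```

Under this reading, training with "1," selects GPU 1 while predicting with "1" selects GPU 0, which matches the devices shown in the output above.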
From @JacksonBurns:
This has to do with how Lightning loads models from a checkpoint and the default behavior of map_location. This is technically intended behavior on their side; we just aren't providing the map location.
If there's no foolproof way to prevent this from happening for users, we should add a warning to our documentation telling users to specify their device numbers consistently between training and prediction.
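As a rough illustration of what providing the map location means, the sketch below uses plain `torch.load` on an in-memory buffer. The dictionary is a stand-in with placeholder values, not Chemprop's real checkpoint layout, and CPU is used as the target only so the example runs anywhere. Lightning's `load_from_checkpoint` accepts a `map_location` argument as well, but whether passing it is the right fix here is an assumption, not something settled in this thread.

```python
import io

import torch

# Stand-in for the unscale transform's buffers (placeholder values).
state = {"mean": torch.tensor([[1.0]]), "scale": torch.tensor([[2.0]])}

# Serialize to memory instead of a real best.pt file.
buffer = io.BytesIO()
torch.save(state, buffer)
buffer.seek(0)

# An explicit map_location places every tensor on the chosen device,
# regardless of which device it lived on when it was saved.
target = torch.device("cpu")
loaded = torch.load(buffer, map_location=target)

devices = {name: tensor.device for name, tensor in loaded.items()}
print(devices)
```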
My vote is for a warning. The two inputs have different meanings, and it's not up to us to guess what a user actually meant to type. Guessing can lead to inconsistencies down the road, e.g., should we also assume the same behavior with chemprop train? IMO the answer is clearly no, and we want the argument to be treated the same across scripts.