Training with GPU #119
Replies: 4 comments · 22 replies
-
@chiku-parida First, update the code, and then set num_workers=0; this is very important. Also note that you should not use more than one GPU card. I think the cause of this issue is that the code mixes NumPy arrays and tensor operations. When the default device is set to the GPU, the NumPy arrays stay on the CPU while the tensors are on the GPU, which causes this error. If you want to use multiple cards, you need to convert all NumPy arrays into tensors and make sure they are assigned to the correct GPU. If this doesn't work, perhaps you need to set generator="cuda" in MGLDataLoader. I have already modified my code, so I forget exactly which parameters to change. I hope this is helpful to you.
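A minimal sketch of the CPU/GPU mismatch described above (the array contents and variable names are illustrative, and the snippet falls back to CPU when no GPU is present): a NumPy array always lives on the CPU, so mixing it into GPU tensor arithmetic fails; converting it with `torch.as_tensor(..., device=...)` puts both operands on the same device.

```python
import numpy as np
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

weights_np = np.array([0.2, 0.3, 0.5])  # NumPy arrays always live on the CPU
features = torch.ones(3, device=device)  # this tensor may live on the GPU

# Mixing them directly (e.g. features * torch.from_numpy(weights_np)) fails on
# a GPU, because torch.from_numpy produces a CPU tensor. Move the data first:
weights = torch.as_tensor(weights_np, dtype=features.dtype, device=features.device)

result = features * weights  # both operands are now on the same device
print(result.device)
```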
-
Thanks @SmallBearC. I followed all your instructions, but the error persists. I am adding the whole error output below. Please look into it.
-
I think it might be more helpful if you paste your script here.
-
Please read my response below.
-
I am sorry for the previous wrong file. Please look at the attached file. I have also set cuda as the default device. @shyuep
-
The settings you set here are incorrect:
-
Please refer to the PyTorch documentation on setting the default device: https://pytorch.org/docs/stable/notes/cuda.html In general, if you wrap your entire code with a `torch.device` context manager, everything will be created on that device.
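For concreteness, here is a small sketch of the context-manager form from the linked docs (with a CPU fallback added so it also runs without a GPU): every tensor and module created inside the `with` block defaults to the chosen device, so factory calls no longer need explicit `device=` arguments.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Everything created inside this block defaults to `device`.
with torch.device(device):
    x = torch.zeros(4)               # no device= argument needed
    layer = torch.nn.Linear(4, 2)    # parameters also land on `device`

print(x.device, layer.weight.device)
```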
-
Hi! Which Python version do you recommend for working with MatGL and CUDA?
-
I don't think that matters.
-
I tried with 3.9 and got the same DGLError. Should I try cloning the repo and then `pip install -e .`?
-
I tried with `pip install` and `python setup.py install` for the GPU support, and I still get the same DGLError.
-
We do not have a Docker image. Also, a lot depends on the specific GPU and OS you are using. But we have yet to encounter issues with `pip install` or `conda install`.
-
Sounds great. Are there any recommendations for how the environment should be set up?
-
A question about "Training a MEGNet Formation Energy Model with PyTorch Lightning" (maybe a stupid question!).
-
You need to save the model. Just use model.save(). The files you see during training are intermediate cache files containing the generated graphs, attributes, and labels. They are not used when actually running the model for predictions.
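The same idea in plain PyTorch terms, as a sketch with a toy `nn.Linear` standing in for the trained model (`model.save()` above is MatGL's own convenience method; the filename here is illustrative): persist the learned weights explicitly, then rebuild the architecture and reload them for predictions, rather than relying on any training-time cache files.

```python
import torch

# Toy stand-in for a trained model (illustrative only).
model = torch.nn.Linear(8, 1)

# Persist the learned parameters; training-time cache files are not needed.
torch.save(model.state_dict(), "model_weights.pt")

# Later, for predictions: rebuild the same architecture and load the weights.
model2 = torch.nn.Linear(8, 1)
model2.load_state_dict(torch.load("model_weights.pt"))
model2.eval()
```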
-
@shyuep I much appreciate your reply. I also want to thank you for uploading your talks to YouTube; I learned a lot!
-
I have tried initializing my GPU at the beginning like below.

```python
import torch

if torch.cuda.is_available():
    device = 'cuda'
else:
    device = 'cpu'
print(f'The available device is {device}')
```

The model is detecting the GPU correctly, but I still don't understand which tensors should be assigned to the GPU. I am getting the error below. Please help!

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
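A common cause of this exact error is a model (or one of its inputs) left on the CPU while the rest was moved to the GPU. A minimal sketch of the fix, with a toy model and random input standing in for the real training setup (and a CPU fallback so it runs anywhere): move both the model and every input tensor to the same device before the forward pass.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # move the model's parameters
x = torch.randn(5, 4).to(device)          # move every input tensor too

out = model(x)  # no device mismatch: parameters and inputs agree
print(out.device)
```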