
Cuda device shows not available on EC2 instance #181

Singh-sid930 opened this issue Apr 4, 2024 · 2 comments

@Singh-sid930

I am trying to run training on an EC2 instance that has CUDA capabilities.

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0              25W /  70W |      2MiB / 15360MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
```

and

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

However, I keep getting the following error when I try to run training. Note that I have run COLMAP on the same instance, and it appeared to run fine using the GPU.

```
Optimizing ../output
Output folder: ../output [04/04 06:07:14]
Tensorboard not available: not logging progress [04/04 06:07:14]
Reading camera 1006/1006 [04/04 06:07:18]
Loading Training Cameras [04/04 06:07:19]
Traceback (most recent call last):
  File "train.py", line 219, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
    self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
    camera_list.append(loadCam(args, id, c, resolution_scale))
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
    image_name=cam_info.image_name, uid=id, data_device=args.data_device)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 39, in __init__
    self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
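
A minimal sanity check (a sketch, assuming the same conda environment and a recent PyTorch) to confirm whether PyTorch sees the GPU at all, independent of train.py:

```python
# Minimal sanity check (sketch): confirm PyTorch can see the T4 from the
# same conda environment before suspecting train.py itself.
import torch

print(torch.__version__)
print(torch.version.cuda)                 # CUDA version PyTorch was built against
print(torch.cuda.is_available())          # expected: True
print(torch.cuda.device_count())          # expected: 1 on this instance
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expected: "Tesla T4"
```
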
@Singh-sid930 (Author)

What is even stranger is that if I run a Python console in the terminal, inside the same conda environment, the same line of code finds the CUDA device; it only fails when run through the train.py script:

```
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python train.py -s ../data/images/images_1 --data_device cpu
Optimizing 
Output folder: ./output/fc4ede38-7 [05/04 06:17:32]
Tensorboard not available: not logging progress [05/04 06:17:32]
Reading camera 1006/1006 [05/04 06:17:36]
Loading Training Cameras [05/04 06:17:37]
Traceback (most recent call last):
  File "train.py", line 219, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
    self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
    camera_list.append(loadCam(args, id, c, resolution_scale))
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
    image_name=cam_info.image_name, uid=id, data_device=args.data_device)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 53, in __init__
    rand_a = torch.rand((3,3)).cuda()
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python 
Python 3.7.13 (default, Oct 18 2022, 18:57:03) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch as torch
>>> rand_a = torch.rand((3,3)).cuda()
>>> rand_a
tensor([[0.3751, 0.8623, 0.5603],
        [0.7451, 0.6077, 0.7982],
        [0.9916, 0.0623, 0.5862]], device='cuda:0')
```
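
As the error text itself suggests, rerunning with synchronous CUDA calls can make the traceback point at the actual failing operation; a minimal sketch (the variable must be set before CUDA initializes, so either export it in the shell or set it at the very top of the script):

```python
# Sketch: enable synchronous CUDA error reporting, as the error message above
# suggests, so the traceback points at the real failing operation.
# CUDA_LAUNCH_BLOCKING must be set before torch initializes its CUDA context.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
x = torch.rand((3, 3)).cuda()  # the same call that fails inside cameras.py
print(x.device)
```
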

@Singh-sid930 (Author)

Strangely, what fixed the error was raising the open-file limit with `ulimit -n 2048`. That led to the realization that my images were a bit too large and too numerous (1024x1960 resolution, 1000 images), and CUDA would crash out of memory.
After decreasing the size of the images by roughly a factor of eight, things have gotten much better with both COLMAP and training.
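
For reference, a minimal sketch of that workaround from Python; the paths, the resize factor, and the use of Pillow are illustrative assumptions rather than the exact steps taken:

```python
# Sketch of the workaround: check/raise the per-process open-file limit
# (the Python counterpart of `ulimit -n 2048`) and downscale the inputs
# before rerunning COLMAP / training. Paths and the scale factor are
# illustrative assumptions, not taken from the original report.
import resource
from pathlib import Path
from PIL import Image

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
# Raise the soft limit up to the hard limit, like `ulimit -n 2048`.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(2048, hard), hard))

src = Path("../data/images/images_1")       # original 1024x1960 images (assumed path)
dst = Path("../data/images/images_small")   # downscaled copies (assumed path)
dst.mkdir(parents=True, exist_ok=True)

for img_path in sorted(src.glob("*.jpg")):
    with Image.open(img_path) as im:
        # Dividing each side by ~3 cuts the pixel count by roughly 8x;
        # the exact factor here is a guess.
        im.resize((im.width // 3, im.height // 3)).save(dst / img_path.name)
```
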
