
Cuda device shows not available on EC2 instance #181

Singh-sid930 opened this issue Apr 4, 2024 · 2 comments

@Singh-sid930

I am trying to run training on an EC2 instance that has CUDA capabilities.

```
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   26C    P0              25W /  70W |      2MiB / 15360MiB |      6%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
```

and

```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
```

However, I keep getting the following error when I try to run training. Note that I have run COLMAP on the same instance, and it appeared to run fine using the GPU.

```
Optimizing ../output
Output folder: ../output [04/04 06:07:14]
Tensorboard not available: not logging progress [04/04 06:07:14]
Reading camera 1006/1006 [04/04 06:07:18]
Loading Training Cameras [04/04 06:07:19]
Traceback (most recent call last):
  File "train.py", line 219, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
    self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
    camera_list.append(loadCam(args, id, c, resolution_scale))
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
    image_name=cam_info.image_name, uid=id, data_device=args.data_device)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 39, in __init__
    self.original_image = image.clamp(0.0, 1.0).to(self.data_device)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```
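
A minimal sanity check (a sketch, assuming the same conda environment and a recent PyTorch) to confirm whether PyTorch sees the GPU at all, independent of train.py:

```python
# Minimal sanity check (sketch): confirm PyTorch can see the T4 from the
# same conda environment before suspecting train.py itself.
import torch

print(torch.__version__)
print(torch.version.cuda)                 # CUDA version PyTorch was built against
print(torch.cuda.is_available())          # expected: True
print(torch.cuda.device_count())          # expected: 1 on this instance
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # expected: "Tesla T4"
```
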
@Singh-sid930 (Author)

What is even stranger is that if I run a Python console in the terminal, inside the same conda environment, the same line of code finds the CUDA device; it only fails when run through the train.py script:

```
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python train.py -s ../data/images/images_1 --data_device cpu
Optimizing 
Output folder: ./output/fc4ede38-7 [05/04 06:17:32]
Tensorboard not available: not logging progress [05/04 06:17:32]
Reading camera 1006/1006 [05/04 06:17:36]
Loading Training Cameras [05/04 06:17:37]
Traceback (most recent call last):
  File "train.py", line 219, in <module>
    training(lp.extract(args), op.extract(args), pp.extract(args), args.test_iterations, args.save_iterations, args.checkpoint_iterations, args.start_checkpoint, args.debug_from)
  File "train.py", line 35, in training
    scene = Scene(dataset, gaussians)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/__init__.py", line 73, in __init__
    self.train_cameras[resolution_scale] = cameraList_from_camInfos(scene_info.train_cameras, resolution_scale, args)
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 58, in cameraList_from_camInfos
    camera_list.append(loadCam(args, id, c, resolution_scale))
  File "/home/ubuntu/workspace/gaussian-splatting/utils/camera_utils.py", line 52, in loadCam
    image_name=cam_info.image_name, uid=id, data_device=args.data_device)
  File "/home/ubuntu/workspace/gaussian-splatting/scene/cameras.py", line 53, in __init__
    rand_a = torch.rand((3,3)).cuda()
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(gaussian_splatting) ubuntu@ip-172-31-5-223:~/workspace/gaussian-splatting$ python 
Python 3.7.13 (default, Oct 18 2022, 18:57:03) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch as torch
>>> rand_a = torch.rand((3,3)).cuda()
>>> rand_a
tensor([[0.3751, 0.8623, 0.5603],
        [0.7451, 0.6077, 0.7982],
        [0.9916, 0.0623, 0.5862]], device='cuda:0')
```
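
As the error text itself suggests, rerunning with synchronous CUDA calls can make the traceback point at the actual failing operation; a minimal sketch (the variable must be set before CUDA initializes, so either export it in the shell or set it at the very top of the script):

```python
# Sketch: enable synchronous CUDA error reporting, as the error message above
# suggests, so the traceback points at the real failing operation.
# CUDA_LAUNCH_BLOCKING must be set before torch initializes its CUDA context.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
x = torch.rand((3, 3)).cuda()  # the same call that fails inside cameras.py
print(x.device)
```
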

@Singh-sid930 (Author)

Strangely, what fixed the error was raising the open-file limit with `ulimit -n 2048`. That led to the realization that my images were a bit too large and too numerous (1024x1960 resolution, 1000 images), and CUDA would crash out of memory.
After decreasing the size of the images by roughly a factor of eight, things have gotten much better with both COLMAP and training.
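
For reference, a minimal sketch of that workaround from Python; the paths, the resize factor, and the use of Pillow are illustrative assumptions rather than the exact steps taken:

```python
# Sketch of the workaround: check/raise the per-process open-file limit
# (the Python counterpart of `ulimit -n 2048`) and downscale the inputs
# before rerunning COLMAP / training. Paths and the scale factor are
# illustrative assumptions, not taken from the original report.
import resource
from pathlib import Path
from PIL import Image

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")
# Raise the soft limit up to the hard limit, like `ulimit -n 2048`.
resource.setrlimit(resource.RLIMIT_NOFILE, (min(2048, hard), hard))

src = Path("../data/images/images_1")       # original 1024x1960 images (assumed path)
dst = Path("../data/images/images_small")   # downscaled copies (assumed path)
dst.mkdir(parents=True, exist_ok=True)

for img_path in sorted(src.glob("*.jpg")):
    with Image.open(img_path) as im:
        # Dividing each side by ~3 cuts the pixel count by roughly 8x;
        # the exact factor here is a guess.
        im.resize((im.width // 3, im.height // 3)).save(dst / img_path.name)
```
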
