GPU memory and multi-GPU mode #22

Open

fguney opened this issue Sep 30, 2020 · 9 comments

Comments

fguney commented Sep 30, 2020

Hi,

What kind of GPU did you use to train the model? I cannot run the code (neither training nor evaluation) on a 2-GPU machine (12 GB each) without running into memory problems. Is multi-GPU support planned?

Thanks!

Newdxz commented Oct 9, 2020

I have the same question. I cannot run it on a GTX 2080 Ti with 12 GB.


guillembraso (Collaborator) commented Nov 1, 2020

Hi!

Sorry for my late reply, and sorry for the inconvenience this might have caused you.

I used a Quadro P5000 GPU to train and evaluate the models. It has ~16 GB, but for me training only uses ~4-5 GB of memory. As for the OOM errors during training, could you tell me whether they happen at some point during training, or right after starting? In the former case, what could be happening is that a batch unluckily samples graphs with many nodes and edges. One possible workaround is to modify the collate_fn used by the dataloader so that such batches are not fully fed to the model. You can do that by replacing the DataLoader definition line (line 76 in src/mot_neural_solver/pl_module/pl_module.py) with the following:

from torch.utils.data import DataLoader  # IMPORTANT: use the native PyTorch DataLoader, not the one from pytorch-geometric
from torch_geometric.data import Batch

MAX_EDGES = 100000  # Edge budget per batch; needs to be adjusted to the available GPU memory

def limit_batch_size(batch):
    """Keeps as many graphs as possible in the batch within the given edge budget."""
    edges_in_batch = 0
    for i, graph in enumerate(batch, 1):
        edges_in_batch += graph.num_edges
        if edges_in_batch > MAX_EDGES:
            i -= 1  # The current graph does not fit; drop it and everything after it
            break
    return batch[:max(i, 1)]  # Always keep at least one graph so the batch is never empty

collate_fn = lambda batch: Batch.from_data_list(limit_batch_size(batch), [])
train_dataloader = DataLoader(dataset,
                              batch_size=self.hparams['train_params']['batch_size'],
                              shuffle=(mode == 'train'),
                              num_workers=self.hparams['train_params']['num_workers'],
                              collate_fn=collate_fn)
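
For what it's worth, here is a minimal sanity check of the edge-budget logic above (not from the repo; the toy graphs are invented for illustration and only assume PyTorch Geometric's Data API plus the limit_batch_size / MAX_EDGES definitions from the snippet):

import torch
from torch_geometric.data import Data

# Toy graphs with 60k, 50k and 10k edges. With MAX_EDGES = 100000, only the first
# graph is kept: adding the second one would push the batch over the budget.
graphs = [Data(edge_index=torch.zeros(2, n, dtype=torch.long)) for n in (60000, 50000, 10000)]
print([g.num_edges for g in graphs])  # [60000, 50000, 10000]
print(len(limit_batch_size(graphs)))  # 1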

As for inference, I have made a major update to the code. Inference should now comfortably run on GPUs with under 10 GB of memory. So it'd be great if you could confirm whether you are still experiencing these issues after pulling from the repo.

As for multi-GPU support @fguney, it is not currently planned. It turns out that implementing it is not very straightforward, since the interactions between PyTorch Lightning's and PyTorch Geometric's multi-GPU functionality are a bit messy. Since in principle the model should be trainable on smaller GPUs (hopefully with the solution above), I'd like to avoid going into it. But if it's the only way, then I guess I could do it!

Best,

Guillem

Newdxz commented Nov 2, 2020

Thank you very much for your reply.
After pulling your code I can run it, but during evaluation the OOM problem still exists.
Below is the error I get:
(mot_neural_solver) cust@cust-Precision-7920-Tower:~/dxz/mot_neural_solver$ python scripts/evaluate.py
WARNING - evaluate - No observers have been added to this run
INFO - evaluate - Running command 'main'
INFO - evaluate - Started
Successfully loaded pretrained weights from "/home/cust/dxz/mot_neural_solver/output/trained_models/reid/resnet50_market_cuhk_duke.tar-232"
** The following layers are discarded due to unmatched keys or layer size: ['classifier.weight', 'classifier.bias']
Loading processed dets for sequence TUD-Crossing from /home/cust/dxz/mot_neural_solver/data/2DMOT2015/test/TUD-Crossing/processed_data/det/tracktor_prepr_det.pkl
Detections for sequence PETS09-S2L2 need to be processed. Starting processing
Finished processing detections for seq PETS09-S2L2. Result was stored at /home/cust/dxz/mot_neural_solver/data/2DMOT2015/test/PETS09-S2L2/processed_data/det/tracktor_prepr_det.pkl
Found existing stored node embeddings. Deleting them and replacing them for new ones
Found existing stored reid embeddings. Deleting them and replacing them for new ones
Computing embeddings for 5270 detections
ERROR - evaluate - Failed after 0:00:17!
Traceback (most recent calls WITHOUT Sacred internals):
  File "scripts/evaluate.py", line 39, in main
    test_dataset = model.test_dataset()
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/pl_module/pl_module.py", line 80, in test_dataset
    return self._get_data('test', return_data_loader = return_data_loader)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/pl_module/pl_module.py", line 58, in _get_data
    logger=None)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/data/mot_graph_dataset.py", line 33, in __init__
    self.seq_det_dfs, self.seq_info_dicts, self.seq_names = self._load_seq_dfs(seqs_to_retrieve)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/data/mot_graph_dataset.py", line 82, in _load_seq_dfs
    seq_det_df = seq_processor.load_or_process_detections()
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 381, in load_or_process_detections
    seq_det_df = self.process_detections()
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 347, in process_detections
    self._store_embeddings()
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/data/seq_processing/seq_processor.py", line 307, in _store_embeddings
    node_out, reid_out = self.cnn_model(bboxes.cuda())
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/models/resnet.py", line 272, in forward
    f = self.featuremaps(x)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/models/resnet.py", line 265, in featuremaps
    x = self.layer1(x)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cust/dxz/mot_neural_solver/src/mot_neural_solver/models/resnet.py", line 117, in forward
    identity = self.downsample(x)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/container.py", line 100, in forward
    input = module(input)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/modules/batchnorm.py", line 106, in forward
    exponential_average_factor, self.eps)
  File "/home/cust/anaconda3/envs/mot_neural_solver/lib/python3.6/site-packages/torch/nn/functional.py", line 1923, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: CUDA out of memory. Tried to allocate 2.44 GiB (GPU 0; 10.76 GiB total capacity; 6.05 GiB already allocated; 1.82 GiB free; 7.90 GiB reserved in total by PyTorch)

guillembraso (Collaborator) commented:

Hi @Newdxz. It seems like you are running out of GPU memory while the CNN embeddings are being stored. I believe this is happening because the batch size used for the CNN is too large. Can you try setting dataset_params.img_batch_size=3000? You can do that either by changing the corresponding entry in configs/tracking_cfg.yaml, or simply by running python scripts/evaluate.py with dataset_params.img_batch_size=3000.
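
For concreteness, the two override routes look roughly like this; the Sacred-style command is taken from the comment above, while the exact YAML nesting inside configs/tracking_cfg.yaml is an assumption based on the dotted parameter name:

# Command-line override (Sacred syntax, as suggested above):
python scripts/evaluate.py with dataset_params.img_batch_size=3000

# Or edit configs/tracking_cfg.yaml (nesting assumed from the parameter name):
dataset_params:
  img_batch_size: 3000  # number of bounding-box crops the node/reid CNN embeds per forward pass

Lowering this value trades embedding speed for lower peak GPU memory during the embedding step.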

Newdxz commented Nov 2, 2020

Thank you for your reply. When I set dataset_params.img_batch_size=3000 the problem still exists, and when I change img_batch_size to 50 the problem also persists. Could this be related to the previous steps? Thank you.

guillembraso (Collaborator) commented:

Can you please send me a screenshot (no copy-paste please) of your entire output when you set img_batch_size to 50? (I wanna see how the configuration gets printed out). Thanks!

Newdxz commented Nov 2, 2020

Sure, thank you very much!

[screenshot 1]
[screenshot 2]

guillembraso (Collaborator) commented:

I see! I believe that the problem was that the config was being loaded from the checkpoint, and not your config file/command line options. It should be fixed now. Could you please pull again and run the same command?

Newdxz commented Nov 2, 2020

Thanks, with your help it now works normally. Thank you again!
