
ValueError: matrix contains invalid numeric entries #6

Open
zhaoyangwei123 opened this issue Jan 7, 2023 · 7 comments

@zhaoyangwei123

Hello, @LiWentomng
I tried to reproduce your paper Box2Mask, but I ran into the following problem, and the model had a very large loss at the beginning of training. How can I solve it?

2023-01-07 12:27:40,129 - mmdet - INFO - Iter [50/368750] lr: 5.000e-06, eta: 3 days, 14:44:25, time: 0.847, data_time: 0.050, memory: 6779, loss_cls: 9.3236, loss_project: 6.2381, loss_levelset: 0.0710, d0.loss_cls: 9.0557, d0.loss_project: 5.5436, d0.loss_levelset: 0.0670, d1.loss_cls: 9.3925, d1.loss_project: 5.5199, d1.loss_levelset: 0.0640, d2.loss_cls: 9.1847, d2.loss_project: 5.7577, d2.loss_levelset: 0.0549, d3.loss_cls: 9.3142, d3.loss_project: 5.8749, d3.loss_levelset: 0.0656, d4.loss_cls: 9.4000, d4.loss_project: 5.8713, d4.loss_levelset: 0.0596, d5.loss_cls: 9.0998, d5.loss_project: 6.2049, d5.loss_levelset: 0.0682, d6.loss_cls: 9.1544, d6.loss_project: 6.1733, d6.loss_levelset: 0.0779, d7.loss_cls: 9.0938, d7.loss_project: 6.3329, d7.loss_levelset: 0.0836, d8.loss_cls: 8.7211, d8.loss_project: 6.4827, d8.loss_levelset: 0.0856, loss: 152.4366, grad_norm: 307.3523

Traceback (most recent call last):
File "./tools/train.py", line 242, in
main()
File "./tools/train.py", line 231, in main
train_detector(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/apis/train.py", line 244, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/maskformer.py", line 104, in forward_train
losses = self.panoptic_head.forward_train(x, img_metas, gt_bboxes,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 440, in forward_train
losses = self.loss(all_cls_scores, all_mask_preds, all_lst_feats,gt_labels, gt_masks,
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 203, in loss
losses_cls, loss_project, loss_levelset = multi_apply(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 239, in loss_single
num_total_pos,num_total_neg) = self.get_targets(cls_scores_list, mask_preds_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 142, in get_targets
neg_inds_list) = multi_apply(self._get_target_single, cls_scores_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 167, in _get_target_single
assign_result = self.assigner.assign(cls_score, mask_pred,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/bbox/assigners/mask_hungarian_assigner.py", line 119, in assign
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
ValueError: matrix contains invalid numeric entries
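
For reference, a minimal sketch (not the repository's code) of how this error arises: scipy.optimize.linear_sum_assignment refuses any cost matrix containing NaN or inf, so a non-finite loss or prediction upstream surfaces here as "matrix contains invalid numeric entries". The guard below is purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A single non-finite entry in the cost matrix triggers the ValueError above.
cost = np.array([[0.3, 1.2],
                 [np.nan, 0.7]])

# Hypothetical guard: check finiteness before assignment to find out which
# cost term (cls / project / levelset) produced the NaN or inf.
if not np.isfinite(cost).all():
    raise RuntimeError('cost matrix contains NaN/inf; inspect the losses that build it')

row_ind, col_ind = linear_sum_assignment(cost)
```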

@LiWentomng
Owner

Hello @zhaoyangwei123
The large loss is normal for Box2Mask. I have uploaded my training log file for COCO (R-101); you can refer to it.

I didn't encounter the above problem; it seems to be a problem with the assigner. Are you training on COCO or on your own dataset? I have tested the code and configs, and they are normal for COCO and VOC.

@zhaoyangwei123
Author

@LiWentomng I am training on COCO with 8 NVIDIA RTX 2080 Ti GPUs, so I changed the image size from (1024, 1024) to (800, 800) with batch=1 and num_workers=0. I don't know if it's because I changed these parameters.

@LiWentomng
Owner

@zhaoyangwei123
I suggest you first try VOC with the RTX 2080 Ti GPUs. VOC needs less GPU memory and less training time. The VOC link with COCO-format annotations is here.

I guess that batch_size=1 may cause this problem. I will check it.

@LiWentomng
Owner

@zhaoyangwei123
I have fixed this issue. When batch_size=1, the loss values would become NaN.
You can try the current code. Please note that when batch_size=1, the learning rate (lr), the training steps, and max_iters (50e by default) need to be changed proportionally.
Any further questions can be discussed.
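
As a rough illustration of the "change proportionally" advice, here is a minimal sketch using common MMDetection conventions; the baseline numbers (batch size 16, lr, max_iters) are assumptions for the example, not the repository's actual defaults.

```python
# Linear scaling sketch: shrinking the total batch size from an assumed 16 to 1
# scales the learning rate down and the number of iterations up by the same factor.
base_batch_size = 16       # assumed default: 8 GPUs x 2 images per GPU
base_lr = 1e-4             # assumed default learning rate
base_max_iters = 90000     # assumed default number of training iterations

my_batch_size = 1          # samples_per_gpu=1 on a single GPU
scale = my_batch_size / base_batch_size

optimizer = dict(type='AdamW', lr=base_lr * scale, weight_decay=0.05)
runner = dict(type='IterBasedRunner', max_iters=int(base_max_iters / scale))
```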

@zhaoyangwei123
Author

@LiWentomng
Thank you very much for your answer, but when I run your new code, I have the following problem:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/transforms.py", line 767, in init
assert crop_size[0] > 0 and crop_size[1] > 0
TypeError: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/custom.py", line 129, in init
self.pipeline = Compose(pipeline)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/compose.py", line 23, in init
transform = build_from_cfg(transform, PIPELINES)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: RandomCrop: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tools/train.py", line 242, in
main()
File "tools/train.py", line 218, in main
datasets = [build_dataset(cfg.data.train)]
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/builder.py", line 82, in build_dataset
dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: CocoDataset: RandomCrop: '>' not supported between instances of 'tuple' and 'int'

I verified BoxLevelSet and BoxInst, and both work fine, so I think there may be some errors in the Box2Mask code you uploaded.
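
For reference, a minimal sketch (not the actual config) of what this RandomCrop error means: the assert at transforms.py line 767 compares crop_size[0] with an int, so it fails with exactly this TypeError when the crop size is accidentally nested one level too deep.

```python
image_size = (1024, 1024)

# Wrong: crop_size[0] is the tuple (1024, 1024), so crop_size[0] > 0 raises
# TypeError: '>' not supported between instances of 'tuple' and 'int'
crop_size = (image_size,)

# Right: crop_size[0] and crop_size[1] are plain ints and the assert passes
crop_size = image_size
assert crop_size[0] > 0 and crop_size[1] > 0
```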

@LiWentomng
Owner

@zhaoyangwei123
When did this error appear? At the start of training or during the training process?
I have tested the code and config with 800x800 and bs=1, and the training works fine.
According to the reported error, is the image size in your config in the right format, i.e. image_size = (800, 800)?
Can you share your config information?

@zhaoyangwei123
Author

@LiWentomng
Hello, my error came at the beginning of training, and I have the following config: image_size = (1024, 1024), samples_per_gpu=1, workers_per_gpu=0, lr=0.00005. The other configuration is unchanged.
Because the errors were reported on multiple GPUs, I decided to solve the problem on a single GPU first. On a single 2080 Ti, the image size was left unchanged.
I located the error at line 767 of transforms.py.
