
ValueError: matrix contains invalid numeric entries #6

Open
zhaoyangwei123 opened this issue Jan 7, 2023 · 7 comments

@zhaoyangwei123

Hello, @LiWentomng
I tried to reproduce your paper Box2Mask, but I ran into the following problem, and the model had a very large loss at the beginning of training. How can I solve it?

2023-01-07 12:27:40,129 - mmdet - INFO - Iter [50/368750] lr: 5.000e-06, eta: 3 days, 14:44:25, time: 0.847, data_time: 0.050, memory: 6779, loss_cls: 9.3236, loss_project: 6.2381, loss_levelset: 0.0710, d0.loss_cls: 9.0557, d0.loss_project: 5.5436, d0.loss_levelset: 0.0670, d1.loss_cls: 9.3925, d1.loss_project: 5.5199, d1.loss_levelset: 0.0640, d2.loss_cls: 9.1847, d2.loss_project: 5.7577, d2.loss_levelset: 0.0549, d3.loss_cls: 9.3142, d3.loss_project: 5.8749, d3.loss_levelset: 0.0656, d4.loss_cls: 9.4000, d4.loss_project: 5.8713, d4.loss_levelset: 0.0596, d5.loss_cls: 9.0998, d5.loss_project: 6.2049, d5.loss_levelset: 0.0682, d6.loss_cls: 9.1544, d6.loss_project: 6.1733, d6.loss_levelset: 0.0779, d7.loss_cls: 9.0938, d7.loss_project: 6.3329, d7.loss_levelset: 0.0836, d8.loss_cls: 8.7211, d8.loss_project: 6.4827, d8.loss_levelset: 0.0856, loss: 152.4366, grad_norm: 307.3523

Traceback (most recent call last):
File "./tools/train.py", line 242, in
main()
File "./tools/train.py", line 231, in main
train_detector(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/apis/train.py", line 244, in train_detector
runner.run(data_loaders, cfg.workflow)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 144, in run
iter_runner(iter_loaders[i], **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/iter_based_runner.py", line 64, in train
outputs = self.model.train_step(data_batch, self.optimizer, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/parallel/distributed.py", line 63, in train_step
output = self.module.train_step(*inputs[0], **kwargs[0])
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 248, in train_step
losses = self(**data)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 116, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/base.py", line 172, in forward
return self.forward_train(img, img_metas, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/detectors/maskformer.py", line 104, in forward_train
losses = self.panoptic_head.forward_train(x, img_metas, gt_bboxes,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 440, in forward_train
losses = self.loss(all_cls_scores, all_mask_preds, all_lst_feats,gt_labels, gt_masks,
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/runner/fp16_utils.py", line 205, in new_func
return old_func(*args, **kwargs)
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 203, in loss
losses_cls, loss_project, loss_levelset = multi_apply(
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 239, in loss_single
num_total_pos,num_total_neg) = self.get_targets(cls_scores_list, mask_preds_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 142, in get_targets
neg_inds_list) = multi_apply(self._get_target_single, cls_scores_list,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/utils/misc.py", line 30, in multi_apply
return tuple(map(list, zip(*map_results)))
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/models/dense_heads/box2mask_head.py", line 167, in _get_target_single
assign_result = self.assigner.assign(cls_score, mask_pred,
File "/home/ubuntu/wzy/BoxInstSeg/mmdet/core/bbox/assigners/mask_hungarian_assigner.py", line 119, in assign
matched_row_inds, matched_col_inds = linear_sum_assignment(cost)
ValueError: matrix contains invalid numeric entries
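
For reference, a minimal sketch (not the repository's code) of how this error arises: scipy.optimize.linear_sum_assignment refuses any cost matrix containing NaN or inf, so a non-finite loss or prediction upstream surfaces here as "matrix contains invalid numeric entries". The guard below is purely illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# A single non-finite entry in the cost matrix triggers the ValueError above.
cost = np.array([[0.3, 1.2],
                 [np.nan, 0.7]])

# Hypothetical guard: check finiteness before assignment to find out which
# cost term (cls / project / levelset) produced the NaN or inf.
if not np.isfinite(cost).all():
    raise RuntimeError('cost matrix contains NaN/inf; inspect the losses that build it')

row_ind, col_ind = linear_sum_assignment(cost)
```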

@LiWentomng
Owner

Hello @zhaoyangwei123
The large loss is normal for Box2Mask. I have uploaded my training log file for COCO (R-101); you can refer to it.

I didn't encounter the above problem; it seems to be a problem with the assigner. Are you training on COCO or on your own dataset? I have tested the code and configs, and they are normal for COCO and VOC.

@zhaoyangwei123
Author

@LiWentomng I am training on COCO with 8 NVIDIA RTX 2080 Ti GPUs, so I changed the image size from (1024, 1024) to (800, 800) with batch=1 and num_workers=0. I don't know if it's because I changed these parameters.

@LiWentomng
Owner

@zhaoyangwei123
I suggest you first try VOC with the RTX 2080 Ti GPUs. VOC needs less GPU memory and less training time. The VOC link with COCO-format annotations is here.

I guess that batch_size=1 may cause this problem. I will check it.

@LiWentomng
Owner

@zhaoyangwei123
I have fixed this issue. When batch_size=1, the loss values would become NaN.
You can try the current code. Please note that when batch_size=1, the learning rate (lr), the training steps, and max_iters (50e by default) need to be changed proportionally.
Any further questions can be discussed.
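
As a rough illustration of the "change proportionally" advice, here is a minimal sketch using common MMDetection conventions; the baseline numbers (batch size 16, lr, max_iters) are assumptions for the example, not the repository's actual defaults.

```python
# Linear scaling sketch: shrinking the total batch size from an assumed 16 to 1
# scales the learning rate down and the number of iterations up by the same factor.
base_batch_size = 16       # assumed default: 8 GPUs x 2 images per GPU
base_lr = 1e-4             # assumed default learning rate
base_max_iters = 90000     # assumed default number of training iterations

my_batch_size = 1          # samples_per_gpu=1 on a single GPU
scale = my_batch_size / base_batch_size

optimizer = dict(type='AdamW', lr=base_lr * scale, weight_decay=0.05)
runner = dict(type='IterBasedRunner', max_iters=int(base_max_iters / scale))
```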

@zhaoyangwei123
Author

@LiWentomng
Thank you very much for your answer, but when I run your new code, I have the following problem:
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/transforms.py", line 767, in init
assert crop_size[0] > 0 and crop_size[1] > 0
TypeError: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 69, in build_from_cfg
return obj_cls(**args)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/custom.py", line 129, in init
self.pipeline = Compose(pipeline)
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/pipelines/compose.py", line 23, in init
transform = build_from_cfg(transform, PIPELINES)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: RandomCrop: '>' not supported between instances of 'tuple' and 'int'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "tools/train.py", line 242, in
main()
File "tools/train.py", line 218, in main
datasets = [build_dataset(cfg.data.train)]
File "/home/ubuntu/wzy/BoxInstSeg/BoxInstSeg-main/mmdet/datasets/builder.py", line 82, in build_dataset
dataset = build_from_cfg(cfg, DATASETS, default_args)
File "/home/ubuntu/miniconda3/envs/boxinstseg/lib/python3.8/site-packages/mmcv/utils/registry.py", line 72, in build_from_cfg
raise type(e)(f'{obj_cls.__name__}: {e}')
TypeError: CocoDataset: RandomCrop: '>' not supported between instances of 'tuple' and 'int'

I verified BoxLevelSet and BoxInst, and both work fine, so I think there may be some errors in the Box2Mask code you uploaded.
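
For reference, a minimal sketch (not the actual config) of what this RandomCrop error means: the assert at transforms.py line 767 compares crop_size[0] with an int, so it fails with exactly this TypeError when the crop size is accidentally nested one level too deep.

```python
image_size = (1024, 1024)

# Wrong: crop_size[0] is the tuple (1024, 1024), so crop_size[0] > 0 raises
# TypeError: '>' not supported between instances of 'tuple' and 'int'
crop_size = (image_size,)

# Right: crop_size[0] and crop_size[1] are plain ints and the assert passes
crop_size = image_size
assert crop_size[0] > 0 and crop_size[1] > 0
```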

@LiWentomng
Owner

@zhaoyangwei123
When did this error appear? At the start of training or during the training process?
I have tested the code and config with 800x800 and bs=1, and the training works fine.
According to the reported error, is the image size in your config in the right format, i.e. image_size = (800, 800)?
Can you share your config information?

@zhaoyangwei123
Author

@LiWentomng
Hello, my error came at the beginning of training, and I have the following config: image_size = (1024, 1024), samples_per_gpu=1, workers_per_gpu=0, lr=0.00005. The other configuration is unchanged.
Because the errors were reported on multiple GPUs, I decided to solve the problem on a single GPU first. On a single 2080 Ti, the image size was left unchanged.
I located the error at line 767 of transforms.py.
