Training instance segmentation #28

marekjaszuk · 2019-08-25T23:21:02Z

Hi,
I'm trying to reproduce your instance segmentation results using the scripts you provided (train_maskyolo_step1.sh and train_maskyolo_step2.sh). I followed the instructions exactly. The scripts run and produce loss values. The first phase of training gives me the ROI rectangles that identify the persons, but after the second phase of training I get no result (neither ROIs nor masks). What do you think I could be doing wrong? Did you get your results using the same scripts?

leon-liangwu · 2019-08-26T03:00:30Z

@marekjaszuk
Hi. Can I see your log, please?

marekjaszuk · 2019-08-26T12:07:49Z

Here are the logs:
train_step1.log
train_step2.log

lucasjinreal · 2019-08-26T13:14:06Z

BTW, the original step 1 and step 2 prototxt files have different num_output values for mask_score:

1:

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 18  # 9 affordance classes + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

2:

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 2  # 1 class (person) + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

How are you able to resume from the weights? (I got an error when trying to load the step 1 trained weights into step 2.)
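
A quick way to spot this kind of mismatch before resuming is to compare `num_output` for same-named layers in the two prototxt files. The sketch below is a rough regex scan rather than a real prototxt parser. `layer_num_outputs` and `mismatched_layers` are hypothetical helpers, and the scan assumes `name:` appears before `num_output:` inside each `layer { ... }` block, as in the snippets above.

```python
import re

def layer_num_outputs(prototxt_text):
    """Map layer name -> num_output for layers that declare one.

    Hypothetical helper: a rough regex scan, assuming `name:` appears
    before `num_output:` inside each `layer { ... }` block.
    """
    result = {}
    # Split at the start of each layer block; the first chunk is any header text.
    for block in re.split(r'\blayer\s*\{', prototxt_text)[1:]:
        name = re.search(r'name:\s*"([^"]+)"', block)
        num = re.search(r'num_output:\s*(\d+)', block)
        if name and num:
            result[name.group(1)] = int(num.group(1))
    return result

def mismatched_layers(step1_text, step2_text):
    """Return {layer: (step1_num_output, step2_num_output)} where they differ."""
    a = layer_num_outputs(step1_text)
    b = layer_num_outputs(step2_text)
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

# Reduced versions of the two mask_score layers quoted above.
step1 = 'layer {\n  name: "mask_score"\n  convolution_param { num_output: 18 }\n}'
step2 = 'layer {\n  name: "mask_score"\n  convolution_param { num_output: 2 }\n}'
print(mismatched_layers(step1, step2))
```

In practice the two strings would be read from the step 1 and step 2 prototxt files. Any layer reported here has weight blobs of a different shape in the two networks, so its weights cannot be copied across as they are.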

lucasjinreal · 2019-08-26T13:14:48Z

What does the 18 in the first prototxt stand for? Isn't there only 1 class? What if I want to train on 2 or more classes?

leon-liangwu · 2019-08-26T13:44:36Z

@marekjaszuk @jinfagang
Yes, thanks. @jinfagang You are right. There is a mistake in step1.prototxt. As we only segment human bodies, the num_output should be set to 2.
@marekjaszuk I have updated the model.tgz, please try again.

marekjaszuk · 2019-08-26T22:22:19Z

The step1 of training went fine, but you refer to pva_solver.prototxt file in the train_maskyolo_step2.sh
Where can I find this .prototxt file?

lucasjinreal · 2019-08-27T01:52:50Z

@leon-liangwu Thanks for replying. I still have one more question. You set the class count to 1, but our data should contain 80 classes. How do you select only the person class for training?
Have you tried training on more than 1 class? What would the result be?

leon-liangwu · 2019-08-27T02:35:49Z

@jinfagang You can refer to script/createdata_xxx.py. I select all the person targets to create the lmdb.

leon-liangwu · 2019-08-27T02:36:48Z

@marekjaszuk Yes, just use the solver_step2.prototxt to replace pva_solver.prototxt. That is a mistake. I have updated the model.tgz.
Thanks.

lucasjinreal · 2019-08-27T03:11:08Z

If I change create_dataxxx.py to select multiple classes, such as car and person, and also edit the number of classes in the prototxt, would that work?

leon-liangwu · 2019-08-27T03:12:50Z

@jinfagang
Yes, sure.

lucasjinreal · 2019-08-27T03:34:16Z

@leon-liangwu But if I use classes other than person, the data will have no keypoints, only boxes and masks. Is that compatible with using kps_data_layer to load the data?

leon-liangwu · 2019-08-27T03:35:48Z

@jinfagang You can use scripts/createdata_mask_only.py to generate an lmdb with boxes and masks.

lucasjinreal · 2019-08-27T10:05:20Z

@leon-liangwu I have tried mask training, but I cannot reproduce your results after 40000 iterations for step1 and 40000 iterations for step2.

leon-liangwu · 2019-08-27T11:01:08Z

@jinfagang Actually, you need to change batch_size: 1 in the KpsBoxData layer and prop_num: 1 in the DecodeRois layer.
Here 'prop_num' is the number of proposals. It is usually set between 1 x batch_size and 2 x batch_size, depending on the number of targets in your dataset.
I suggest setting both values to 64; you should get good results.
Multi-GPU training is also recommended, so you can use a bigger batch size.

marekjaszuk · 2019-08-28T23:35:03Z

@leon-liangwu thank you for the last suggestions. I finally got satisfying results for masking person shapes on data generated by the createdata_mask_kps.py script. Now I'm trying to reproduce the result on data generated by the createdata_mask_only.py script. The script seems to work fine, but after starting the training I get the following error: Error in `../../caffe-maskyolo/build/tools/caffe': corrupted size vs. prev_size: 0x00007ed8900fea70 ***
I've tried various combinations of classes: a single 'person' class, several randomly selected classes, and all classes from the COCO dataset. In all cases I got the same error. Any suggestions? I've included the full error log and a sample training log.

train_cat8.log
error.log

lucasjinreal · 2019-08-29T02:28:38Z

After changing prop_num to 128 and enlarging the batch size, I still cannot replicate the mask results.

The output shows only detection boxes, but I cannot see any masks.

lucasjinreal · 2019-08-29T07:16:25Z

@marekjaszuk Have you got any mask results? What did you change? The original prototxt definitely cannot produce good results unless you change something like prop_num.

marekjaszuk · 2019-08-29T10:56:39Z

@jinfagang yes, I used prop_num=32 and the same batch_size. With 64 the GPU memory was not sufficient. I'm trying to run multi-GPU training. I installed NCCL, and rebuilt the program but running the training fails. Were you able to run multi-GPU training?

Below are my results of mask training on your image. I got them after 40000 iterations. They are not perfect, but I think longer training would improve them.

lucasjinreal · 2019-08-29T11:09:14Z

@marekjaszuk That's weird, I am using prop_num=128 and batch_size=2.

Why does batch_size affect the result so much? It shouldn't... Did you train step1 for 40000 iterations and step2 for 40000?

leon-liangwu · 2019-08-29T11:11:16Z

@marekjaszuk
Hi, I think you can try setting prop_num and batch_size both to 128 and training again.

lucasjinreal · 2019-08-29T11:26:56Z

Batch size 32 also runs out of memory on a GTX 1080 Ti...

marekjaszuk · 2019-08-29T12:15:00Z

Ok. I finally ran the multi-GPU training. But with batch_size and prop_num=32. With larger values, like 64, it causes out of GPU memory error. I have RTX 2080 Ti (11GB), Titan Xp (12GB), and Titan X (12GB), so as it seems running with larger batch_size would require a GPU with larger memory. I was training on COCO2017 train dataset.

lucasjinreal · 2019-08-30T06:14:55Z

@marekjaszuk How were you able to use multi-GPU training? Does it support NCCL?

lucasjinreal · 2019-08-30T07:31:45Z

OK, I was able to build caffe with NCCL and train with multi-GPU support.

I am setting the batch size to a maximum of 20 and prop_num to 128.

marekjaszuk · 2019-09-03T01:01:58Z

I'm still trying to train the network on data generated with the create_mask_only.py script. Unfortunately the training fails. I was trying to modify the prototxt with the model to eliminate elements related to kps, but this did not improve the situation. Do you have any working model possible to train on the data without kps?

lucasjinreal · 2019-09-03T02:22:59Z

@marekjaszuk Does instance segmentation need kps information? That would be tricky. If you train with only masks and detection, are the results bad?

leon-liangwu · 2019-09-03T08:23:15Z

@jinfagang Instance segmentation does not need kps information at all.
If you use the create_mask_only.py script to generate the lmdb, you need to change the data layer in the prototxt accordingly.

lucasjinreal · 2019-09-19T09:24:44Z

Hi, I am trying to train a new model for multi-class instance segmentation.

I got an error after changing the prototxt. Could you tell me which parts need to change?

# currently I have changed KpsBoxData into MaskBoxData
# and I want training on 5 classes

layer {
  name: "decode_roi"
  type: "DecodeRois"
  bottom: "conv_out"
  bottom: "label"
  top: "rois"
  top: "roi_labels"
  top: "bbox_targets"
  top: "bbox_inside_weights"
  top: "bbox_outside_weights"
  top: "mask_targets"
  top: "kps_targets"
  decode_rois_param {
    num_class: 5
    num_object: 3
    
    prop_num: 128

    with_mask: true
    with_kps: false
    sigma: 0.0
    mask_w: 320
    mask_h: 224
    target_size: 28
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73

    thresh: 0.2

    net_w: 320
    net_h: 224
  }
}

layer {
  name: "region_loss"
  type: "RegionLoss"
  bottom: "conv_out"
  bottom: "label"
  top: "region_loss"
  loss_weight: 1.0
  region_loss_param {
    num_class: 5
    num_object: 3
    object_scale: 5.0
    noobject_scale: 1.0
    class_scale: 1.0
    coord_scale: 1.0
    softmax: false
    rescore: false
    with_mask: true
    mask_w: 320
    mask_h: 224
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
    thresh: 0.6
    bias_match: true
  }
}

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 6  # 5 affordance classes + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

These are the places I changed. But I got a shape mismatch in the RegionLoss layer:

I0919 17:20:49.254940  5132 net.cpp:84] Creating Layer region_loss
I0919 17:20:49.254946  5132 net.cpp:406] region_loss <- conv_out_conv_out_0_split_0
I0919 17:20:49.254954  5132 net.cpp:406] region_loss <- label_data_1_split_0
I0919 17:20:49.254966  5132 net.cpp:380] region_loss -> region_loss
I0919 17:20:49.255014  5132 region_loss_layer.cpp:72] mask w: 320 mask_h_: 224 truths_: 180
F0919 17:20:49.255053  5132 region_loss_layer.cpp:95] Check failed: outputs_ == bottom[0]->count(1) (8400 vs. 4200) 
*** Check failure stack trace: ***
    @     0x7feb0aab2cd4  google::LogMessage::Fail()
    @     0x7feb0aab2c18  google::LogMessage::SendToLog()
    @     0x7feb0aab2554  google::LogMessage::Flush()
    @     0x7feb0aab5f8b  google::LogMessageFatal::~LogMessageFatal()
    @     0x7feb0b19ff53  caffe::RegionLossLayer<>::LayerSetUp()
    @     0x7feb0b21de2c  caffe::Net<>::Init()
    @     0x7feb0b22053e  caffe::Net<>::Net()
    @     0x7feb0b234f6a  caffe::Solver<>::InitTrainNet()
    @     0x7feb0b236535  caffe::Solver<>::Init()
    @     0x7feb0b23684f  caffe::Solver<>::Solver()
    @     0x7feb0b255f91  caffe::Creator_SGDSolver<>()
    @           0x40e00a  train()
    @           0x40aec7  main
    @     0x7feb0993c830  __libc_start_main
    @           0x40b9a9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

leon-liangwu · 2019-11-18T16:31:21Z

@jinfagang I have updated the repo to train segmentation more easily. Thanks.

lucasjinreal · 2019-11-19T03:21:32Z

@leon-liangwu Does it support training on multiple classes now?

leon-liangwu · 2019-11-19T03:31:46Z

@jinfagang The box, of course, has a category label. The instance only produces a binary mask. So it certainly supports training on tasks with multiple classes. But you need to convert your labels to COCO format if you want to use the scripts provided.

lucasjinreal · 2019-11-19T03:40:55Z

I mean multiple classes simultaneously in a single mask model. Does the loss function support that? How can I try it?

leon-liangwu · 2019-11-19T03:52:40Z

@jinfagang If you are referring to AffordanceNet, it is not supported now. Multi-class object detection with instance mask is supported.
You need to modify the code.

lucasjinreal · 2019-11-19T03:55:10Z

Oh, yes, mask with boxes.
Do you have an example model for training? (It is hard to see which params should be edited in the prototxt model file.)

leon-liangwu · 2019-11-19T04:08:17Z

AffordanceNet is not supported now.
If you want to detect multiple classes, you just need to modify the RegionLoss layer, the DecodeRois layer, and the layer before them that produces the feature map (conv_out), in both steps.

layer {
  name: "conv_out"
  type: "Convolution"
  bottom: "conv6/0/ds1/det"
  top: "conv_out"
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 30  # (num_class+ 1 + 4) * num_object
    pad: 0
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
  }
}
layer {
  name: "decode_roi"
  type: "DecodeRois"
  bottom: "conv_out"
  bottom: "label"
  top: "rois"
  top: "roi_labels"
  top: "bbox_targets"
  top: "bbox_inside_weights"
  top: "bbox_outside_weights"
  top: "mask_targets"
  top: "kps_targets"
  decode_rois_param {
    num_class: 5
    num_object: 3
    
    prop_num: 128

    with_mask: true
    with_kps: false
    sigma: 0.0
    mask_w: 320
    mask_h: 224
    target_size: 28
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73

    thresh: 0.2

    net_w: 320
    net_h: 224
  }
}

layer {
  name: "region_loss"
  type: "RegionLoss"
  bottom: "conv_out"
  bottom: "label"
  top: "region_loss"
  loss_weight: 1.0
  region_loss_param {
    num_class: 5
    num_object: 3
    object_scale: 5.0
    noobject_scale: 1.0
    class_scale: 1.0
    coord_scale: 1.0
    softmax: false
    rescore: false
    with_mask: true
    mask_w: 320
    mask_h: 224
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
    thresh: 0.6
    bias_match: true
  }
}
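
The formula in the `conv_out` comment can be checked against the failed assertion posted earlier in this thread (`8400 vs. 4200`). The sketch below rests on two assumptions: that `+ 1` is objectness and `+ 4` is the box coordinates (the comment gives only the formula), and that `outputs_` in `region_loss_layer.cpp` equals channels times grid cells.

```python
def conv_out_channels(num_class, num_object):
    """Channels needed by conv_out: (num_class + 1 + 4) * num_object.

    Reading the terms as class scores + objectness + 4 box coordinates
    per anchor is an assumption; the prototxt comment states only the formula.
    """
    return (num_class + 1 + 4) * num_object

expected = conv_out_channels(num_class=5, num_object=3)

# Numbers from the failed check in the log: "(8400 vs. 4200)".
expected_count, actual_count = 8400, 4200
cells = expected_count // expected        # grid cells implied by the expected size
actual_channels = actual_count // cells   # channels conv_out actually supplied

print(expected, cells, actual_channels)   # 30 280 15
```

Under these assumptions the check expected 30 channels over 280 cells while `conv_out` supplied only 15, which would be consistent with `num_class` having been changed in `RegionLoss` while `num_output` of `conv_out` was left at its previous value.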

lucasjinreal · 2019-11-19T04:12:14Z

What does num_object mean? Does it need to be modified?

leon-liangwu · 2019-11-19T04:14:59Z

num_object is the number of anchors in the region loss and decode roi layers.
With num_object: 3 there are three sets of anchors, as below.

    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
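
A small consistency check for these values, as a sketch. It assumes the i-th `anchor_x` pairs with the i-th `anchor_y`, which the thread does not state explicitly; `pair_anchors` is a hypothetical helper.

```python
def pair_anchors(anchor_x, anchor_y, num_object):
    """Pair anchor_x[i] with anchor_y[i] and check the count matches num_object.

    Pairing by index is an assumption about how the layers read the lists.
    """
    if not (len(anchor_x) == len(anchor_y) == num_object):
        raise ValueError(
            "expected %d anchor_x and anchor_y entries, got %d and %d"
            % (num_object, len(anchor_x), len(anchor_y))
        )
    return list(zip(anchor_x, anchor_y))

anchors = pair_anchors([1.33, 2.86, 7.25], [3.05, 7.02, 10.73], num_object=3)
print(anchors)  # [(1.33, 3.05), (2.86, 7.02), (7.25, 10.73)]
```

If num_object is changed, the number of anchor_x and anchor_y entries has to change with it, and so does the num_output of conv_out, which is a multiple of num_object.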

leon-liangwu · 2019-11-19T04:36:56Z

@jinfagang Hi, if you have any other problems with this repo, please feel free to let me know.
If you can train the model successfully, please close the issue so that I know it has been resolved.

lucasjinreal · 2019-11-19T06:18:13Z

@leon-liangwu I will let you know when I start training. I am afraid there will be some issues getting the model to run on multiple classes.

leon-liangwu · 2019-11-19T09:03:14Z

Yes, please.

monjha · 2020-01-31T01:55:42Z

I think the proposals are selected like this:
RoIAlign will have dimensions [batch_size, prop_num, width, height, channels].
Algorithm: separate the boxes into positive boxes (boxes_score > threshold, with 0.5 as the threshold used) and negative boxes (boxes_score < threshold).
The number of output boxes equals prop_num.
If prop_num > number of positive boxes, then add negative boxes: number of negatives = prop_num - number of positives.
So, after a couple of iterations, you can check how many predicted boxes have boxes_score > threshold and how many negative boxes you are adding. If the ratio positive_boxes / prop_num is very low, then reducing prop_num will probably not hurt the accuracy too much?
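
The selection described above can be sketched as follows. This illustrates the guess in this comment, not the DecodeRois implementation. `select_proposals` is a hypothetical helper, and two details are added assumptions: boxes are ordered by score, and a score exactly at the threshold counts as negative.

```python
def select_proposals(scores, prop_num, threshold=0.5):
    """Pick prop_num box indices: positives first, padded with negatives.

    Returns (chosen_indices, number_of_positive_boxes).
    """
    # Highest-scoring boxes first (an assumption, not stated in the thread).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = [i for i in order if scores[i] > threshold]
    negatives = [i for i in order if scores[i] <= threshold]

    chosen = positives[:prop_num]
    num_negatives = prop_num - len(chosen)   # prop_num - number of positives
    chosen += negatives[:num_negatives]
    return chosen, len(positives)

scores = [0.9, 0.1, 0.7, 0.3, 0.05, 0.6]
chosen, num_pos = select_proposals(scores, prop_num=4)
print(chosen, num_pos / 4)  # [0, 2, 5, 3] 0.75
```

Logging the ratio printed on the last line for a few iterations would show how many of the prop_num slots are being filled with negatives.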

leon-liangwu closed this as completed Sep 5, 2019

leon-liangwu reopened this Nov 18, 2019
