Training instance segmentation #28

Open
marekjaszuk opened this issue Aug 25, 2019 · 42 comments

Comments

@marekjaszuk

Hi,
I'm trying to reproduce your instance segmentation results using the scripts you provided (train_maskyolo_step1.sh and train_maskyolo_step2.sh). I followed the instructions. The scripts run and produce loss values. The first training phase gives me the ROI rectangles that identify the persons, but after the second phase I get no results (neither ROIs nor masks). What could I be doing wrong? Did you get your results using the same scripts?

@leon-liangwu
Owner

@marekjaszuk
Hi. Can I see your log, please?

@marekjaszuk
Author

Here are the logs:
train_step1.log
train_step2.log

@lucasjinreal

BTW, the original step 1 and step 2 prototxt files have a different number of classes (num_output of mask_score):

1:

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 18  # 9 affordance classes + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

2:

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 2  # 9 affordance classes + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

How were you able to resume from the weights? (I got an error when trying to load the step 1 trained weights into step 2.)

@lucasjinreal

What does the 18 in the first prototxt stand for? Isn't there only 1 class? What if I want to train on 2 or more classes?

@leon-liangwu
Owner

@marekjaszuk @jinfagang
Yes, thanks. @jinfagang You are right. There is a mistake in step1.prototxt. As we only segment human bodies, the num_output should be set to 2.
@marekjaszuk I have updated the model.tgz, please try again.

@marekjaszuk
Author

Step 1 of training went fine, but train_maskyolo_step2.sh refers to a pva_solver.prototxt file.
Where can I find this .prototxt file?

@lucasjinreal

@leon-liangwu Thanks for replying. I have one more question. You set the number of classes to 1, but our data should contain 80 classes, right? How do you restrict training to the person class only?
Have you tried training on more than 1 class? What would the result be?

@leon-liangwu
Owner

@jinfagang You can refer to script/createdata_xxx.py. I select all the person targets to create the lmdb.
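
For readers wondering what "select all the person targets" might look like, here is a minimal sketch. It is not the repository's createdata script; it only assumes annotations in the standard COCO JSON layout (`categories` with `id`/`name`, `annotations` with `category_id` and `image_id`), and the function name `select_categories` is hypothetical.

```python
def select_categories(coco, wanted_names):
    """Keep only annotations whose category name is in wanted_names.

    `coco` is a dict in COCO annotation layout. Images left without any
    matching annotation are dropped as well.
    """
    wanted_ids = {c["id"] for c in coco["categories"] if c["name"] in wanted_names}
    annotations = [a for a in coco["annotations"] if a["category_id"] in wanted_ids]
    used_images = {a["image_id"] for a in annotations}
    return {
        "images": [im for im in coco["images"] if im["id"] in used_images],
        "annotations": annotations,
        "categories": [c for c in coco["categories"] if c["id"] in wanted_ids],
    }


# Tiny hand-made example (not real COCO data).
toy = {
    "images": [{"id": 1}, {"id": 2}],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 1},
        {"id": 11, "image_id": 2, "category_id": 3},
    ],
}
person_only = select_categories(toy, {"person"})
print(len(person_only["annotations"]))  # 1
```

Passing `{"person", "car"}` instead would keep both toy annotations, which is the multi-class case asked about in this thread.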

@leon-liangwu
Owner

@marekjaszuk Yes, just replace pva_solver.prototxt with solver_step2.prototxt. That was a mistake. I have updated the model.tgz.
Thanks.

@lucasjinreal

If I change createdata_xxx.py to select multiple classes, such as car and person, and also edit the number of classes in the prototxt, would that work?

@leon-liangwu
Owner

@jinfagang
Yes, sure.

@lucasjinreal

@leon-liangwu But with classes other than person there are no keypoints, only boxes and masks. Is that compatible with using kps_data_layer to load the data?

@leon-liangwu
Owner

@jinfagang You can use scripts/createdata_mask_only.py to generate an lmdb with boxes and masks.

@lucasjinreal

@leon-liangwu I have tried mask training, but I cannot reproduce your results. I ran 40000 iterations for step 1 and 40000 iterations for step 2:

[image]

@leon-liangwu
Owner

@jinfagang Actually, you need to change batch_size: 1 in the KpsBoxData layer and prop_num: 1 in the DecodeRois layer.
Here 'prop_num' is the number of proposals. It is usually set between 1 * batch_size and 2 * batch_size, depending on the number of targets in your dataset.
I suggest setting both values to 64; you should get good results.
Also, multi-GPU training is recommended, so that you can use a bigger batch size.

@marekjaszuk
Author

@leon-liangwu thank you for the last suggestions. I finally got satisfying results for masking person shapes on data generated by the createdata_mask_kps.py script. Now I'm trying to reproduce the result on data generated by the createdata_mask_only.py script. The script seems to work fine. But after running the training I get the following error: Error in `../../caffe-maskyolo/build/tools/caffe': corrupted size vs. prev_size: 0x00007ed8900fea70 ***
I've tried various combinations of classes: a single 'person' class, several randomly selected classes, and all classes from the COCO dataset. In all cases I got the same error. Any suggestions? I've attached the full error log and a sample training log.

train_cat8.log
error.log

@lucasjinreal

After changing prop_num to 128 and using a larger batch size, I still cannot replicate the mask results.

The result shows only detection boxes; I cannot see any masks.

[image]

@lucasjinreal

@marekjaszuk Did you get any mask results? What did you change? The original prototxt definitely cannot produce good results unless you change something like prop_num.

@marekjaszuk
Author

@jinfagang yes, I used prop_num=32 and the same batch_size. With 64 the GPU memory was not sufficient. I'm trying to run multi-GPU training. I installed NCCL, and rebuilt the program but running the training fails. Were you able to run multi-GPU training?

Below are my results of mask training on your image. I got them after 40000 iterations. They are not perfect, but I think longer training would improve them.

[image: test]

@lucasjinreal

@marekjaszuk That's weird, I'm using prop_num=128 and batch_size=2.

Why does batch_size affect the result so much? It shouldn't... Did you train step 1 for 40000 iterations and step 2 for 40000?

@leon-liangwu
Owner

@marekjaszuk
Hi, I think you can try setting prop_num and batch_size both to 128 and train again.

@lucasjinreal

Batch size 32 also runs out of memory on a GTX 1080 Ti.

@marekjaszuk
Author

marekjaszuk commented Aug 29, 2019

OK, I finally got multi-GPU training running, but with batch_size and prop_num set to 32. Larger values, like 64, cause an out-of-GPU-memory error. I have an RTX 2080 Ti (11GB), a Titan Xp (12GB), and a Titan X (12GB), so it seems that a larger batch_size would require a GPU with more memory. I was training on the COCO2017 train dataset.

@lucasjinreal

@marekjaszuk How were you able to use multi-GPU training? Does it support NCCL?

@lucasjinreal

OK, I was able to build Caffe with NCCL and train with multi-GPU support.

I am setting the batch size to a maximum of 20 and prop_num to 128.

@marekjaszuk
Author

I'm still trying to train the network on data generated with the create_mask_only.py script. Unfortunately the training fails. I tried modifying the model prototxt to remove the elements related to kps, but this did not help. Do you have a working model that can be trained on data without kps?

@lucasjinreal

@marekjaszuk Does instance segmentation need kps information? That would be tricky. If you train with only masks and detection, are the results bad?

@leon-liangwu
Owner

@jinfagang Instance segmentation does not need kps information at all.
If you use the create_mask_only.py script to generate the lmdb, you need to change the data layer in the prototxt accordingly.

@lucasjinreal

Hi, I'm trying to train a new model for multi-class mask instance segmentation.

I got an error after changing the prototxt. Could you tell me which parts need to change?

# So far I have changed KpsBoxData to MaskBoxData
# and I want to train on 5 classes

layer {
  name: "decode_roi"
  type: "DecodeRois"
  bottom: "conv_out"
  bottom: "label"
  top: "rois"
  top: "roi_labels"
  top: "bbox_targets"
  top: "bbox_inside_weights"
  top: "bbox_outside_weights"
  top: "mask_targets"
  top: "kps_targets"
  decode_rois_param {
    num_class: 5
    num_object: 3
    
    prop_num: 128

    with_mask: true
    with_kps: false
    sigma: 0.0
    mask_w: 320
    mask_h: 224
    target_size: 28
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73

    thresh: 0.2

    net_w: 320
    net_h: 224
  }
}

layer {
  name: "region_loss"
  type: "RegionLoss"
  bottom: "conv_out"
  bottom: "label"
  top: "region_loss"
  loss_weight: 1.0
  region_loss_param {
    num_class: 5
    num_object: 3
    object_scale: 5.0
    noobject_scale: 1.0
    class_scale: 1.0
    coord_scale: 1.0
    softmax: false
    rescore: false
    with_mask: true
    mask_w: 320
    mask_h: 224
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
    thresh: 0.6
    bias_match: true
  }
}

layer {
  name: "mask_score"
  type: "Convolution"
  bottom: "pool5_2_conv6_relu" #
  top: "mask_score"
  param { lr_mult: 1.0 decay_mult: 1.0 }
  param { lr_mult: 2.0 decay_mult: 0 }
  convolution_param {
    num_output: 6  # 5 affordance classes + 1 background
    kernel_size: 1 pad: 0 
    weight_filler {type: "gaussian" std: 0.01 } #weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}

These are the parts I changed, but I got a shape mismatch in the RegionLoss layer:

I0919 17:20:49.254940  5132 net.cpp:84] Creating Layer region_loss
I0919 17:20:49.254946  5132 net.cpp:406] region_loss <- conv_out_conv_out_0_split_0
I0919 17:20:49.254954  5132 net.cpp:406] region_loss <- label_data_1_split_0
I0919 17:20:49.254966  5132 net.cpp:380] region_loss -> region_loss
I0919 17:20:49.255014  5132 region_loss_layer.cpp:72] mask w: 320 mask_h_: 224 truths_: 180
F0919 17:20:49.255053  5132 region_loss_layer.cpp:95] Check failed: outputs_ == bottom[0]->count(1) (8400 vs. 4200) 
*** Check failure stack trace: ***
    @     0x7feb0aab2cd4  google::LogMessage::Fail()
    @     0x7feb0aab2c18  google::LogMessage::SendToLog()
    @     0x7feb0aab2554  google::LogMessage::Flush()
    @     0x7feb0aab5f8b  google::LogMessageFatal::~LogMessageFatal()
    @     0x7feb0b19ff53  caffe::RegionLossLayer<>::LayerSetUp()
    @     0x7feb0b21de2c  caffe::Net<>::Init()
    @     0x7feb0b22053e  caffe::Net<>::Net()
    @     0x7feb0b234f6a  caffe::Solver<>::InitTrainNet()
    @     0x7feb0b236535  caffe::Solver<>::Init()
    @     0x7feb0b23684f  caffe::Solver<>::Solver()
    @     0x7feb0b255f91  caffe::Creator_SGDSolver<>()
    @           0x40e00a  train()
    @           0x40aec7  main
    @     0x7feb0993c830  __libc_start_main
    @           0x40b9a9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

@leon-liangwu
Owner

leon-liangwu commented Nov 18, 2019

@jinfagang I have updated the repo to train segmentation more easily. Thanks.

leon-liangwu reopened this Nov 18, 2019

@lucasjinreal

@leon-liangwu Does it support training on multiple classes now?

@leon-liangwu
Owner

@jinfagang The box, of course, has a category label; the instance mask itself is binary. So it certainly supports training on multi-class tasks. But you need to convert your labels to COCO format if you want to use the scripts provided.

@lucasjinreal

I mean multiple classes simultaneously in a single mask model. Does the loss function support it? How can I try it?

@leon-liangwu
Owner

@jinfagang If you are referring to AffordanceNet, it is not supported now. Multi-class object detection with instance mask is supported.
You need to modify the code.

@lucasjinreal

lucasjinreal commented Nov 19, 2019

Oh, yes, mask with boxes.
Do you have an example model for training? (It's hard to see which params should be edited in the prototxt model file.)

@leon-liangwu
Owner

leon-liangwu commented Nov 19, 2019

AffordanceNet is not supported now.
If you want to detect multiple classes, you just need to modify the RegionLoss and DecodeRois layers and the layer before them, which produces the feature map, in both steps.

layer {
  name: "conv_out"
  type: "Convolution"
  bottom: "conv6/0/ds1/det"
  top: "conv_out"
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  param {
    lr_mult: 0.0
    decay_mult: 0.0
  }
  convolution_param {
    num_output: 30  # (num_class+ 1 + 4) * num_object
    pad: 0
    kernel_size: 1
    stride: 1
    weight_filler {
      type: "msra"
    }
  }
}
layer {
  name: "decode_roi"
  type: "DecodeRois"
  bottom: "conv_out"
  bottom: "label"
  top: "rois"
  top: "roi_labels"
  top: "bbox_targets"
  top: "bbox_inside_weights"
  top: "bbox_outside_weights"
  top: "mask_targets"
  top: "kps_targets"
  decode_rois_param {
    num_class: 5
    num_object: 3
    
    prop_num: 128

    with_mask: true
    with_kps: false
    sigma: 0.0
    mask_w: 320
    mask_h: 224
    target_size: 28
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73

    thresh: 0.2

    net_w: 320
    net_h: 224
  }
}

layer {
  name: "region_loss"
  type: "RegionLoss"
  bottom: "conv_out"
  bottom: "label"
  top: "region_loss"
  loss_weight: 1.0
  region_loss_param {
    num_class: 5
    num_object: 3
    object_scale: 5.0
    noobject_scale: 1.0
    class_scale: 1.0
    coord_scale: 1.0
    softmax: false
    rescore: false
    with_mask: true
    mask_w: 320
    mask_h: 224
    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
    thresh: 0.6
    bias_match: true
  }
}
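
The comment on conv_out gives the channel count as (num_class + 1 + 4) * num_object. A small sketch of that arithmetic follows. The helper name is hypothetical, and reading the earlier check failure (8400 vs. 4200) as a 20 x 14 output grid, i.e. stride 16 on a 320 x 224 input, is an assumption that merely fits the numbers; it is not confirmed anywhere in this thread.

```python
def conv_out_channels(num_class, num_object):
    # Per anchor: num_class class scores + 1 objectness + 4 box coordinates,
    # following the "(num_class + 1 + 4) * num_object" comment on conv_out.
    return (num_class + 1 + 4) * num_object


channels = conv_out_channels(num_class=5, num_object=3)
print(channels)  # 30

# Assumption: a 20 x 14 output grid (320 / 16 by 224 / 16).
grid_cells = (320 // 16) * (224 // 16)
print(grid_cells * channels)  # 8400, the size RegionLoss expected
print(4200 // grid_cells)     # 15 channels per cell in the failing conv_out
```

Under that assumption, the failure is what one would see if conv_out kept a 15-channel output after num_class was raised to 5 in the loss layer.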

@lucasjinreal

What does num_object mean? Does it need to be modified?

@leon-liangwu
Owner

num_object is the number of anchors in the RegionLoss and DecodeRois layers.
Here num_object: 3, so there are three sets of anchors below.

    anchor_x: 1.33
    anchor_x: 2.86
    anchor_x: 7.25
    anchor_y: 3.05
    anchor_y: 7.02
    anchor_y: 10.73
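
A quick consistency check one might run on these values: the sketch below assumes the i-th anchor_x pairs with the i-th anchor_y (the thread does not state the pairing explicitly) and that num_object must equal the number of pairs. `pair_anchors` is a hypothetical helper, not part of the repo.

```python
def pair_anchors(anchor_x, anchor_y, num_object):
    # Assumption: anchors are matched by position, one (x, y) pair per anchor slot.
    if not (len(anchor_x) == len(anchor_y) == num_object):
        raise ValueError("num_object must match the number of anchor_x/anchor_y entries")
    return list(zip(anchor_x, anchor_y))


anchors = pair_anchors([1.33, 2.86, 7.25], [3.05, 7.02, 10.73], num_object=3)
print(anchors)  # [(1.33, 3.05), (2.86, 7.02), (7.25, 10.73)]
```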

@leon-liangwu
Owner

@jinfagang Hi, if you have any other problems with this repo, please feel free to let me know.
If you can train the model successfully, please close the issue to let me know it has been resolved.

@lucasjinreal

@leon-liangwu I will let you know when I start training. I'm afraid there will be some issues getting the model to run on multiple classes.

@leon-liangwu
Owner

Yes, please.

@monjha

monjha commented Jan 31, 2020

I think the proposals are selected this way:
The RoIAlign output will have dimensions [batch_size, prop_num, width, height, channels].
Algorithm: separate the positive boxes (boxes_score > threshold; 0.5 is the threshold used) from the negative boxes (boxes_score < threshold).
#out_boxes = prop_num.
If prop_num > #positive_boxes, add #negative_boxes = prop_num - #positive_boxes.
So after a couple of iterations you can check how many predicted boxes have boxes_score > threshold and how many negative boxes you are adding. If the ratio positive_boxes/prop_num is very low, then reducing prop_num probably will not hurt the accuracy too much?
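
The selection described in this comment could be sketched as follows. This is only an illustration of the commenter's guess, not the DecodeRois implementation; the function name, the truncation to the highest-scoring boxes, and the choice of which negatives to pad with are all assumptions.

```python
def select_proposals(scores, prop_num, threshold=0.5):
    """Return (indices, num_positive) for a proposal set of up to prop_num boxes.

    Positives are boxes with score > threshold. If there are fewer than
    prop_num of them, the set is padded with negatives (assumption: the
    highest-scoring negatives are taken first).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    positives = [i for i in order if scores[i] > threshold]
    negatives = [i for i in order if scores[i] <= threshold]
    chosen = positives[:prop_num]
    chosen += negatives[: prop_num - len(chosen)]
    return chosen, min(len(positives), prop_num)


scores = [0.9, 0.2, 0.7, 0.1, 0.4]
chosen, num_pos = select_proposals(scores, prop_num=4)
print(chosen, num_pos / 4)  # [0, 2, 4, 1] 0.5
```

Logging `num_pos / prop_num` over a few iterations gives the ratio the comment suggests checking before lowering prop_num.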
