
How to save checkpoints periodically? coco_eval AP on the val set stays 0 during training; total_loss is large #13

Open
5RJ opened this issue Dec 11, 2020 · 13 comments


@5RJ

5RJ commented Dec 11, 2020

Hello, author. I have a few questions I'd like to ask:

  1. I found that the current project only saves the model after training fully completes. How can I save checkpoints periodically? I installed detectron2 via pip install, then added a train function to DefaultTrainer in detectron2/engine/defaults.py (intending to override the train function in TrainerBase). The code is below (based on TrainerBase.train(), with one added print and code to periodically save the model):
```python
def train(self, start_iter: int, max_iter: int):
    """
    Args:
        start_iter, max_iter (int): See docs above
    """
    logger = logging.getLogger(__name__)
    logger.info("Starting training from iteration {}".format(start_iter))
    import ipdb; ipdb.set_trace()
    self.iter = self.start_iter = start_iter
    self.max_iter = max_iter

    with EventStorage(start_iter) as self.storage:
        try:
            self.before_train()
            print('!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!', start_iter, max_iter)
            for self.iter in range(start_iter, max_iter):
                self.before_step()
                self.run_step()
                self.after_step()
                # added: save a checkpoint every 100 iterations
                if self.iter % 100 == 0:
                    self.checkpointer.save("model_" + str(self.iter + 1))

            # self.iter == max_iter can be used by `after_train` to
            # tell whether the training successfully finished or failed
            # due to exceptions.
            self.iter += 1
        except Exception:
            logger.exception("Exception during training:")
            raise
        finally:
            self.after_train()
```
However, the print output never appears and no model gets saved. What is the correct way to do this?

  2. Observation: during training, the coco_eval AP on the val set is always 0.
    Setup: using coco/centernet_res50_coco.yaml for an object-detection task. The dataset is prepared in COCO format and trains and evaluates normally on xingyizhou's CenterNet repo.
    cfg changes made on centerX:
    cfg.DATASETS.TRAIN = ("table_aline_train",)
    cfg.DATASETS.TEST = ("table_aline_val",)
    cfg.DATALOADER.NUM_WORKERS = 2
    cfg.SOLVER.MAX_ITER = 30
    cfg.OUTPUT_DIR = "./output/table_aline"
    cfg.SOLVER.IMS_PER_BATCH = 8
    cfg.SOLVER.BASE_LR = 0.00125
    cfg.INPUT.MAX_SIZE_TRAIN = 1024
    cfg.INPUT.MIN_SIZE_TRAIN = 512

In addition, I registered my dataset in the main function with register_coco_instances.
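
For reference, a minimal sketch of that registration (the json/image paths here are placeholders, not my actual files):

```python
from detectron2.data.datasets import register_coco_instances

# hypothetical paths; point these at your own COCO-format annotations and images
register_coco_instances("table_aline_train", {},
                        "datasets/table_aline/annotations/instances_train2017.json",
                        "datasets/table_aline/train2017")
register_coco_instances("table_aline_val", {},
                        "datasets/table_aline/annotations/instances_val2017.json",
                        "datasets/table_aline/val2017")
```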

Training is run with the author's run.sh script on 2 GPUs.

train set: 700+ images
val set: 80+ images

The specific problem
During training, the COCO evaluation on the val set always comes out like the figure below:

After 2300+ iterations, total_loss dropped from 1281 to about 6.6. Many boxes produced at inference have scores close to 1, but their positions are far outside the image bounds (see the image size info below), for example:
{"image_id": 7, "category_id": 1, "bbox": [-120932.8515625, -51244.3125, 250420.453125, 95695.1640625], "score": 1.0}, {"image_id": 7, "category_id": 1, "bbox": [-146367.90625, -59846.8046875, 301889.0625, 119286.0078125], "score": 1.0}

Debugging attempted so far
Comparing total_loss with training on the original CenterNet (where the loss converges to about 0.8), I suspect the bboxes loaded by the dataloader may be wrong, so I printed the dataset-related information. For example:
In CenterNet.forward() in centerX/modeling/meta_arch/centernet.py, printing batched_inputs[0] gives:
{'file_name': '/mnt/maskrcnn-benchmark/datasets/table_aline/train2017/d-27.png', 'height': 2339, 'width': 1654, 'image_id': 174, 'image': tensor([[[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
...,
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.],
[170., 170., 170., ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]],

    [[170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     ...,
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.],
     [170., 170., 170.,  ..., 170., 170., 170.]]]), 'instances': Instances(num_instances=2, image_height=723, image_width=512, fields=[gt_boxes: Boxes(tensor([[ 16.7869,  44.9777, 473.3902, 106.7382],
    [ 15.7797, 415.2047, 476.4118, 686.4136]])), gt_classes: tensor([0, 0])])}

The corresponding entries in the annotations file are:
{"category_id": 1, "id": 317, "image_id": 174, "iscrowd": 0, "segmentation": [[137.76953125, 1297.650390625, 1509.9000000000015, 1297.650390625, 1509.9000000000015, 2105.5, 137.76953125, 2105.5]], "area": 1108576.0, "bbox": [138.0, 1298.0, 1372.0, 808.0]}
{"category_id": 1, "id": 316, "image_id": 174, "iscrowd": 0, "segmentation": [[146.541015625, 194.87890625, 1507.0552978515625, 194.87890625, 1507.0552978515625, 379.3728790283203, 146.541015625, 379.3728790283203]], "area": 250240.0, "bbox": [147.0, 195.0, 1360.0, 184.0]},

By computation, height/image_height ≈ width/image_width.
However, the original gt bboxes (converted to x1,y1,x2,y2 format: [138, 1298, 1510, 2106] and [147, 195, 1507, 379]) and the bboxes in batched_inputs do not follow that same height/width scaling ratio. Is this expected?
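
To make the mismatch concrete, here is a quick illustrative check with the numbers printed above (pure arithmetic, not code from the repo):

```python
# scale factors implied by the metadata above
print(2339 / 723)   # height / image_height ≈ 3.235
print(1654 / 512)   # width  / image_width  ≈ 3.230

# original gt box [147, 195, 1507, 379] divided by ~3.23
print([round(v / 3.23, 1) for v in (147, 195, 1507, 379)])
# -> roughly [45.5, 60.4, 466.6, 117.3], which does not match
#    the loaded box [16.8, 45.0, 473.4, 106.7]
```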
Surprisingly, though, when I uncommented the drawing code in the generate function in centerX/modeling/layers/centernet_gt.py and inspected many of the resulting images, the box positions looked fine.
I also noticed that each image actually has a different shape, but the generate function is only passed the shape of the last image in the current batch, and the subsequent gt for all images is produced according to that (post-scale) shape so that the score maps within a batch share one shape. Could this be the root cause? (The original CenterNet first resizes images to a uniform size, and only then does the downsampling, gt construction, etc.)

I'm quite lost about how to solve this and would really appreciate pointers from the author or anyone familiar with it. Many thanks!

@5RJ
Author

5RJ commented Dec 11, 2020

The image doesn't seem to display; its content is as follows:
COCOeval_opt.evaluate() finished in 0.16 seconds.
Accumulating evaluation results...
COCOeval_opt.accumulate() finished in 0.02 seconds.
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.001
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = -1.000
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.001
[12/11 10:13:53 d2.evaluation.coco_evaluation]: Evaluation results for bbox:
| AP | AP50 | AP75 | APs | APm | APl |
|:---:|:---:|:---:|:---:|:---:|:---:|
| 0.000 | 0.000 | 0.000 | nan | nan | 0.000 |
[12/11 10:13:53 d2.evaluation.coco_evaluation]: Some metrics cannot be computed and is shown as NaN.
[12/11 10:13:53 d2.engine.defaults]: Evaluation results for table_aline_val in csv format:
[12/11 10:13:53 d2.evaluation.testing]: copypaste: Task: bbox
[12/11 10:13:53 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl
[12/11 10:13:53 d2.evaluation.testing]: copypaste: 0.0000,0.0001,0.0000,nan,nan,0.0000

@CPFLAME
Collaborator

CPFLAME commented Dec 14, 2020

Thanks a lot for the detailed description! Now let's work through the problems:

  1. The correct way to save checkpoints periodically: cfg.SOLVER.CHECKPOINT_PERIOD. This field controls how often the model is saved; just add it under the solver section of your yaml:
SOLVER:
    CHECKPOINT_PERIOD: 2 (save a checkpoint every 2 epochs; adjust to your actual needs)
  2. It looks like you are training on your own private dataset; in that case remember to set CENTERNET's NUM_CLASSES to your own number of classes (see the cfg sketch after this list).
  3. Possible causes of the val AP being 0:
    1) You can reduce BASE_LR somewhat for your private dataset; if it is too large, training may fail to converge on a simple dataset.
    2) It is a bit odd that all of your image pixel values are 170, but since the boxes you drew on the images yourself looked fine, that is probably OK.
    3) centerX is implemented differently from the original CenterNet: it reuses detectron2's random crop, and each batch's shape may differ from the previous batch's, depending on the shapes of the images in the dataset.
  4. You could first check whether training on the original COCO dataset works normally, and then compare what changes your private dataset setup introduces.
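
As a minimal sketch, the same settings expressed as Python cfg overrides (MODEL.CENTERNET.NUM_CLASSES is an assumption about centerX's config layout, and the values are placeholders):

```python
# illustrative overrides only; adjust the values to your dataset
cfg.SOLVER.CHECKPOINT_PERIOD = 2         # periodic checkpointing, as described above
cfg.MODEL.CENTERNET.NUM_CLASSES = 1      # assumed key path; set to your own class count
cfg.SOLVER.BASE_LR = 0.00125             # try a smaller LR if training fails to converge
```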

@lbin

lbin commented Dec 14, 2020

@CPFLAME Training on COCO also shows a similar problem.

@CPFLAME
Collaborator

CPFLAME commented Dec 14, 2020

@lbin That's really bad news; for a while I suspected I'd gotten something wrong in my own code 0 0.

Could you share your config or your changes? Running the default centernet_res18_coco_0.5.yaml gives normal results for me.

@lbin

lbin commented Dec 14, 2020

With centernet_res18_coco_0.5.yaml, roughly 2-3 out of 10 runs give 0.0000 mAP, with nothing changed.

@CPFLAME
Collaborator

CPFLAME commented Dec 14, 2020

Ah, this...
This bug bothered me for quite a while; at one point I thought I had fixed it.

Adding COMMUNISM, or lowering BASE_LR, may make training more stable:

MODEL:
  CENTERNET:
    LOSS:
      COMMUNISM:
        ENABLE: True
        CLS_LOSS: 1.5
        WH_LOSS: 0.3
        OFF_LOSS: 0.1

@5RJ
Author

5RJ commented Dec 15, 2020

Thanks for the answer, I'll give it a try!

@zc-tx

zc-tx commented Dec 17, 2020

After setting up the environment following the README and training directly on the COCO dataset (./run.sh), the code fails to run:
At centerX/engine/defaults.py line 71, super(DefaultTrainer, self).__init__(model, data_loader, optimizer) is not defined in the stock Detectron2 source. Did the author modify the Detectron2 source? @CPFLAME
Solved by pip install -U 'git+https://github.com/CPFLAME/detectron2.git'

@lbin

lbin commented Dec 17, 2020

@zc-tx pip install -U 'git+https://github.com/CPFLAME/detectron2.git' is listed in https://github.com/CPFLAME/centerX/blob/master/README.md#requirements

@Fly-dream12

Hello, training with the res50 backbone gives fairly poor results. Are configs for other backbones provided? @CPFLAME

@CPFLAME
Collaborator

CPFLAME commented Dec 25, 2020

@Fly-dream12 Currently only resnet and regnet are provided; if you need others, you can add your own network under backbone.
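
For reference, a minimal sketch of how a custom backbone is typically registered in detectron2 (the class and feature names are illustrative, not part of centerX):

```python
import torch.nn as nn
from detectron2.modeling import BACKBONE_REGISTRY, Backbone, ShapeSpec

@BACKBONE_REGISTRY.register()
class MyToyBackbone(Backbone):  # hypothetical name
    def __init__(self, cfg, input_shape: ShapeSpec):
        super().__init__()
        # a single conv stage standing in for a real network
        self.stem = nn.Conv2d(input_shape.channels, 64, kernel_size=7, stride=4, padding=3)

    def forward(self, x):
        # return a dict of named feature maps, as detectron2 heads expect
        return {"res4": self.stem(x)}

    def output_shape(self):
        return {"res4": ShapeSpec(channels=64, stride=4)}
```

The new backbone would then be selected with cfg.MODEL.BACKBONE.NAME = "MyToyBackbone" in the config.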

@Fly-dream12

@CPFLAME Does this project include a feature loss? I don't seem to see one; could you provide an example?

@liujia761

@5RJ Hello, I'd like to train a student model with my own two-class data. How should I use this project for that?
