Loss NaN about using vovnet as backbone in RetinaNet #8

y200504040u · 2020-03-16T09:57:36Z

Hi! Thank you for your great work.
I wanted to improve the RetinaNet project in detectron2/projects by replacing "retinanet_resnet_fpn_backbone" with "retinanet_vovnet_fpn_backbone".
However, I always encounter "loss NaN" within the first 1000 iterations of training.
Training with "retinanet_resnet_fpn_backbone" works fine.

I want to make sure that I'm not doing something wrong.

My config YAML:

_BASE_: "../Base-RetinaNet.yaml"
MODEL:
  WEIGHTS: "./pre_train/vovnet39_ese_detectron2.pth"
  RETINANET:
    NUM_CLASSES: 2
  BACKBONE:
    NAME: "build_retinanet_vovnet_fpn_backbone"
    FREEZE_AT: 0
  VOVNET:
    CONV_BODY: "V-39-eSE"
    OUT_FEATURES: ["stage3", "stage4", "stage5"]
  FPN:
    IN_FEATURES: ["stage3", "stage4", "stage5"]
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 270000
OUTPUT_DIR: "output/retina/V_39_ms_3x"

My build_retinanet_vovnet_fpn_backbone:

@BACKBONE_REGISTRY.register()
def build_retinanet_vovnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """

    bottom_up = build_vovnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    in_channels_top = out_channels
    top_block = LastLevelP6P7(in_channels_top, out_channels, "p5")
    # in_channels_p6p7 = bottom_up.output_shape()["res5"].channels
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=top_block,
        # top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone

cxx921656591 · 2020-03-17T01:28:03Z

Nice copy LOL. By the way, I think it's because your learning rate is too high. You could try lowering it by a factor of 10 to 100. And don't forget to lengthen your training schedule (more iterations) to compensate.
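
A minimal sketch of that suggestion, for illustration only. The helper `scale_schedule` and the `iter_multiplier` value are hypothetical, and the starting learning rate of 0.01 is an assumption about what Base-RetinaNet.yaml provides (the config above does not set it explicitly); STEPS and MAX_ITER are taken from the posted config. The resulting values would go under SOLVER in the YAML.

```python
def scale_schedule(base_lr, steps, max_iter, lr_divisor, iter_multiplier):
    """Return solver settings with a lower learning rate and a longer schedule.

    lr_divisor:      how much to divide the learning rate by (e.g. 10 to 100).
    iter_multiplier: how much to stretch the schedule (hypothetical knob; the
                     thread only says to train for longer, not by how much).
    """
    if lr_divisor <= 0 or iter_multiplier <= 0:
        raise ValueError("lr_divisor and iter_multiplier must be positive")
    return {
        "BASE_LR": base_lr / lr_divisor,
        # Decay milestones are stretched together with the total length.
        "STEPS": tuple(int(round(s * iter_multiplier)) for s in steps),
        "MAX_ITER": int(round(max_iter * iter_multiplier)),
    }


if __name__ == "__main__":
    # 0.01 is an assumed base learning rate, not a value from the posted config.
    solver = scale_schedule(
        base_lr=0.01,
        steps=(210000, 250000),
        max_iter=270000,
        lr_divisor=10,
        iter_multiplier=2,
    )
    for key, value in solver.items():
        print(f"{key}: {value}")
```

Keeping the decay milestones proportional to MAX_ITER is a design choice of this sketch, so that the shape of the schedule is preserved when it is stretched.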

y200504040u · 2020-03-17T08:15:06Z

cut-and-pasted😂...
I tried a lower learning rate: the loss no longer explodes, but it doesn't decrease either.
I read the VoVNet paper, and in the experiments the authors didn't use VoVNet as the backbone of any object detection network except RefineDet.

Cyril9227 · 2020-03-30T01:46:39Z

Same error here: I can't manage to train a vovnet-lite-dw or a vovnet-19-dw, I keep getting NaN loss. Vovnet-lite is fine though, so I have the feeling that something is wrong with the depthwise convolution.

lsrock1 · 2020-04-08T03:29:09Z

When I tested this kind of lightweight backbone in object detection (e.g., MobileNet, ShuffleNet), I set a longer warm-up (more warm-up iterations).
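
To show what a longer warm-up changes, here is a rough, self-contained sketch of a linear warm-up followed by multi-step decay. It is not detectron2's scheduler code; `lr_at_iter` is a hypothetical helper, and the base learning rate (0.01), decay factor (0.1), warm-up factor (1/1000), and the 1000 vs. 5000 warm-up lengths are assumed values for illustration. Only the STEPS milestones come from the config posted above.

```python
from bisect import bisect_right


def lr_at_iter(it, base_lr, steps, gamma=0.1, warmup_iters=1000, warmup_factor=1e-3):
    """Learning rate at iteration `it`: linear warm-up, then multi-step decay."""
    if it < warmup_iters:
        # Ramp the multiplier linearly from warmup_factor up to 1.0.
        alpha = it / warmup_iters
        warmup = warmup_factor * (1 - alpha) + alpha
    else:
        warmup = 1.0
    # Multiply by gamma once for every milestone already reached.
    return base_lr * warmup * gamma ** bisect_right(steps, it)


if __name__ == "__main__":
    steps = (210000, 250000)  # milestones from the posted config
    for warmup_iters in (1000, 5000):  # assumed short vs. long warm-up
        lr = lr_at_iter(500, base_lr=0.01, steps=steps, warmup_iters=warmup_iters)
        print(f"warmup_iters={warmup_iters}: lr at iteration 500 = {lr:.6f}")
```

With the longer warm-up, the learning rate at a given early iteration is smaller, which is the effect the comment above relies on to keep early training stable.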

y200504040u closed this as completed Mar 16, 2020

y200504040u reopened this Mar 16, 2020
