Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loss NaN about using vovnet as backbone in RetinaNet #8

Open
y200504040u opened this issue Mar 16, 2020 · 4 comments
Open

Loss NaN about using vovnet as backbone in RetinaNet #8

y200504040u opened this issue Mar 16, 2020 · 4 comments

Comments

@y200504040u
Copy link

y200504040u commented Mar 16, 2020

Hi! Thank you for your great work.
I wanted to improve RetinaNet project in detectron2/projects by replacing "retinanet_resnet_fpn_backbone" with "retinanet_vovnet_fpn_backbone".
However, I always encounterd "loss NaN" in period of less than 1000 iterations during training .
Training by "retinanet_resnet_fpn_backbone" is OK.

I want to make sure that I wasn't doing something wrong.

my config yaml:

_BASE_: "../Base-RetinaNet.yaml"
MODEL:
  WEIGHTS: "./pre_train/vovnet39_ese_detectron2.pth"
  RETINANET:
    NUM_CLASSES: 2
  BACKBONE:
    NAME: "build_retinanet_vovnet_fpn_backbone"
    FREEZE_AT: 0
  VOVNET:
    CONV_BODY : "V-39-eSE"
    OUT_FEATURES: ["stage3", "stage4", "stage5"]
  FPN:
    IN_FEATURES: ["stage3", "stage4", "stage5"]
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 270000
OUTPUT_DIR: "output/retina/V_39_ms_3x"

build_retinanet_vovnet_fpn_backbone

@BACKBONE_REGISTRY.register()
def build_retinanet_vovnet_fpn_backbone(cfg, input_shape: ShapeSpec):
    """
    Args:
        cfg: a detectron2 CfgNode

    Returns:
        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.
    """

    bottom_up = build_vovnet_backbone(cfg, input_shape)
    in_features = cfg.MODEL.FPN.IN_FEATURES
    out_channels = cfg.MODEL.FPN.OUT_CHANNELS
    in_channels_top = out_channels
    top_block = LastLevelP6P7(in_channels_top, out_channels, "p5")
    # in_channels_p6p7 = bottom_up.output_shape()["res5"].channels
    backbone = FPN(
        bottom_up=bottom_up,
        in_features=in_features,
        out_channels=out_channels,
        norm=cfg.MODEL.FPN.NORM,
        top_block=top_block,
        # top_block=LastLevelP6P7(in_channels_p6p7, out_channels),
        fuse_type=cfg.MODEL.FPN.FUSE_TYPE,
    )
    return backbone
@cxx921656591
Copy link

Nice copy LOL. By the way, I think it's because your learning rate is too big. I think you can try to lower it 10-100 times. And don't forget to longer your iteration.

@y200504040u
Copy link
Author

Nice copy LOL. By the way, I think it's because your learning rate is too big. I think you can try to lower it 10-100 times. And don't forget to longer your iteration.

cut-and-pasted😂...
I tried lower learning rate, I got loss without decreasing instead of loss explosion.
I read vovNet paper, author didn't use vovNet to be backbone in any object detection network except RefineDet in experiments.

@Cyril9227
Copy link

Same error, can't manage to fit a vovnet-lite-dw or a vovnet-19-dw, keep getting NaN loss. Vovnet-lite is fine tho, I have the feeling that there is something wrong with the depthwise convolution.

@lsrock1
Copy link

lsrock1 commented Apr 8, 2020

When I tested this kind of lightweight backbone in object detection (ex, mobilenet, shufflenet etc..), i set warm up iter longer.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants