problem with resuming training from checkpoint #62

Open
mailtohrishi opened this issue Nov 16, 2022 · 0 comments

Hi... I am getting the following error while resuming training from a checkpoint on a single-GPU system. Training ran fine when started from the 0th iteration, but it exited immediately after loading a checkpoint. The relevant excerpt I modified in the run script for this purpose is shown below. Is this a bug, or have I made a mistake somewhere?

(command used)
sh scripts/cityscapes/ocrnet/run_r_101_d_8_ocrnet_train.sh resume x3

(modification in the run script)
elif [ "$1"x == "resume"x ]; then
${PYTHON} -u main.py --configs '$'{CONFIGS} \
--drop_last y \
--phase train \
--gathered n \
--loss_balance y \
--log_to_file n \
--backbone ${BACKBONE} \
--model_name ${MODEL_NAME} \
--max_iters ${MAX_ITERS} \
--data_dir ${DATA_DIR} \
--loss_type ${LOSS_TYPE} \
--resume_continue y \
--resume ${CHECKPOINTS_ROOT}/checkpoints/bottle/'$'{CHECKPOINTS_NAME}_latest.pth \
--checkpoints_name ${CHECKPOINTS_NAME} \
--distributed False \
2>&1 | tee -a ${LOG_FILE}
#--gpu 0 1 2 3 **

2022-11-16 11:30:47,097 INFO [module_runner.py, 87] Loading checkpoint from /workspace/data/defGen/graphics/Pre_CL_x3//..//checkpoints/bottle/spatial_ocrnet_deepbase_resnet101_dilated8_x3_latest.pth...
2022-11-16 11:30:47,283 INFO [trainer.py, 90] Params Group Method: None
2022-11-16 11:30:47,285 INFO [optim_scheduler.py, 96] Use lambda_poly policy with default power 0.9
2022-11-16 11:30:47,285 INFO [data_loader.py, 132] use the DefaultLoader for train...
2022-11-16 11:30:47,773 INFO [default_loader.py, 38] train 501
2022-11-16 11:30:47,774 INFO [data_loader.py, 164] use DefaultLoader for val ...
2022-11-16 11:30:47,873 INFO [default_loader.py, 38] val 126
2022-11-16 11:30:47,873 INFO [loss_manager.py, 66] use loss: fs_auxce_loss.
2022-11-16 11:30:47,874 INFO [loss_manager.py, 55] use DataParallelCriterion loss
2022-11-16 11:30:48,996 INFO [data_helper.py, 126] Input keys: ['img']
2022-11-16 11:30:48,996 INFO [data_helper.py, 127] Target keys: ['labelmap']
Traceback (most recent call last):
File "main.py", line 227, in
model.train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 390, in train
self.__train()
File "/workspace/defGen/External/ContrastiveSeg-main/segmentor/trainer.py", line 196, in __train
backward_loss = display_loss = self.pixel_loss(outputs, targets,
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/extensions/parallel/data_parallel.py", line 125, in forward
return self.module(inputs[0], *targets[0], **kwargs[0])
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 309, in forward
seg_loss = self.ce_loss(seg_out, targets)
File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
return forward_call(*input, **kwargs)
File "/workspace/defGen/External/ContrastiveSeg-main/lib/loss/loss_helper.py", line 203, in forward
target = self._scale_target(targets[0], (inputs.size(2), inputs.size(3)))
IndexError: Dimension out of range (expected to be in range of [-3, 2], but got 3)
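
In case it helps with triage: the shapes below are made up, but the error message looks like what PyTorch raises when .size(3) is called on a tensor that only has three dimensions, i.e. an unbatched (C, H, W) output where a batched (N, C, H, W) tensor is expected. A minimal sketch reproducing the same IndexError:

import torch

# Hypothetical shapes, only to reproduce the error message above:
seg_out = torch.randn(19, 128, 256)   # 3-D tensor (C, H, W), batch dimension missing
print(seg_out.size(2))                # 256 -> fine

try:
    seg_out.size(3)                   # a 3-D tensor only has dims in [-3, 2]
except IndexError as e:
    print(e)  # Dimension out of range (expected to be in range of [-3, 2], but got 3)

So it may be that on the resume path (with --distributed False) the model output or the targets lose their batch dimension before reaching the loss, but I have not been able to confirm where.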
