
How to resume learning? #2569

Open
kazuma0606 opened this issue May 12, 2022 · 10 comments

@kazuma0606

❓ Questions/Help/Support

Hi, support team.
This is my first time asking a question.
I believe the following code saves the checkpoints:

===================================================
checkpoint_path = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"

# let's save this checkpoint to W&B
if wb_logger is not None:
    wb_logger.save(checkpoint_path)
===================================================

If the learning process takes a long time, there can be interruptions along the way.
In such a case, what code can I use to resume learning?
I look forward to hearing from you. Regards.

@vfdev-5
Collaborator

vfdev-5 commented May 12, 2022

Hi @kazuma0606 ,
Please check the following docs and examples:

In a few lines of code, you can do the following:

from pathlib import Path

import torch
from ignite.handlers import Checkpoint

to_save = {"trainer": trainer, "model": model, "optimizer": optimizer, "lr_scheduler": lr_scheduler}

resume_from = config["resume_from"]
if resume_from is not None:
    checkpoint_fp = Path(resume_from)
    assert checkpoint_fp.exists(), f"Checkpoint '{checkpoint_fp.as_posix()}' is not found"
    logger.info(f"Resume from a checkpoint: {checkpoint_fp.as_posix()}")
    checkpoint = torch.load(checkpoint_fp.as_posix(), map_location="cpu")
    Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

HTH

@kazuma0606
Author

kazuma0606 commented May 12, 2022

Thanks for the reply.
I tried the following link:
https://github.com/pytorch/ignite/blob/master/examples/notebooks/CycleGAN_with_torch_cuda_amp.ipynb

Am I correct in assuming that the code below does not include the epoch and loss information needed to resume learning?
Regards.

from ignite.handlers import ModelCheckpoint, TerminateOnNan

checkpoint_handler = ModelCheckpoint(
    dirname="/content/drive/My Drive/Colab Notebooks/CycleGAN_Project/pytorch-CycleGAN-and-pix2pix/datasets/T1W2T2W/cpk",
    filename_prefix="", require_empty=False
)

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
}

trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler, to_save)
trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())

@vfdev-5
Collaborator

vfdev-5 commented May 12, 2022

@kazuma0606 Yes, you are correct. In order to save the epoch and iteration, we need to save the trainer as well:

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,

    "trainer": trainer
}

As for the batch loss, there is no need to save it: once the models are restored, they will give similar batch loss values.
As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.
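
Below is a minimal sketch (not from the thread) of how a checkpoint written from the to_save dict above, with "trainer" included, could be restored so that training continues from the saved epoch and iteration. The checkpoint path and the train_loader name are placeholders.

# Hedged resume sketch, assuming the to_save dict above (with "trainer") was
# passed to the ModelCheckpoint handler. Path and `train_loader` are placeholders.
import torch
from ignite.handlers import Checkpoint

checkpoint_fp = "/path/to/cpk/checkpoint_26500.pt"  # hypothetical checkpoint file
checkpoint = torch.load(checkpoint_fp, map_location="cpu")
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

# The trainer's counters come back along with the model/optimizer weights.
print(trainer.state.epoch, trainer.state.iteration)

# Running again picks up from the restored epoch/iteration; the restored
# max_epochs is reused when none is passed here.
trainer.run(train_loader)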

@kazuma0606
Author

Hi, @vfdev-5
Thanks for the reply.
I was able to resume training without incident.
I am a little curious: is the following feature valuable from an academic point of view?

As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.

Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account where I can reach you?

Regards.

@vfdev-5
Collaborator

vfdev-5 commented May 13, 2022

Hi @kazuma0606

I am a little curious: is the following feature valuable from an academic point of view?

I'm not sure about the academic PoV, but if it is about deterministic training and reproducibility while resuming from a checkpoint, there are a few things to take into account (see the sketch after the link below):

  • dataflow and random ops
  • resuming from start of the epoch or middle of the epoch (from iteration)

More info: https://pytorch.org/ignite/engine.html#deterministic-training
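
As a minimal sketch of the second point (not part of the original answer): the deterministic-training page describes DeterministicEngine, which can be used instead of Engine so that the dataflow is replayed when resuming mid-epoch. The training step and data below are dummy placeholders, not the CycleGAN notebook's code.

# Hedged sketch: DeterministicEngine instead of Engine for resumable, reproducible runs.
import torch
from ignite.engine import DeterministicEngine


def train_step(engine, batch):
    # In the CycleGAN example this would run the forward/backward/optimizer steps;
    # here we just return a dummy "loss" so the sketch stays runnable.
    return {"loss": float(batch.sum())}


trainer = DeterministicEngine(train_step)

# Dummy dataflow; in the real run this would be the CycleGAN DataLoader.
train_loader = [torch.ones(2), torch.ones(2), torch.ones(2)]

# When the trainer's state is restored (e.g. via Checkpoint.load_objects),
# DeterministicEngine replays the dataflow so the resumed iterations see the
# same batches and random ops as an uninterrupted run would have.
trainer.run(train_loader, max_epochs=2)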

Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account where I can reach you?

We have dedicated channels to communicate with the team:

See also : https://github.com/pytorch/ignite#communication

As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.

We can try to prioritize this feature. A related issue is already open: #966

@kazuma0606
Author

Hi, @vfdev-5
Sorry for the delay in responding.
Thank you for your contact information.
I will email you separately about topics not related to this case.
By the way, regarding the following notebook:
CycleGAN_with_torch_cuda_amp.ipynb
In this notebook, for the following functions:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)

def log_generated_images(engine, logger, event_name):

Functions run_evaluation() and log_generated_images() are called automatically at the start of training and can capture variables like lambda expressions.
Am I correct in my understanding?

@vfdev-5
Collaborator

vfdev-5 commented May 17, 2022

Hi @kazuma0606

Functions run_evaluation() and log_generated_images() are called automatically at the start of training

The complete code is the following:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)


def log_generated_images(engine, logger, event_name):

    # ...

tb_logger.attach(evaluator,
                 log_handler=log_generated_images, 
                 event_name=Events.COMPLETED)

As you can see, the trainer has run_evaluation attached on EPOCH_STARTED, so every time an epoch starts it executes run_evaluation. Then you can see that tb_logger attaches log_generated_images on COMPLETED for the evaluator engine. Thus, the trainer calls run_evaluation, where the evaluator runs, and once it is done (completed) log_generated_images is called via tb_logger.

can capture variables like lambda expressions.

Yes, I think you can use any variables from your global scope in these functions. If you want to pass an argument explicitly, you can do something like:

another_lambda = lambda: "check another lambda"

@trainer.on(Events.EPOCH_STARTED, lambda: "check lambda")
def run_evaluation(engine, fn):
    print(fn(), another_lambda())

@kazuma0606
Author

Hi, @vfdev-5
Thank you for the very clear explanation.
By the way, training was interrupted in the middle of a run with the following message:

2022-05-18 01:25:03,792 ignite.handlers.terminate_on_nan.TerminateOnNan WARNING: TerminateOnNan: Output '{'loss_generators': nan, 'loss_generator_a2b': nan, 'loss_generator_b2a': 1.1836220026016235, 'loss_discriminators': 0.06603499501943588, 'loss_discriminator_a': 0.10396641492843628, 'loss_discriminator_b': 0.028103578835725784}' contains NaN or Inf. Stop training
State:
	iteration: 1878924
	epoch: 81
	epoch_length: 23250
	max_epochs: 200
	max_iters: <class 'NoneType'>
	output: <class 'dict'>
	batch: <class 'dict'>
	metrics: <class 'dict'>
	dataloader: <class 'torch.utils.data.dataloader.DataLoader'>
	seed: <class 'NoneType'>
	times: <class 'dict'>

By the way, is the TerminateOnNan handler meant to suppress overfitting?
Also, if this flag is triggered, is there any point in training further?
Or, if I increase the number of training cases, will I be able to run more epochs?
Does mixed-precision training also have any positive effects?

Sorry for all the questions.
Regards.

@kazuma0606
Author

I don't know if this is relevant, but I had to prepare the training dataset on my own.
I was training with brain MRI images, but it seems that the types of images for training and test were different.
Is it possible that training stops early in such cases?

@vfdev-5
Collaborator

vfdev-5 commented May 18, 2022

Hi @kazuma0606

By the way, is the TerminateOnNan handler meant to suppress overfitting?

When the loss goes NaN, learning is not possible any more because the weights become NaN as well, and we just waste resources. The TerminateOnNan handler stops the training as soon as NaN is encountered.

The loss can go NaN in various cases (see the sketch after this list):

  • LR too large => decrease the learning rate and restart the training
  • Mixed precision => disable the AMP flag and restart the training
  • Maybe, if the input data is corrupted
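
As a minimal sketch of the first point, lowering the learning rate before restarting could look like this; the model and optimizer below are dummies, not the notebook's CycleGAN generators/discriminators or their optimizers.

# Hedged sketch: decrease the LR on all parameter groups, then restart the run.
import torch

model = torch.nn.Linear(4, 4)  # dummy stand-in for a generator/discriminator
optimizer_G = torch.optim.Adam(model.parameters(), lr=2e-4)

for param_group in optimizer_G.param_groups:
    param_group["lr"] /= 10    # e.g. 2e-4 -> 2e-5

print(optimizer_G.param_groups[0]["lr"])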

Or, if I increase the number of training cases, will I be able to run more epochs?

I'm not sure I understand your point here, sorry.

Does mixed-precision training also have any positive effects?

Yes, less GPU memory usage and faster training on NVIDIA GPUs with Tensor Cores.
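
For reference, here is a minimal sketch of the usual torch.cuda.amp pattern (autocast + GradScaler) behind those savings; the model, optimizer, and data are dummies, and the sketch falls back to FP32 when no GPU is available.

# Hedged sketch of a mixed-precision training step with torch.cuda.amp.
import torch

with_amp = torch.cuda.is_available()
device = "cuda" if with_amp else "cpu"

model = torch.nn.Linear(4, 4).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=with_amp)

x = torch.randn(8, 4, device=device)
optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=with_amp):
    loss = model(x).pow(2).mean()  # forward pass runs in reduced precision under autocast
scaler.scale(loss).backward()      # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()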

I was training with brain MRI images, but it seems that the types of images for training and test were different.
Is it possible that training stops early in such cases?

I do not think your data is responsible for the NaN; try the two points above first and see if it helps.
