
How to resume learning? #2569

Open
kazuma0606 opened this issue May 12, 2022 · 10 comments

@kazuma0606

❓ Questions/Help/Support

Hi, support team.
This is my first time asking a question.
I believe the following code saves the checkpoints:

===================================================
checkpoint_path = "/tmp/cycle_gan_checkpoints/checkpoint_26500.pt"

# let's save this checkpoint to W&B
if wb_logger is not None:
    wb_logger.save(checkpoint_path)
===================================================

If the learning process takes a long time, there can be interruptions along the way.
In such a case, what code can I use to resume learning?
I look forward to hearing from you. Regards.

@vfdev-5
Collaborator

vfdev-5 commented May 12, 2022

Hi @kazuma0606 ,
Please check the following docs and examples:

In a few lines of code, you can do the following:

from pathlib import Path

import torch
from ignite.handlers import Checkpoint

to_save = {"trainer": trainer, "model": model, "optimizer": optimizer, "lr_scheduler": lr_scheduler}

resume_from = config["resume_from"]
if resume_from is not None:
    checkpoint_fp = Path(resume_from)
    assert checkpoint_fp.exists(), f"Checkpoint '{checkpoint_fp.as_posix()}' is not found"
    logger.info(f"Resume from a checkpoint: {checkpoint_fp.as_posix()}")
    checkpoint = torch.load(checkpoint_fp.as_posix(), map_location="cpu")
    Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

HTH

@kazuma0606
Author

kazuma0606 commented May 12, 2022

Thanks for the reply.
I tried the following link:
https://github.com/pytorch/ignite/blob/master/examples/notebooks/CycleGAN_with_torch_cuda_amp.ipynb

Am I correct in assuming that the code below does not include the epoch and loss information needed to resume learning?
Regards.

from ignite.handlers import ModelCheckpoint, TerminateOnNan

checkpoint_handler = ModelCheckpoint(
    dirname="/content/drive/My Drive/Colab Notebooks/CycleGAN_Project/pytorch-CycleGAN-and-pix2pix/datasets/T1W2T2W/cpk",
    filename_prefix="", require_empty=False
)

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,
}

trainer.add_event_handler(Events.ITERATION_COMPLETED(every=500), checkpoint_handler, to_save)
trainer.add_event_handler(Events.ITERATION_COMPLETED, TerminateOnNan())

@vfdev-5
Collaborator

vfdev-5 commented May 12, 2022

@kazuma0606 Yes, you are correct. In order to save the epoch and iteration, we need to save the trainer as well:

to_save = {
    "generator_A2B": generator_A2B,
    "discriminator_B": discriminator_B,
    "generator_B2A": generator_B2A,
    "discriminator_A": discriminator_A,
    
    "optimizer_G": optimizer_G,
    "optimizer_D": optimizer_D,

    "trainer": trainer
}

As for the batch loss, there is no need to save it: once the models are restored, they will give similar batch loss values.
As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.
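
Below is a minimal sketch (not from the thread) of how a checkpoint written from the to_save dict above, with "trainer" included, could be restored so that training continues from the saved epoch and iteration. The checkpoint path and the train_loader name are placeholders.

# Hedged resume sketch, assuming the to_save dict above (with "trainer") was
# passed to the ModelCheckpoint handler. Path and `train_loader` are placeholders.
import torch
from ignite.handlers import Checkpoint

checkpoint_fp = "/path/to/cpk/checkpoint_26500.pt"  # hypothetical checkpoint file
checkpoint = torch.load(checkpoint_fp, map_location="cpu")
Checkpoint.load_objects(to_load=to_save, checkpoint=checkpoint)

# The trainer's counters come back along with the model/optimizer weights.
print(trainer.state.epoch, trainer.state.iteration)

# Running again picks up from the restored epoch/iteration; the restored
# max_epochs is reused when none is passed here.
trainer.run(train_loader)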

@kazuma0606
Author

Hi, @vfdev-5
Thanks for the reply.
I was able to resume training without incident.
I am a little curious: is the following feature valuable from an academic point of view?

As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.

Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account where I can reach you?

Regards.

@vfdev-5
Collaborator

vfdev-5 commented May 13, 2022

Hi @kazuma0606

I am a little curious: is the following feature valuable from an academic point of view?

I'm not sure about the academic PoV, but if it is about deterministic training and reproducibility while resuming from a checkpoint, there are a few things to take into account (see the sketch after the link below):

  • dataflow and random ops
  • resuming from start of the epoch or middle of the epoch (from iteration)

More info: https://pytorch.org/ignite/engine.html#deterministic-training
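
As a minimal sketch of the second point (not part of the original answer): the deterministic-training page describes DeterministicEngine, which can be used instead of Engine so that the dataflow is replayed when resuming mid-epoch. The training step and data below are dummy placeholders, not the CycleGAN notebook's code.

# Hedged sketch: DeterministicEngine instead of Engine for resumable, reproducible runs.
import torch
from ignite.engine import DeterministicEngine


def train_step(engine, batch):
    # In the CycleGAN example this would run the forward/backward/optimizer steps;
    # here we just return a dummy "loss" so the sketch stays runnable.
    return {"loss": float(batch.sum())}


trainer = DeterministicEngine(train_step)

# Dummy dataflow; in the real run this would be the CycleGAN DataLoader.
train_loader = [torch.ones(2), torch.ones(2), torch.ones(2)]

# When the trainer's state is restored (e.g. via Checkpoint.load_objects),
# DeterministicEngine replays the dataflow so the resumed iterations see the
# same batches and random ops as an uninterrupted run would have.
trainer.run(train_loader, max_epochs=2)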

Since this is a topic not really related to this issue, could you share an e-mail address or a social networking account where I can reach you?

We have dedicated channels to communicate with the team:

See also : https://github.com/pytorch/ignite#communication

As for running-average losses (RunningAverage), unfortunately, they can't be restored. It's still a feature we would like to have.

We can try to prioritize this feature. A related issue is already open: #966

@kazuma0606
Author

Hi, @vfdev-5
Sorry for the delay in responding.
Thank you for your contact information.
I will email you separately about topics not related to this case.
By the way, regarding the following notebook:
CycleGAN_with_torch_cuda_amp.ipynb
In this notebook, for the following functions:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)

def log_generated_images(engine, logger, event_name):

Functions run_evaluation() and log_generated_images() are called automatically at the start of training and can capture variables like lambda expressions.
Am I correct in my understanding?

@vfdev-5
Collaborator

vfdev-5 commented May 17, 2022

Hi @kazuma0606

Functions run_evaluation() and log_generated_images() are called automatically at the start of training

The complete code is the following:

@trainer.on(Events.EPOCH_STARTED)
def run_evaluation(engine):
    evaluator.run(eval_train_loader)
    evaluator.run(eval_test_loader)


def log_generated_images(engine, logger, event_name):

    # ...

tb_logger.attach(evaluator,
                 log_handler=log_generated_images, 
                 event_name=Events.COMPLETED)

As you can see, the trainer has run_evaluation attached on EPOCH_STARTED, so every time an epoch starts it executes run_evaluation. Then you can see that tb_logger attaches log_generated_images on COMPLETED for the evaluator engine. Thus, the trainer calls run_evaluation, where the evaluator runs, and once it is done (completed) log_generated_images is called via tb_logger.

can capture variables like lambda expressions.

Yes, I think you can use any variables from your global scope in these functions. If you want to pass an argument explicitly, you can do something like:

another_lambda = lambda: "check another lambda"

@trainer.on(Events.EPOCH_STARTED, lambda: "check lambda")
def run_evaluation(engine, fn):
    print(fn(), another_lambda())

@kazuma0606
Author

Hi, @vfdev-5
Thank you for the very clear explanation.
By the way, training was interrupted in the middle of a run with the following message:

2022-05-18 01:25:03,792 ignite.handlers.terminate_on_nan.TerminateOnNan WARNING: TerminateOnNan: Output '{'loss_generators': nan, 'loss_generator_a2b': nan, 'loss_generator_b2a': 1.1836220026016235, 'loss_discriminators': 0.06603499501943588, 'loss_discriminator_a': 0.10396641492843628, 'loss_discriminator_b': 0.028103578835725784}' contains NaN or Inf. Stop training
State:
	iteration: 1878924
	epoch: 81
	epoch_length: 23250
	max_epochs: 200
	max_iters: <class 'NoneType'>
	output: <class 'dict'>
	batch: <class 'dict'>
	metrics: <class 'dict'>
	dataloader: <class 'torch.utils.data.dataloader.DataLoader'>
	seed: <class 'NoneType'>
	times: <class 'dict'>

By the way, is the TerminateOnNan handler meant to suppress overfitting?
Also, if this flag is triggered, is there any point in training further?
Or, if I increase the number of training cases, will I be able to run more epochs?
Does mixed-precision training also have any positive effects?

Sorry for all the questions.
Regards.

@kazuma0606
Author

I don't know if this is relevant, but I had to prepare the training dataset on my own.
I was training with brain MRI images, but it seems that the types of images for training and test were different.
Is it possible that training stops early in such cases?

@vfdev-5
Collaborator

vfdev-5 commented May 18, 2022

Hi @kazuma0606

By the way, is the TerminateOnNan handler meant to suppress overfitting?

When the loss goes NaN, learning is not possible any more because the weights become NaN as well, and we just waste resources. The TerminateOnNan handler stops the training as soon as NaN is encountered.

The loss can go NaN in various cases (see the sketch after this list):

  • LR too large => decrease the learning rate and restart the training
  • Mixed precision => disable the AMP flag and restart the training
  • Maybe, if the input data is corrupted
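
As a minimal sketch of the first point, lowering the learning rate before restarting could look like this; the model and optimizer below are dummies, not the notebook's CycleGAN generators/discriminators or their optimizers.

# Hedged sketch: decrease the LR on all parameter groups, then restart the run.
import torch

model = torch.nn.Linear(4, 4)  # dummy stand-in for a generator/discriminator
optimizer_G = torch.optim.Adam(model.parameters(), lr=2e-4)

for param_group in optimizer_G.param_groups:
    param_group["lr"] /= 10    # e.g. 2e-4 -> 2e-5

print(optimizer_G.param_groups[0]["lr"])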

Or, if I increase the number of training cases, will I be able to run more epochs?

I'm not sure I understand your point here, sorry.

Does mixed-precision training also have any positive effects?

Yes, less GPU memory usage and faster training on NVIDIA GPUs with Tensor Cores.
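
For reference, here is a minimal sketch of the usual torch.cuda.amp pattern (autocast + GradScaler) behind those savings; the model, optimizer, and data are dummies, and the sketch falls back to FP32 when no GPU is available.

# Hedged sketch of a mixed-precision training step with torch.cuda.amp.
import torch

with_amp = torch.cuda.is_available()
device = "cuda" if with_amp else "cpu"

model = torch.nn.Linear(4, 4).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)
scaler = torch.cuda.amp.GradScaler(enabled=with_amp)

x = torch.randn(8, 4, device=device)
optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=with_amp):
    loss = model(x).pow(2).mean()  # forward pass runs in reduced precision under autocast
scaler.scale(loss).backward()      # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()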

I was training with brain MRI images, but it seems that the types of images for training and test were different.
Is it possible that training stops early in such cases?

I do not think your data is responsible for the NaN; try the two points above first and see if it helps.
