
ArithmeticError: NaN detected in train loss #102

Open

davidecarnevali opened this issue Aug 10, 2023 · 2 comments

@davidecarnevali

I ran the example code for the pipeline shown below:

from pythae.pipelines import TrainingPipeline
from pythae.models import VAE, VAEConfig
from pythae.trainers import BaseTrainerConfig
from torch.utils.data import Dataset
import torchvision.datasets as datasets
import torchvision.transforms as T
from pythae.data.datasets import DatasetOutput

data_transform = T.ToTensor()
data_folder = "./data/"
class MyCustomDataset(datasets.ImageFolder):
    
    def __init__(self, root, transform=None, target_transform=None):
        super().__init__(root=root, transform=transform, target_transform=target_transform)
        
    def __getitem__(self, index):
        X, _ = super().__getitem__(index)
        return DatasetOutput(
            data=X
        )


train_dataset = MyCustomDataset(
    root=data_folder + "/train",
    transform=data_transform,
)

eval_dataset = MyCustomDataset(
    root=data_folder + "/val", 
    transform=data_transform
)

# Set up the training configuration
my_training_config = BaseTrainerConfig(
    output_dir='Pythae_model',
    num_epochs=500,
    learning_rate=1e-3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    train_dataloader_num_workers=2,
    eval_dataloader_num_workers=2,
    steps_saving=20,
    optimizer_cls="AdamW",
    optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
    scheduler_cls="ReduceLROnPlateau",
    scheduler_params={"patience": 5, "factor": 0.5}
)
# Set up the model configuration 
my_vae_config = VAEConfig(
    input_dim=(3, 512, 512),
    latent_dim=10
)
# Build the model
my_vae_model = VAE(
    model_config=my_vae_config
)
# Build the Pipeline
pipeline = TrainingPipeline(
    training_config=my_training_config,
    model=my_vae_model
)
# Launch the Pipeline
pipeline(
    train_data=train_dataset, # must be torch.Tensor, np.array or torch datasets
    eval_data=eval_dataset # must be torch.Tensor, np.array or torch datasets
)

It used to work (at least when training for 50 epochs), but now, running it with 500 epochs, I get the following error within the first few epochs:

---------------------------------------------------------------------------
ArithmeticError                           Traceback (most recent call last)
/tmp/ipykernel_31933/3300194478.py in <module>
     63 pipeline(
     64     train_data=train_dataset, # must be torch.Tensor, np.array or torch datasets
---> 65     eval_data=eval_dataset # must be torch.Tensor, np.array or torch datasets
     66 )

/nfs/users/pcosma/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/pipelines/training.py in __call__(self, train_data, eval_data, callbacks)
    239         self.trainer = trainer
    240 
--> 241         trainer.train()

/nfs/users/pcosma/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/trainers/base_trainer/base_trainer.py in train(self, log_output_dir)
    431             metrics = {}
    432 
--> 433             epoch_train_loss = self.train_step(epoch)
    434             metrics["train_epoch_loss"] = epoch_train_loss
    435 

/nfs/users/pcosma/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/trainers/base_trainer/base_trainer.py in train_step(self, epoch)
    606 
    607             if epoch_loss != epoch_loss:
--> 608                 raise ArithmeticError("NaN detected in train loss")
    609 
    610             self.callback_handler.on_train_step_end(

ArithmeticError: NaN detected in train loss

I saw the same problem reported in issue #79, and it seems it has been fixed for SVAE.

Could you please check?

Thank you

D.

@clementchadebec
Owner

Hi @davidecarnevali,

Thank you for opening this issue. Changing the number of training epochs should not affect training for the VAE model. Have you tried decreasing the learning rate? At which epoch are you experiencing this issue? You can also try rescaling your data to [0, 1] to avoid high losses.

Let me know if any of these suggestions worked for you or if you are still experiencing this issue.

Best,

Clément
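
For reference, a minimal sketch of the learning-rate suggestion above, reusing the trainer configuration from the first snippet (the 1e-4 value is only an illustration, not a value recommended by the maintainer):

from pythae.trainers import BaseTrainerConfig

# Same configuration as in the original snippet, with only the learning rate lowered
my_training_config = BaseTrainerConfig(
    output_dir='Pythae_model',
    num_epochs=500,
    learning_rate=1e-4,  # reduced from 1e-3 to try to avoid the NaN train loss
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    train_dataloader_num_workers=2,
    eval_dataloader_num_workers=2,
    steps_saving=20,
    optimizer_cls="AdamW",
    optimizer_params={"weight_decay": 0.05, "betas": (0.91, 0.995)},
    scheduler_cls="ReduceLROnPlateau",
    scheduler_params={"patience": 5, "factor": 0.5}
)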

@davidecarnevali
Author

davidecarnevali commented Sep 19, 2023

Hi,
The PyTorch ToTensor transform already rescales the data to the [0, 1] range (a quick check is sketched after the traceback below).
The issue arises at a different epoch each run, sometimes the 4th, sometimes the 7th.
Changing the learning rate to 1e-4 now gives me another error, after 133 epochs:

---------------------------------------------------------------------------
OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_20761/1399248935.py in <module>
     31 pipeline(
     32     train_data=train_dataset, # must be torch.Tensor, np.array or torch datasets
---> 33     eval_data=eval_dataset # must be torch.Tensor, np.array or torch datasets
     34 )

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/pipelines/training.py in __call__(self, train_data, eval_data, callbacks)
    254 
    255         self.trainer = trainer
--> 256         trainer.train()

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/trainers/base_trainer/base_trainer.py in train(self, log_output_dir)
    459             metrics = {}
    460 
--> 461             epoch_train_loss = self.train_step(epoch)
    462             metrics["train_epoch_loss"] = epoch_train_loss
    463 

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/trainers/base_trainer/base_trainer.py in train_step(self, epoch)
    631             )
    632 
--> 633             self._optimizers_step(model_output)
    634 
    635             loss = model_output.loss

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/pythae/trainers/base_trainer/base_trainer.py in _optimizers_step(self, model_output)
    376         self.optimizer.zero_grad()
    377         loss.backward()
--> 378         self.optimizer.step()
    379 
    380     def _schedulers_step(self, metrics=None):

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/torch/optim/optimizer.py in wrapper(*args, **kwargs)
    138                 profile_name = "Optimizer.step#{}.step".format(obj.__class__.__name__)
    139                 with torch.autograd.profiler.record_function(profile_name):
--> 140                     out = func(*args, **kwargs)
    141                     obj._optimizer_step_code()
    142                     return out

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/torch/autograd/grad_mode.py in decorate_context(*args, **kwargs)
     25     def decorate_context(*args, **kwargs):
     26         with self.clone():
---> 27             return func(*args, **kwargs)
     28     return cast(F, decorate_context)
     29 

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/torch/optim/adamw.py in step(self, closure)
    174                   maximize=group['maximize'],
    175                   foreach=group['foreach'],
--> 176                   capturable=group['capturable'])
    177 
    178         return loss

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/torch/optim/adamw.py in adamw(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, foreach, capturable, amsgrad, beta1, beta2, lr, weight_decay, eps, maximize)
    230          eps=eps,
    231          maximize=maximize,
--> 232          capturable=capturable)
    233 
    234 

/nfs/users/dcarnevali/pyenvs/AINU/lib/python3.7/site-packages/torch/optim/adamw.py in _single_tensor_adamw(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, beta1, beta2, lr, weight_decay, eps, maximize, capturable)
    301         step = step_t.item()
    302 
--> 303         bias_correction1 = 1 - beta1 ** step
    304         bias_correction2 = 1 - beta2 ** step
    305 

OverflowError: (34, 'Numerical result out of range')
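
As a side note, a minimal sketch of the range check mentioned above, assuming the MyCustomDataset defined in the first snippet and that the DatasetOutput it returns can be indexed by key:

import torch

# Spot-check a few samples to confirm ToTensor really yields values in [0, 1];
# train_dataset is the MyCustomDataset instance from the first snippet.
samples = torch.stack([train_dataset[i]["data"] for i in range(8)])
print(samples.min().item(), samples.max().item())  # expected roughly 0.0 and 1.0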
