sophiah in https://github.com/booydar/LM-RMT #194
Comments
Hmm, I guess it's not an optimizer problem, but rather an issue in PyTorch autograd internals or in the training code (e.g. model, loss, etc.). I just found that a similar error occurs when the loss function runs on the CPU. Maybe some modules are not on the same device, or part of the graph is unreachable (so it cannot be backpropagated through). |
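One way to check the "not on the same device" and "unreachable graph" guesses above is to collect the devices of everything involved in a step and confirm the loss is attached to the autograd graph. This is only a minimal sketch: `devices_in_use` is a hypothetical helper, and the tiny `nn.Linear` model stands in for the real training code.

```python
import torch
from torch import nn


def devices_in_use(model: nn.Module, *tensors: torch.Tensor) -> set:
    """Collect the devices of all parameters, buffers, and extra tensors."""
    found = {p.device for p in model.parameters()}
    found |= {b.device for b in model.buffers()}
    found |= {t.device for t in tensors}
    return found


# Stand-in model and batch; replace with the real model, inputs, and loss.
model = nn.Linear(4, 2)
batch = torch.randn(3, 4)
loss = model(batch).pow(2).mean()

devices = devices_in_use(model, batch, loss)
print(devices)  # more than one entry means something lives on another device

# A loss without grad_fn is detached from the graph and cannot be backpropagated.
loss_is_attached = loss.requires_grad and loss.grad_fn is not None
print(loss_is_attached)
```

If the set has more than one device, or the loss is not attached, the problem is in the training code rather than in the optimizer.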
Strange that it triggers only after so many steps; that makes it look like a PyTorch/sync issue. Just wanted to say, if you are using cross-entropy loss (for LM), the SophiaG variant is more efficient (since it's just squaring the gradient, see https://github.com/Liuhong99/Sophia/blob/19f45d30723bbffcce3d18e4e858d95b0f36dbb6/sophia.py#L56). You can use it like so (not tested):
    hessian = list(map(lambda p: p.grad * p.grad, model.parameters()))
    opt.step(hessian=hessian)
This also skips the 2nd order gradient calculation, so it could resolve your issue.
EDIT: you also need to filter out the non-trainable & sparse parameters, so it would be more like:
    hessian = [p.grad*p.grad for p in model.parameters() if p.requires_grad and p.grad is not None and not p.grad.is_sparse]
    opt.step(hessian=hessian) |
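To see what the filter in the EDIT does, here is a small self-contained sketch with one frozen layer. The model and data are made up for illustration, and the optimizer call itself is left out because it was marked as untested above.

```python
import torch
from torch import nn

# Toy model: the first layer is frozen, the second is trainable.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
for p in model[0].parameters():
    p.requires_grad_(False)

loss = model(torch.randn(5, 4)).pow(2).mean()
loss.backward()

# Frozen parameters have p.grad is None, so the unfiltered version
# (p.grad * p.grad for every parameter) would raise a TypeError here.
hessian = [
    p.grad * p.grad
    for p in model.parameters()
    if p.requires_grad and p.grad is not None and not p.grad.is_sparse
]

print([tuple(h.shape) for h in hessian])
```

Only the weight and bias of the trainable layer survive the filter, and every entry is non-negative because it is a squared gradient.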
SophiaG worked, but the performance is not better than Adam, maybe because of the bias. So I want to try SophiaH, which doesn't have the bias. |
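For context on the "no bias" remark: the traceback in this issue shows SophiaH calling `compute_hutchinson_hessian`, which suggests a Hutchinson-style estimate of the Hessian diagonal. With Rademacher vectors z (entries ±1), the expectation of z ⊙ (Hz) equals diag(H). The sketch below checks this on a made-up 3×3 symmetric matrix by averaging over every possible z; `hutchinson_diag_exact` is a hypothetical name, and a real optimizer would sample a few random z instead of enumerating them all.

```python
from itertools import product


def hutchinson_diag_exact(H):
    """Average z * (H @ z) over every Rademacher vector z in {-1, +1}^n."""
    n = len(H)
    total = [0.0] * n
    count = 0
    for z in product((-1.0, 1.0), repeat=n):
        # Matrix-vector product H @ z, written out in plain Python.
        Hz = [sum(H[i][j] * z[j] for j in range(n)) for i in range(n)]
        for i in range(n):
            total[i] += z[i] * Hz[i]
        count += 1
    return [t / count for t in total]


# Made-up symmetric matrix standing in for a Hessian.
H = [
    [2.0, 0.5, -1.0],
    [0.5, 3.0, 0.25],
    [-1.0, 0.25, 4.0],
]
estimate = hutchinson_diag_exact(H)
print(estimate)  # matches the diagonal of H: the off-diagonal terms cancel
```

Each term is H[i][i] plus cross terms H[i][j] * z[i] * z[j], and the cross terms average to zero, which is why the estimate is unbiased for the diagonal.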
Some last things to check:
If this is all correct then it pretty much has to be a bug in pytorch (or the training code). |
I have been running into a similar error message. I've been trying to use SophiaH with Lightning AI's
If I iterate through every parameter group, and set
If I set If "requires_grad" is unset for ANY parameter group, I get the original error message. I am unsure how to proceed at this point, but I would greatly appreciate any advice you have to offer. |
Hello!
Here's an example.
import os
from torch import optim, nn, utils, Tensor
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor
import lightning.pytorch as pl
from pytorch_optimizer import SophiaH
from torch.optim import Optimizer

# define any number of nn.Modules (or use your current ones)
encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))


class LitAutoEncoder(pl.LightningModule):
    def __init__(self, encoder, decoder):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        # switch to manual optimization so the backward call can be customized
        self.automatic_optimization = False

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = nn.functional.mse_loss(x_hat, x)
        # important: create_graph=True keeps the graph so SophiaH can take second-order gradients
        self.manual_backward(loss, create_graph=True)
        opt.step()
        self.log("train_loss", loss)

    def configure_optimizers(self):
        return SophiaH(self.parameters())


dataset = MNIST(os.getcwd(), download=True, transform=ToTensor())
train_loader = utils.data.DataLoader(dataset)
autoencoder = LitAutoEncoder(encoder, decoder)
trainer = pl.Trainer(limit_train_batches=100, max_epochs=1)
trainer.fit(model=autoencoder, train_dataloaders=train_loader) |
Thank you for the quick response! I have applied your example to my own code (to the best of my ability), and while we're making progress, training bombs with a new error after reaching the first
I don't suspect this is the cause, but there is a warning at the beginning of training:
It may be relevant to know that I am using the Huggingface PEFT library for LoRA training. I don't suspect that is the issue either, since all that really does is add some extra layers to the model, and freeze all the other layers. I will troubleshoot some more when I get the chance. It's been a long day already, and I need to take a break. Thank you for the help thus far, and for maintaining such a useful library! |
Alright, well I was able to test your example MNIST code, and it does work. So I know this isn't an environment issue. I removed PEFT as well, and tried standard fine-tuning. I also tried a couple of different models (GPT-2 and GPT-Neo), from Huggingface Transformers library. All ran into the same problem with "tensors does not require grad and does not have a grad_fn". I'm sure the issue has to do with my training code. I'm carrying some legacy baggage, and I don't really have the proper skill set to know how to optimize manually (which is why I've relied on automatic_optimization until now). I haven't given up, but I probably am going to move on for now. I appreciate your help. |
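The "does not require grad and does not have a grad_fn" message can be reproduced outside any training framework, which may help narrow things down. The sketch below is plain PyTorch on a made-up two-element tensor, not the actual training code: it takes a Hessian-vector product with `torch.autograd.grad`, the same call that appears in the traceback of this issue, once with `create_graph=True` on the first backward pass and once without.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)


def hvp(create_graph: bool) -> torch.Tensor:
    """Hessian-vector product of sum(w**3) with a vector of ones."""
    loss = (w ** 3).sum()
    # First-order gradient; it only stays differentiable if create_graph=True.
    (grad,) = torch.autograd.grad(loss, w, create_graph=create_graph)
    z = torch.ones_like(w)
    # Second pass: differentiate (grad . z) with respect to w.
    (hz,) = torch.autograd.grad(grad, w, grad_outputs=z)
    return hz


result = hvp(create_graph=True)
print(result)  # analytically 6 * w * z

message = ""
try:
    hvp(create_graph=False)
except RuntimeError as err:
    message = str(err)
print(message)
```

If the first backward pass in the training loop runs without `create_graph=True` (for example through a wrapper that calls `loss.backward()` itself), the gradients carry no graph and the second-order step fails in this way.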
#params = 151111638
#non emb params = 41066400
| epoch 1 step 50 | 50 batches | lr 0.06 | ms/batch 1378.43 | loss 7.85 | ppl 2570.784
| epoch 1 step 100 | 100 batches | lr 0.06 | ms/batch 968.61 | loss 7.49 | ppl 1787.593
| epoch 1 step 150 | 150 batches | lr 0.06 | ms/batch 971.58 | loss 7.48 | ppl 1769.387
| epoch 1 step 200 | 200 batches | lr 0.06 | ms/batch 969.84 | loss 7.47 | ppl 1760.055
| epoch 1 step 250 | 250 batches | lr 0.06 | ms/batch 973.37 | loss 7.46 | ppl 1738.300
| epoch 1 step 300 | 300 batches | lr 0.06 | ms/batch 970.12 | loss 7.48 | ppl 1772.002
| epoch 1 step 350 | 350 batches | lr 0.06 | ms/batch 970.52 | loss 7.47 | ppl 1751.793
| epoch 1 step 400 | 400 batches | lr 0.06 | ms/batch 973.12 | loss 7.47 | ppl 1755.161
| epoch 1 step 450 | 450 batches | lr 0.06 | ms/batch 970.79 | loss 7.46 | ppl 1736.315
| epoch 1 step 500 | 500 batches | lr 0.06 | ms/batch 974.13 | loss 7.48 | ppl 1765.010
| epoch 1 step 550 | 550 batches | lr 0.06 | ms/batch 973.86 | loss 7.48 | ppl 1778.569
Traceback (most recent call last):
  File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 620, in <module>
    train()
  File "/home/notebook/code/personal/80306170/AGI/LM-RMT/pytorch/train.py", line 540, in train
    optimizer.step()
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/optimizer/sophia.py", line 92, in step
    self.compute_hutchinson_hessian(
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/pytorch_optimizer/base/optimizer.py", line 100, in compute_hutchinson_hessian
    h_zs = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=i < num_samples - 1)
  File "/opt/conda/envs/dsd/lib/python3.10/site-packages/torch/autograd/__init__.py", line 303, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: res[i].defined() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/functions/tensor.cpp":142, please report a bug to PyTorch.