RuntimeError: Detected that PyTorch and torchvision were compiled with different CUDA versions. PyTorch has CUDA Version=11.7 and torchvision has CUDA Version=11.6. Please reinstall the torchvision that matches your PyTorch install. #37

Open
G-force78 opened this issue Jan 12, 2023 · 12 comments

Comments

@G-force78

This happens when launching training.

It seems to be a common error with this, so not specific to this repo. Any ideas how to fix it?

@brian6091
Owner

brian6091 commented Jan 12, 2023

Where are you running the script? If you are using the notebook, does the error occur when you launch the training? Or somewhere before?

I only have access to Google Colab, where the CUDA versions seem to match:

Description: Ubuntu 18.04.6 LTS
diffusers==0.11.1
torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.0%2Bcu116-cp38-cp38-linux_x86_64.whl
transformers==4.25.1
xformers @ https://github.com/brian6091/xformers-wheels/releases/download/0.0.15.dev0%2B4c06c79/xformers-0.0.15.dev0+4c06c79.d20221205-cp38-cp38-linux_x86_64.whl

Copy-and-paste the text below in your GitHub issue

  • Accelerate version: 0.15.0
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.27
  • Python version: 3.8.16
  • Numpy version: 1.21.6
  • PyTorch version (GPU?): 1.13.0+cu116 (True)
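
A quick way to confirm the mismatch on your end (a minimal sketch, not part of the notebook; the reinstall command is just an example and the wheel index must match your PyTorch build):

# Compare the CUDA builds of torch and torchvision; the "+cuXXX" tags should match.
import torch
import torchvision

print("torch:", torch.__version__, "built for CUDA", torch.version.cuda)
print("torchvision:", torchvision.__version__)

# If they disagree (e.g. torch +cu117 vs torchvision +cu116), reinstalling torchvision
# from the wheel index that matches your PyTorch build is the usual fix, for example:
#   pip install --force-reinstall torchvision --extra-index-url https://download.pytorch.org/whl/cu117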

@G-force78
Author

G-force78 commented Jan 13, 2023

That's odd. Yeah, it happens when the actual training cell is launched. Maybe I have an outdated notebook; I'll try the recent one.
Very nice notebook to use, by the way.

@brian6091
Owner

Ok, I actually haven't tried the notebook on the main branch in a while. I will test tonight. Thanks for reporting.

@G-force78
Author

I think it needs updating and tweaking. I'm getting error after error from the training cell; nothing seems to be linked back to the previous cells where the parameters are chosen.

@brian6091
Owner

Are you referring to the notebook on the main branch?

@G-force78
Author

@brian6091
Owner

Ok thanks. I'll have a look today.

@brian6091
Owner

So I've fixed a couple of things and checked that the dependencies are all ok (at least on Google Colab). Please try the Notebook linked below. Two things:

  1. I maintain this version on a different branch (https://github.com/brian6091/Dreambooth/tree/v0.0.2), so keep that version in mind since I will pull in >800 commits to main this weekend.

  2. You need to run all the cells in sequence so that all the parameters are defined in the workspace. Skipping anything (except the tensorboard visualization cell) will cause an error.

Open In Colab

@G-force78
Author

Ok thanks, will give it a go

@G-force78
Author

G-force78 commented Jan 15, 2023

For some reason I got an out-of-memory error, even though fp16 and 8-bit Adam are enabled, as is gradient checkpointing.

Generating samples:   0% 0/4 [00:15<?, ?it/s]
Traceback (most recent call last):
  File "/content/Dreambooth/train.py", line 1110, in <module>
    main(args)
  File "/content/Dreambooth/train.py", line 1070, in main
    save_weights(global_step)
  File "/content/Dreambooth/train.py", line 977, in save_weights
    images = pipeline(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 546, in __call__
    image = self.decode_latents(latents)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 341, in decode_latents
    image = self.vae.decode(latents).sample
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 605, in decode
    decoded = self._decode(z).sample
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 577, in _decode
    dec = self.decoder(z)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/vae.py", line 217, in forward
    sample = up_block(sample)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_blocks.py", line 1691, in forward
    hidden_states = resnet(hidden_states, temb=None)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py", line 457, in forward
    hidden_states = self.norm1(hidden_states)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/normalization.py", line 273, in forward
    return F.group_norm(
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py", line 2528, in group_norm
    return torch.group_norm(input, num_groups, weight, bias, eps, torch.backends.cudnn.enabled)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 14.76 GiB total capacity; 12.85 GiB already allocated; 397.75 MiB free; 13.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:  33% 401/1200 [07:47<15:30,  1.16s/it, Loss/pred=0.0148, lr/text=3.75e-5, lr/unet=1.5e-6]
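
As an aside, the error message itself points at the allocator hint below. This is only a sketch of common mitigations (the 128 MiB value is an assumption), set before CUDA is initialized, and it may still not be enough on a ~16 GiB GPU:

# Allocator hint suggested by the error message; must be set before torch
# initializes CUDA (e.g. at the top of train.py, or exported as an env var).
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # value is an assumption

import torch
# Optionally release cached-but-unused blocks right before the sample-generation pass.
torch.cuda.empty_cache()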

@brian6091
Owner

Are train_batch_size and sample_batch_size both equal to 1? Can you post the args.json output here (it will be in your output_dir)? It OOMed at a weird step, so I'm not sure what's going on.
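
If you still have the files, something like this would pull out the two values (the path and key names here are assumptions based on the question above):

# Hypothetical inspection of the saved run arguments.
import json

with open("output_dir/args.json") as f:  # replace with your actual output_dir
    args = json.load(f)

print(args.get("train_batch_size"), args.get("sample_batch_size"))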

@G-force78
Author

They were, yes. I had already deleted the runtime by the time I saw this, so I lost my output dir.
