[Bug]: Error when starting training #275
Your diffusers version looks wrong. The output of "pip freeze" in your venv should be something like:

How did you actually install OneTrainer?
Wow, that was a fast reply, thanks for that! I installed it by cloning the repository (git clone https://github.com/Nerogar/OneTrainer.git) and then running the install.bat file. I've also tried installing it with a 1-click installer (StabilityMatrix), and the same issue arises. The requirements-global.txt file does state the same line you've shared ("-e git+https://github.com/huggingface/diffusers.git@5d848ec#egg=diffusers"), but I'm not quite sure how to fix this, as each time I install it seems to end up as "diffusers==0.27.2" in pip freeze. Any thoughts?

EDIT: Never mind, I managed to install diffusers the proper way. In case anyone's got the same issue: I activated the venv in the OneTrainer folder, ran "pip uninstall diffusers", then ran "pip install -e git+https://github.com/huggingface/diffusers.git@5d848ec#egg=diffusers". Although this fixes the pip freeze (which now contains "-e git+https://github.com/huggingface/diffusers.git@5d848ec#egg=diffusers" instead of "diffusers==0.27.2"), I still get a similar error when I try to run the training process:

Error log
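As a side note, a quick way to confirm which diffusers build a given interpreter actually resolves is to query the installed package metadata. This is just a generic sketch (run it with the venv's python; nothing here is OneTrainer-specific):

```python
from importlib import metadata

def installed_version(package: str) -> str:
    """Report the version recorded in the installed package metadata, if any."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"

# Run with the venv's interpreter; if a different interpreter answers,
# pip freeze was likely taken from the wrong environment.
print("diffusers:", installed_version("diffusers"))
```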
The "pip freeze" still looks suspect to me. Is that the pip freeze from inside the venv? I'm seeing stuff there that wouldn't have been installed by the requirements.txt, like triton on Windows.
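An easy way to check whether a given interpreter is the venv's one is to compare its prefixes (a small generic sketch, not part of OneTrainer):

```python
import sys

def in_virtualenv() -> bool:
    """A virtual environment's interpreter has a prefix distinct from the base install."""
    return sys.prefix != sys.base_prefix

# If this prints False, "pip freeze" from the same interpreter is listing
# globally installed packages, not the venv's.
print(in_virtualenv())
```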
For reference, this is my own "venv\scripts\pip.exe freeze". Note the different torch version, which might also be your problem:
Again, thanks for the reply mx, appreciate the help! Sorry, I did take the pip freeze in my first post from the incorrect place. I took it again in the correct location and had a bunch of libraries missing compared to your list, so I've installed the missing ones manually. My new list is below, and it matches yours exactly. The same error still persists, though. Short of doing a full Windows reinstall, is there anything else I could do?
Can you try using a different checkpoint? I'm wondering if what's happening here is just that the checkpoint you are using got corrupted somewhere along the line and now it just can't be loaded.
Just tried 3 other SDXL models; exact same error message (just with the updated model names).
I'm honestly not sure what's going on here. I haven't seen this with any other person using SDXL, and your config has no obvious red flags to me; you've specified the model in the correct place. Can you upload the very latest config.json you're using (made by clicking the "Export" button) from the latest SDXL model you tried? I doubt it, but maybe looking at two different configs will make something click.
Config file exported from the GUI here: |
I'm having the same issue. It happens with every custom checkpoint, and it started happening after commit 52520c6 ("Merge branch 'universal_embeddings'", 2024-04-16). It seems like it's trying to pull v1-inference.yaml from GitHub over and over (instead of using the local model_config like it used to), and if it can't get it, it crashes with this error.

I was able to temporarily get it working by disabling the firewall, but it crashes again every time you start training without giving OneTrainer unlimited internet access. I'm not 100% sure it's the exact same issue, but it does seem to be related.
Nice debugging, and nice find! Talk about an unusual combination of circumstances. Fix forthcoming.
@supermachine77 can OneTrainer access the internet on your machine? This might not be the exact same issue, but it could be related. And do you know if the issue existed before, or is it something new?

I've analyzed this a bit, and I'm a bit confused. The safetensors files don't include a tokenizer, so to load them, a tokenizer needs to be downloaded from the internet first. By default, it tries to use the one from Hugging Face called "openai/clip-vit-large-patch14". If you have already downloaded that into your huggingface cache, it will use the cached version. I don't think this has ever worked in the past without at least some kind of internet connection.
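The cache-then-download fallback described above can be sketched roughly like this (the hub's on-disk layout of "models--{org}--{name}" directories is the usual huggingface cache convention; this is an illustration, not OneTrainer's actual code):

```python
import os

def resolve_repo(repo_id: str, cache_dir: str) -> str:
    """Prefer a previously downloaded copy in the huggingface cache;
    otherwise signal that a network download is needed."""
    snapshot = os.path.join(cache_dir, "models--" + repo_id.replace("/", "--"))
    if os.path.isdir(snapshot):
        return "cache:" + snapshot
    return "download:" + repo_id
```

So with no cached copy of openai/clip-vit-large-patch14 and no network access, the tokenizer load has nothing to fall back on.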
@mookiexl I couldn't reproduce your problem. It's always loading the local v1-inference.yaml file for me. Can you attach a debugger, or put some print statements in your code? Specifically, I'm interested in the value of model.sd_config_filename; if this is None, it will cause the failure you're seeing.
Ok, so apparently it happens only when training a LoRA; finetune works. Clean install from scratch, clean venv, changed preset to "SD 1.5 LoRA", everything else left at default.

- Using default runwayml/stable-diffusion-v1-5, firewall disabled: works. model.sd_config_filename is None (in StableDiffusionLoRAModelLoader.py)
With LoRA training I can reproduce the issue. Let me think about a solution.
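One possible direction, as a hypothetical sketch only (the attribute name follows the debugging above, but the fallback path and function are assumptions, not the actual fix that landed):

```python
def pick_sd_config(sd_config_filename, default="resources/v1-inference.yaml"):
    """Fall back to a bundled config file when the loader recorded none,
    instead of letting a None path trigger a download attempt."""
    if sd_config_filename is None:
        return default
    return sd_config_filename
```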
Just noticed that @supermachine77 seems to be doing finetuning, based on "training_method": "FINE_TUNE", so there could also be a separate problem that leads to the same "ValueError: Calling CLIPTokenizer.from_pretrained() with the path to a single file or url ..." error.
@mookiexl your issue should be solved now |
Yes, it seems to work for me now. Thanks. |
@supermachine77 Did the above fixes resolve your issue? Does it still happen at the latest HEAD? |
@mx - will test it and let you know later today, fingers crossed it's resolved!

Update: Pulled the latest update (via the update.bat file), and unfortunately I get the exact same error message. I've also tried a fresh pull of the latest version into a new folder, and no luck there either. Regarding internet access, yup, OneTrainer should be able to access it fine. I've tried running it with my firewall turned off, but same issue. Anything else I could try sharing to make the debugging process a little easier?
Found this thread as I'm having the same issue with loading models.

Models that work (non-inpainting):

Does this look like the same issue or something different?
I'm getting an error each time I try to run the training process. For context, I'm using Windows 11 with an RTX 3090, and both A1111 and ComfyUI work fine on my machine.

OneTrainer installs fine without any errors, but I get an error when I try to start the actual training process inside the GUI. I've tried reinstalling it, but that doesn't fix the issue. I've also tried upgrading transformers from 4.36.2 to 4.40.1 (per the attached pip freeze log), but that doesn't seem to fix it either, so I doubt the issue is related to that. I would really appreciate some guidance, as I've searched past bug reports and Google but can't find anything concrete that helps.
Config

I'm using the following config:

Error log output

Output of pip freeze:
absl-py==2.1.0
accelerate==0.27.2
aiofiles==23.2.1
aiohttp==3.9.5
aiosignal==1.3.1
altair==5.3.0
annotated-types==0.6.0
annoy-fixed==1.16.3
antlr4-python3-runtime==4.9.3
anyio==4.3.0
appdirs==1.4.4
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
beautifulsoup4==4.12.2
bidict==0.23.1
bitsandbytes==0.43.0
blinker==1.7.0
braceexpand==0.1.7
cachetools==5.3.3
certifi==2024.2.2
charset-normalizer==3.3.2
clean-fid==0.1.35
click==8.1.7
clip-anytorch==2.6.0
colorama==0.4.6
coloredlogs==15.0.1
contourpy==1.2.1
cycler==0.12.1
dadaptation==3.1
dctorch==0.1.2
decorator==4.4.2
diffusers==0.27.2
distlib==0.3.8
distro==1.9.0
docker-pycreds==0.4.0
easydict==1.10
easygui==0.98.3
einops==0.7.0
einops-exts==0.0.4
entrypoints==0.4
exceptiongroup==1.2.1
face-alignment==1.4.1
facexlib==0.3.0
fairscale==0.4.13
faiss-cpu==1.7.4
fastapi==0.110.2
ffmpeg-progress-yield==0.7.8
ffmpy==0.3.2
filelock==3.13.4
filetype==1.2.0
filterpy==1.4.5
Flask==2.3.2
Flask-SocketIO==5.3.4
flatbuffers==24.3.25
fonttools==4.51.0
frozenlist==1.4.1
fsspec==2024.3.1
ftfy==6.2.0
gast==0.5.4
gitdb==4.0.11
GitPython==3.1.43
google-pasta==0.2.0
gradio==4.19.0
gradio_client==0.10.0
gradio_imageslider==0.0.20
grpcio==1.62.2
h11==0.14.0
h5py==3.11.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.22.2
humanfriendly==10.0
idna==3.7
imageio==2.34.1
imageio-ffmpeg==0.4.9
imagesize==1.4.1
importlib_metadata==7.1.0
importlib_resources==6.4.0
invisible-watermark==0.2.0
itsdangerous==2.1.2
Jinja2==3.1.3
joblib==1.4.0
jsonmerge==1.9.2
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
k-diffusion==0.1.1.post1
keras==3.2.1
kiwisolver==1.4.5
kornia==0.7.1
lazy_loader==0.4
libclang==18.1.1
lightning-utilities==0.11.2
lion-pytorch==0.0.6
llvmlite==0.42.0
lycoris_lora==2.2.0.post3
Markdown==3.5.2
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.3
mdurl==0.1.2
ml-dtypes==0.3.2
moviepy==1.0.3
mpmath==1.3.0
multidict==6.0.5
namex==0.0.8
networkx==3.3
ninja==1.11.1.1
numba==0.59.1
numpy==1.26.4
nvidia-ml-py==12.535.161
nvitop==1.3.2
omegaconf==2.3.0
onnx==1.15.0
onnxruntime-gpu==1.17.1
open-clip-torch==2.24.0
openai==1.3.3
openai-clip==1.0.1
opencv-python==4.9.0.80
opt-einsum==3.3.0
optree==0.11.0
orjson==3.10.1
packaging==24.0
pandas==2.2.1
pathtools==0.1.2
pillow==10.2.0
platformdirs==3.11.0
prodigyopt==1.0
proglog==0.1.10
protobuf==4.25.3
psutil==5.9.8
pydantic==2.7.0
pydantic_core==2.18.1
pydub==0.25.1
Pygments==2.17.2
pyparsing==3.1.2
pypdfium2==4.27.0
pyreadline3==3.4.1
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-engineio==4.9.0
python-multipart==0.0.9
python-socketio==5.11.2
pytorch-lightning==2.2.2
pytz==2024.1
PyWavelets==1.6.0
PyYAML==6.0.1
referencing==0.34.0
regex==2024.4.16
requests==2.31.0
rich==13.7.1
rpds-py==0.18.0
ruff==0.4.1
safetensors==0.4.3
scikit-image==0.23.2
scikit-learn==1.4.1.post1
scipy==1.12.0
semantic-version==2.10.0
semantra==0.1.8
sentencepiece==0.2.0
sentry-sdk==1.45.0
setproctitle==1.3.3
shellingham==1.5.4
simple-websocket==1.0.0
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
soupsieve==2.5
starlette==0.37.2
sympy==1.12
tenacity==8.2.2
tensorboard==2.16.2
tensorboard-data-server==0.7.2
tensorflow==2.16.1
tensorflow-intel==2.16.1
tensorflow-io-gcs-filesystem==0.31.0
termcolor==2.4.0
threadpoolctl==3.4.0
tifffile==2024.4.18
tiktoken==0.4.0
timm==0.9.16
tk==0.1.0
tokenizers==0.19.1
toml==0.10.2
tomlkit==0.12.0
toolz==0.12.1
torch==2.2.0+cu121
torchaudio==2.2.0+cu121
torchdiffeq==0.2.3
torchmetrics==1.3.2
torchsde==0.2.6
torchvision==0.17.0+cu121
tqdm==4.66.2
trampoline==0.1.2
transformers==4.40.1
triton @ https://huggingface.co/MonsterMMORPG/SECourses/resolve/main/triton-2.1.0-cp310-cp310-win_amd64.whl
typer==0.12.3
typing_extensions==4.11.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.28.0
virtualenv==20.23.0
voluptuous==0.13.1
wandb==0.16.4
wcwidth==0.2.13
webdataset==0.2.86
websockets==11.0.3
Werkzeug==2.3.6
Wikipedia-API==0.6.0
windows-curses==2.3.2
wrapt==1.16.0
wsproto==1.2.0
xformers==0.0.24
yarl==1.9.4
zipp==3.18.1