
Stable Diffusion for studies

This is yet another Stable Diffusion compilation, aimed to be functional, clean and compact enough for various experiments. There's no GUI here, as the target audience is creative coders rather than post-Photoshop users. The latter may check out InvokeAI or AUTOMATIC1111 as convenient production tools, or Deforum for precisely controlled animations.

The code is based on the CompVis and Stability AI libraries and heavily borrows from this repo, with occasional additions from InvokeAI and Deforum, as well as others mentioned below. The following codebases are partially included here (to ensure compatibility and ease of setup): k-diffusion, Taming Transformers, OpenCLIP, CLIPseg. There is also a similar repo, based on the diffusers library, which is more logical and up-to-date.

Current functions:

  • Text to image
  • Image re- and in-painting
  • Latent interpolations (with text prompts and images)

Fine-tuning with your images:

  • Add a new subject (token) with textual inversion
  • Add a new subject with custom diffusion

Other features:

  • Memory-efficient with xformers (high-res output on a 6 GB VRAM GPU)
  • Use of special depth/inpainting and v2 models
  • Masking with text via CLIPseg
  • Weighted multi-prompts
  • to be continued..

More details and a Colab version will follow.

Setup

Install CUDA 11.6. Set up the Conda environment:

conda create -n SD python=3.10 numpy pillow 
conda activate SD
pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
pip install -r requirements.txt

Install the xformers library to increase performance. It makes it possible to run SD at any resolution on lower-grade hardware (e.g. video cards with 6 GB VRAM). If you're on Windows, first ensure that you have Visual Studio 2019 installed.

pip install git+https://github.com/facebookresearch/xformers.git

Download the models (Stable Diffusion 1.5, 1.5-inpaint, 2-inpaint, 2-depth, 2.1, 2.1-v, OpenCLIP, a custom VAE, CLIPseg, MiDaS; mostly converted to float16 for faster loading) with the command below. Licensing info is available on their webpages.

python download.py

Operations

Examples of usage:

  • Generate an image from the text prompt:
python src/_sdrun.py -t "hello world" --size 1024-576
  • Redraw an image with an existing style embedding:
python src/_sdrun.py -im _in/something.jpg -t "<line-art>"
  • Redraw a directory of images, keeping the basic forms intact:
python src/_sdrun.py -im _in/pix -t "neon light glow" --model v2d
  • Inpaint a directory of images with the RunwayML model, turning humans into robots (the text-to-mask step is sketched at the end of this section):
python src/_sdrun.py -im _in/pix --mask "human, person" -t "steampunk robot" --model 15i
  • Make a video, interpolating between the lines of a text file (an interpolation sketch follows these examples):
python src/latwalk.py -t yourfile.txt --size 1024-576
  • Same, with drawing over a masked image:
python src/latwalk.py -t yourfile.txt -im _in/pix/bench2.jpg --mask _in/pix/mask/bench2_mask.jpg 
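
Under the hood, such walks are typically made by interpolating the initial noise latents (and/or prompt embeddings) between keyframes, most often with spherical linear interpolation (slerp), which keeps the blended noise at a sensible magnitude. Below is a minimal sketch of the general technique, not the exact code of latwalk.py:

import torch

def slerp(t, v0, v1, eps=1e-7):
    # interpolate along the great circle between tensors v0 and v1, t in [0, 1]
    v0_n = v0 / (v0.norm() + eps)
    v1_n = v1 / (v1.norm() + eps)
    dot = (v0_n * v1_n).sum().clamp(-1., 1.)
    theta = torch.acos(dot)            # angle between the two flattened vectors
    if theta.abs() < 1e-4:             # nearly parallel => plain lerp is fine
        return (1. - t) * v0 + t * v1
    return (torch.sin((1. - t) * theta) * v0 + torch.sin(t * theta) * v1) / torch.sin(theta)

# e.g. 25 in-between noise latents for a 512x512 image (4x64x64 in latent space)
a, b = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
frames = [slerp(i / 24, a, b) for i in range(25)]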

Check other options by running these scripts with the --help option; try various models, samplers, noisers, etc.
Text prompts may include special tokens (e.g. <depthmap>) and weights (like good prompt :1 | also good prompt :1 | bad prompt :-0.5). The latter may degrade overall accuracy, though.
Interpolated videos may be further smoothed out with FILM.
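
For illustration, a weighted multi-prompt of that form could be parsed and blended roughly as follows. This is only a sketch: encode_text() is a hypothetical stand-in for the actual text-encoder call, and the real scripts may normalize weights differently.

import torch

def parse_weighted_prompt(prompt):
    # split "good :1 | bad :-0.5" into [('good', 1.0), ('bad', -0.5)]; default weight is 1
    pairs = []
    for chunk in prompt.split('|'):
        text, _, w = chunk.rpartition(':')
        try:
            pairs.append((text.strip(), float(w)))
        except ValueError:              # no ':<number>' suffix on this sub-prompt
            pairs.append((chunk.strip(), 1.))
    return pairs

def weighted_conditioning(prompt, encode_text):
    pairs = parse_weighted_prompt(prompt)
    conds = torch.stack([encode_text(text) for text, _ in pairs])    # e.g. [n, 77, 768]
    weights = torch.tensor([w for _, w in pairs]).view(-1, 1, 1)
    return (conds * weights).sum(0) / weights.sum().clamp(min=1e-6)  # normalized weighted sum

print(parse_weighted_prompt("good prompt :1 | also good prompt :1 | bad prompt :-0.5"))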

There are also Windows bat files that slightly simplify and automate the commands.
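
The text-driven masks used above (--mask "human, person") come from CLIPseg, which predicts a segmentation heatmap from an image and a text query. A rough sketch of the idea, using the Hugging Face transformers port of CLIPseg rather than the code bundled in this repo (the threshold value is illustrative):

import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation

processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")

def text_mask(image, prompt, threshold=0.35):
    # run CLIPseg on one image with one text query and return a binary PIL mask
    inputs = processor(text=[prompt], images=[image], padding="max_length", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits          # low-res (352x352) relevance heatmap
    probs = torch.sigmoid(logits).squeeze()
    mask = (probs > threshold).float().numpy() * 255
    return Image.fromarray(mask.astype("uint8")).resize(image.size)

# mask = text_mask(Image.open("_in/pix/bench2.jpg"), "human, person")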

Fine-tuning

  • Add a new subject (token) with textual inversion:
python src/train.py --token mycat1 --term cat --data data/mycat1
  • Add a new subject with custom diffusion (needs extra regularization images, see below):
python src/train.py --token mycat1 --term cat --data data/mycat1 --reg_data data/cat

Results of the training runs above will be saved under the train directory.

Custom diffusion trains faster and can achieve impressive reproduction quality with simple prompts close to the training subject, but it can lose the subject entirely if the prompt is too complex or strays from the original category. The result file is 73 MB (it can be compressed to ~16 MB). Note that custom diffusion needs both the target reference images (data/mycat1) and more random images of similar subjects (data/cat); apparently, the latter can be generated with SD itself.
Textual inversion is more generic but stable. Its embeddings can also be easily combined without additional retraining. The result file is ~5 KB.

  • Generate an image with an embedding from textual inversion (the underlying mechanism is sketched after these examples). You'll need to rename the embedding file to your trained token (e.g. mycat1.pt) and point the path to its directory. Note that the token is hardcoded in the file, so you can't change it afterwards.
python src/_sdrun.py -t "cosmic <mycat1> beast" --embeds train
  • Generate an image with an embedding from custom diffusion. You'll need to explicitly mention your new token (so you may name it differently here) and the path to the trained delta file:
python src/_sdrun.py -t "cosmic <mycat1> beast" --token_mod mycat1 --delta_ckpt train/delta-xxx.ckpt

You can also run python src/latwalk.py ... with these arguments to make animations.

Credits

It's quite hard to mention all those who made the current revolution in visual creativity possible. Check the inline links above for some of the sources. Huge respect to the people behind Stable Diffusion, InvokeAI, Deforum and the whole open-source movement.