
Add aspect ratio bucketing to training scripts #7908

Open
jferments opened this issue May 10, 2024 · 6 comments

@jferments

Is your feature request related to a problem? Please describe.
When fine-tuning SDXL, images are required to be a fixed size (1024x1024). This involves a lot of cropping, which both takes time and resources and often causes important parts of the image to get cropped out, lowering model quality.

Describe the solution you'd like.
The ideal solution would be a simple option for the user to enable aspect ratio bucketing (e.g. a command-line argument --enable-bucketing) that lets them train with multiple image sizes.
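
For example (hypothetical flag name, sketched as a standard argparse option like the ones the example scripts already define; this is not an existing diffusers option):

```python
import argparse

parser = argparse.ArgumentParser()
# Hypothetical flag -- shown only to illustrate how small the
# user-facing surface could be.
parser.add_argument(
    "--enable_bucketing",
    action="store_true",
    help="Group images into aspect-ratio buckets instead of cropping to a fixed square.",
)
args = parser.parse_args()
```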

@bghira (Contributor) commented May 10, 2024

that's really not something that can be added to these scripts without totally rewriting them.

it is the goal of https://github.com/bghira/simpletuner to provide a Diffusers-centric training toolkit that implements aspect bucketing and other optimisations, including data bucketing, pure-bf16 training, multi-gpu support, pre-training embed caching, and more.

@jferments (Author) commented May 10, 2024

Thanks! I will take a look at your source code for simpletuner, and see if that helps me understand how to do it. I'm still trying to wrap my head around the concepts surrounding bucketing and size/cropping considerations during training.

It is my understanding that aspect ratio bucketing / size conditioning were at the core of how SDXL was trained in the first place. In the SDXL paper, they say:

"Real-world datasets include images of widely varying sizes and aspect-ratios While the common output resolutions for text-to-image models are square images of 512 x 512 or 1024 x 1024 pixels, we argue that this is a rather unnatural choice, given the widespread distribution and use of landscape (e.g., 16:9) or portrait format screens. Motivated by this, we finetune our model to handle multiple aspect-ratios simultaneously: We follow common practice and partition the data into buckets of different aspect ratios, where we keep the pixel count as close to 1024² pixels as possibly, varying height and width accordingly in multiples of 64. [...] During optimization, a training batch is composed of images from the same bucket, and we alternate between bucket sizes for each training step. Additionally, the model receives the bucket size (or, target size) as a conditioning, represented as a tuple of integers C_ar=(h,w) which are embedded into a Fourier space in analogy to the size- and crop-conditionings described above. In practice, we apply multi-aspect training as a finetuning stage after pretraining the model at a fixed aspect-ratio and resolution and combine it with the conditioning techniques"

... so I am very surprised if there is really no way to easily do this with diffusers training for SDXL.
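
For concreteness, here is a minimal sketch of that bucket enumeration as I read it from the paper (my own illustration; parameter choices like the 10% pixel-count tolerance and the 2048 max side are assumptions, not from the paper):

```python
# Enumerate multi-aspect buckets as described in the SDXL paper:
# width and height in multiples of 64, pixel count close to 1024^2.
def make_buckets(target_pixels=1024 * 1024, step=64, max_side=2048, tolerance=0.1):
    buckets = []
    for h in range(step, max_side + step, step):
        for w in range(step, max_side + step, step):
            if abs(h * w - target_pixels) / target_pixels <= tolerance:
                buckets.append((h, w))
    return buckets

buckets = make_buckets()
# includes e.g. (1024, 1024), (832, 1216), (1216, 832), (768, 1344), ...
```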

If it's not possible, then I think that incorporating easy aspect ratio bucketing into diffusers would be a huge benefit to users of the library. It would make dataset management massively easier for anyone who has mixed size/resolution images they want to train SDXL on, and would improve model quality by removing the noise introduced by cropping errors.

I would be interested to explore what exactly would need to be changed to make this possible, because it seems to me like kind of a core feature needed to work with SDXL, and a lot of model quality/flexibility is lost by forcing users to crop images into squares.

@bghira (Contributor) commented May 11, 2024

it's been asked for (by me, even) but the consensus currently is that the example training scripts are just that - examples, and they can be forked and extended to add these features

the problem with aspect bucketing is that it's not trivial. images have to be the same size within a single batch, and for typical (e.g. single-subject dreambooth) finetuning on downstream tasks, the aspect buckets just aren't that important - especially for SDXL, which has additional microconditioning inputs at inference time that specify the aspect ratios you want.
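
for reference, those microconditioning inputs are exposed directly on the diffusers SDXL pipeline call, e.g.:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# original_size / target_size / crops_coords_top_left are the SDXL
# size- and crop-conditioning inputs from the paper.
image = pipe(
    "a scenic mountain lake, golden hour",
    height=768,
    width=1344,
    original_size=(768, 1344),
    target_size=(768, 1344),
    crops_coords_top_left=(0, 0),
).images[0]
```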

for very large training tasks where aspect bucketing makes sense, then you begin to run into scale issues which the example script is not designed for.

  • it will have to actually read in all of the images in order to generate an aspect bucket list, which is fine for small datasets, but you don't need bucketing for those anyway (see the sketch after this list)
  • it will have to either store that list somewhere or redo it on startup for every resumed training run
  • once you get aspect bucketing going in the training loop, you'll almost invariably want to add batching to the vae embed pre-cache task, and then to the text encoder inputs, because these will otherwise slow down the pre-processing of very large training runs
  • all of those caches have to be stored on disk somewhere and kept track of. you'll have to scan at startup to find the differences, and if any new images arrived, check that they get scanned too
  • any objects stored on disk will also need their random-crop coordinate augmentation values saved somewhere so that they can be reused consistently
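
to illustrate the first point: once a bucket list exists, the sampler must only ever yield batches drawn from a single bucket. a minimal sketch (my own illustration, not code from simpletuner):

```python
import random
from collections import defaultdict

# Group dataset indices by assigned bucket, then yield batches that
# never mix buckets (all images in a batch must share a shape).
class BucketBatchSampler:
    def __init__(self, bucket_for_index, batch_size, seed=0):
        self.buckets = defaultdict(list)
        for idx, bucket in enumerate(bucket_for_index):
            self.buckets[bucket].append(idx)
        self.batch_size = batch_size
        self.rng = random.Random(seed)

    def __iter__(self):
        batches = []
        for indices in self.buckets.values():
            self.rng.shuffle(indices)
            for i in range(0, len(indices), self.batch_size):
                batch = indices[i : i + self.batch_size]
                if len(batch) == self.batch_size:  # drop ragged tail batches
                    batches.append(batch)
        self.rng.shuffle(batches)  # alternate between buckets across steps
        yield from batches
```

a torch DataLoader would consume this via its batch_sampler argument.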

it's a hard problem, and at this point i understand why it's not yet solved in the example training scripts. but that doesn't rule out a future project from the team that would essentially create a transformers-like Trainer module which can do these kinds of data pipeline tasks efficiently and reliably.

@jferments (Author) commented May 11, 2024

I really appreciate you taking the time to explain the reasoning behind this. There is not a lot of information available online about how aspect ratio bucketing works, so it is hard to anticipate the challenges involved. I would be interested in contributing however I can to writing training scripts that incorporate a simple bucketing scheme.

I feel like some of the issues you mentioned could be mitigated by just operating on the assumption that the user will not modify the training set between runs (making this clear in documentation/comments), and letting the users deal with caching, etc.

The bucketing script would ONLY deal with the bucketing process: simply scanning image metadata to get resolutions and placing images into a fixed set of buckets based on the ones that were originally used to train SDXL, i.e.:
[Screenshot: the table of multi-aspect bucket resolutions from the SDXL paper]
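
In code, using a fixed bucket table could be as simple as this sketch (the bucket list below is an illustrative subset, not the paper's full table):

```python
# A few of the SDXL multi-aspect buckets as (height, width); the full
# table has many more entries, all near 1024^2 pixels in multiples of 64.
SDXL_BUCKETS = [
    (1024, 1024), (896, 1152), (1152, 896), (832, 1216),
    (1216, 832), (768, 1344), (1344, 768), (640, 1536), (1536, 640),
]

def assign_bucket(height, width):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ar = width / height
    return min(SDXL_BUCKETS, key=lambda b: abs(b[1] / b[0] - ar))

print(assign_bucket(1080, 1920))  # a 16:9 image -> (768, 1344)
```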

So in short, what if we made the following assumptions:

(a) the user cannot modify the training set after buckets are assigned without breakage
(b) the user handles any caching/state-saving logic on their own, unless they want to redo it with each run
(c) we just use the bucket sizes that were originally used to train SDXL instead of having complex code to determine dataset-specific bucket classes

Would that not make it much simpler to implement a basic bucketing scheme and address your concerns about complexity of handling caching, etc?

You are right that this would still mean running through the entire dataset for each training run, but I don't think it would be that costly to just read image metadata to extract dimensions. For instance, right now I have a ~750,000 image dataset that I'm trying to use as training data for a full SDXL fine-tune. The time it will take to scan the image dimensions and sort them into buckets will be no more than the cost of processing all 750k images to convert them into 1024x1024 squares. And if the user wants, the result of this bucketing process could easily be stored on disk (mapping of files to buckets, cropping/scaling info, etc.), with the understanding that if they modify the training dataset afterwards, the buckets would have to be regenerated.
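
For what it's worth, extracting dimensions is cheap because PIL reads only the file header for .size; a sketch of such a scan (illustrative, assuming a folder tree of common image formats):

```python
from pathlib import Path
from PIL import Image

# PIL's Image.open is lazy: .size comes from the file header, so no
# pixel data is decoded. The scan is I/O-bound even for ~750k files.
def scan_dimensions(root):
    dims = {}
    for path in Path(root).rglob("*"):
        if path.suffix.lower() in {".jpg", ".jpeg", ".png", ".webp"}:
            with Image.open(path) as img:
                dims[str(path)] = img.size  # (width, height)
    return dims
```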

I am new to this, and am aware that I might be missing something, or that something you're saying might be going over my head. I am just trying to understand the situation better, so that I can hopefully contribute to writing some code that might be helpful to others like myself who just want a very basic bucketing scheme available.

@bghira (Contributor) commented May 11, 2024

all of the stuff you describe is already in kohya trainer or simpletuner, and i promise you it's really not something the diffusers project is currently interested in working on. @patil-suraj and @sayakpaul can elaborate

all of the assumptions you want to make end up being really difficult to work with. i know this because simpletuner has options to preserve caches and all of that

@bghira (Contributor) commented May 11, 2024

the square crops can be generated on-the-fly and you don't have to scan the whole dataset to know the true image sizes 🤷 because they are all the same aspect ratio, 1.0
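
i.e. something like this torchvision preprocessing, applied per-sample at load time (a sketch of the usual approach, not a quote from the scripts):

```python
from torchvision import transforms

# Resize the short side to 1024 and crop a 1024x1024 square per sample;
# no dataset-wide size scan is needed for the fixed-square case.
train_transforms = transforms.Compose([
    transforms.Resize(1024, interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.CenterCrop(1024),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])
```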
