RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models

This repository contains the official implementation of RealCompo, our training-free text-to-image framework.

RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models
Xinchen Zhang*, Ling Yang*, Yaqi Cai, Zhaochen Yu, Kai-Ni Wang, Jiake Xie, Ye Tian, Minkai Xu, Yong Tang, Yujiu Yang, Bin Cui
Tsinghua University, Peking University, University of Science and Technology of China, Southeast University, PicUp.AI, Stanford University

Abstract: Diffusion models have achieved remarkable advancements in text-to-image generation. However, existing models still have many difficulties when faced with multiple-object compositional generation. In this paper, we propose RealCompo, a new training-free and transferred-friendly text-to-image generation framework, which aims to leverage the respective advantages of text-to-image models and spatial-aware image diffusion models (e.g., layout, keypoints and segmentation maps) to enhance both realism and compositionality of the generated images. An intuitive and novel balancer is proposed to dynamically balance the strengths of the two models in the denoising process, allowing plug-and-play use of any model without extra training. Extensive experiments show that our RealCompo consistently outperforms state-of-the-art text-to-image models and spatial-aware image diffusion models in multiple-object compositional generation while keeping satisfactory realism and compositionality of the generated images. Notably, our RealCompo can be seamlessly extended with a wide range of spatial-aware image diffusion models and stylized diffusion models.

Introduction

We introduce a new training-free and transferred-friendly text-to-image generation framework RealCompo that utilizes a novel balancer to achieve dynamic equilibrium between realism and compositionality in generated images.

New Updates

[2024.5] The main code for style-based and keypoint-based RealCompo is released.

[2024.2] The main code for layout-based RealCompo is released.

TODO

  • Release layout-based RealCompo (done)
  • Release style-based RealCompo (done)
  • Release keypoint-based RealCompo (done)
  • Release segmentation-based RealCompo (pending)

Gallery

Qualitative comparison of RealCompo with the text-to-image model Stable Diffusion v1.5 and the layout-to-image models GLIGEN and LMD+. Colored text highlights where RealCompo's results are stronger.

Extending RealCompo to keypoint- and segmentation-based text-to-image generation.

Extending RealCompo to stylized compositional generation.

Qualitative comparison of RealCompo's generalization across models. We select two T2I models (Stable Diffusion v1.5 and TokenCompose) and two L2I models (GLIGEN and Layout Guidance, LayGuide) and combine them in pairs to obtain four versions of RealCompo. RealCompo generalizes well across these combinations, achieving both high fidelity and precise alignment with the text prompts.

Installation

git clone https://github.com/YangLing0818/RealCompo
cd RealCompo
conda create -n RealCompo python==3.8.10
conda activate RealCompo
pip install -r requirements.txt

Download Models

We provide the code for RealCompo v1, which combines Stable Diffusion v1.5 and GLIGEN.

Download the GLIGEN checkpoint (HF Hub) and set its path in inference_layout.py.

Generating images with Layout-based RealCompo

Option 1: Use LLMs to reason out the layout

Generate results by running:

python inference_layout.py --user_prompt 'Two cute small corgi sitting in a movie theater with two popcorns in front of them.' --api_key 'put your api_key here' 

--user_prompt is the original prompt used to generate an image.

--api_key is required only if you use GPT-4 to infer the layout.

You can also use a local LLM to infer the layout. Generated samples are saved in generation_samples; see inference_layout.py for details on the interface.

generation_samples
├── generation_realcompo_v1_sd_gligen_two_cute_small_corgi_sitting_in_a_movie_theater_
│   ├── 0.png
│   ├── 1.png
│   └── ...
└── ...
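
To inspect what a run produced, the output tree above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `list_samples` is hypothetical, and it assumes only the layout shown above (one sub-directory per run, holding numbered PNG files).

```python
from pathlib import Path


def list_samples(root="generation_samples"):
    """Map each run directory under `root` to its PNG file names.

    Files are ordered by their numeric stem (0.png, 1.png, ..., 10.png);
    PNGs with non-numeric names are listed first.
    """
    samples = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return samples  # nothing has been generated yet
    for run_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        pngs = sorted(
            run_dir.glob("*.png"),
            key=lambda p: int(p.stem) if p.stem.isdigit() else -1,
        )
        samples[run_dir.name] = [p.name for p in pngs]
    return samples


if __name__ == "__main__":
    for run, images in list_samples().items():
        print(f"{run}: {len(images)} image(s)")
```

Sorting by the numeric stem keeps 10.png after 2.png, which a plain alphabetical sort would not.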

Option 2: Manually setting the layout

If you already have a layout for every object, you can run directly:

python inference_layout.py  --no_gpt --user_prompt 'Two cute small corgi sitting in a movie theater with two popcorns in front of them.' --object "['a cute small corgi', 'a cute small corgi', 'a movie theater', 'popcorn', 'popcorn']" --boundingbox "[[0.05, 0.05, 0.52, 0.58], [0.52, 0.05, 1.0, 0.58], [0.0, 0.0, 1, 1], [0.0, 0.6, 0.48, 0.95], [0.52, 0.6, 1, 0.95]]" --token_location "[4, 4, 9, 12, 12]"

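--no_gpt skips the LLM layout step; use it when you already have the layout.

--object lists the objects mentioned in the prompt.

--boundingbox gives the layout as one bounding box per object, in the same order as --object; in the example, each box is four coordinates normalized to [0, 1].

--token_location lists, for each object, the position in the prompt where that object appears (in the example, 'corgi' is the 4th word, 'theater' the 9th, and 'popcorns' the 12th).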
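
Because the three list arguments must line up one-to-one, it can help to check them before launching a run. The sketch below is not taken from inference_layout.py: the helper `parse_layout_args` is hypothetical, and it assumes each box is [x_min, y_min, x_max, y_max] in normalized coordinates, which is consistent with the example values above but not stated by the script.

```python
import ast


def parse_layout_args(objects_arg, boxes_arg, tokens_arg):
    """Parse the three list-valued strings and check that they are consistent.

    Assumption: each box is [x_min, y_min, x_max, y_max] with values in [0, 1].
    """
    # literal_eval only accepts Python literals, so it is safe on user input.
    objects = ast.literal_eval(objects_arg)
    boxes = ast.literal_eval(boxes_arg)
    tokens = ast.literal_eval(tokens_arg)

    if not (len(objects) == len(boxes) == len(tokens)):
        raise ValueError(
            f"length mismatch: {len(objects)} objects, "
            f"{len(boxes)} boxes, {len(tokens)} token locations"
        )
    for name, box in zip(objects, boxes):
        if len(box) != 4:
            raise ValueError(f"{name!r}: expected 4 coordinates, got {len(box)}")
        x0, y0, x1, y1 = box
        if not all(0.0 <= v <= 1.0 for v in box):
            raise ValueError(f"{name!r}: coordinates must lie in [0, 1], got {box}")
        if x0 >= x1 or y0 >= y1:
            raise ValueError(f"{name!r}: box has non-positive width or height: {box}")
    return objects, boxes, tokens


if __name__ == "__main__":
    objs, boxes, toks = parse_layout_args(
        "['a car', 'a house']",
        "[[0.0, 0.5, 1.0, 0.9], [0.1, 0.0, 1.0, 0.6]]",
        "[5, 11]",
    )
    print(list(zip(objs, boxes, toks)))
```

Each argument is passed to the script as a quoted string, so the same strings can be validated here and then reused on the command line unchanged.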

You can swap the T2I backbone for another model, such as Stable Diffusion v1.4 or TokenCompose.

The core code for updating the models' coefficients is in ldm/models/diffusion/plms.py. With slight modifications to this code, you can replace the L2I model with another one.
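
One way to picture how per-model coefficients can balance two denoisers is the toy example below. It is an illustrative sketch only, not the code in ldm/models/diffusion/plms.py: the function `balance`, the element-wise softmax weighting, and the use of plain Python lists in place of tensors are all assumptions made for the illustration, and the rule that updates the coefficients at each step is not shown.

```python
import math


def balance(eps_t2i, eps_l2i, coef_t2i, coef_l2i):
    """Blend two noise predictions element-wise.

    For each element, the two coefficients are normalized with a softmax,
    so the resulting weights are positive and sum to 1.
    """
    blended = []
    for e_a, e_b, c_a, c_b in zip(eps_t2i, eps_l2i, coef_t2i, coef_l2i):
        m = max(c_a, c_b)  # subtract the max for numerical stability
        w_a = math.exp(c_a - m)
        w_b = math.exp(c_b - m)
        blended.append((w_a * e_a + w_b * e_b) / (w_a + w_b))
    return blended


if __name__ == "__main__":
    # Equal coefficients give a plain average of the two predictions.
    print(balance([1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
```

Under this assumed scheme, raising one model's coefficient shifts the blended prediction toward that model, which is the kind of knob a balancer needs in order to trade realism against compositionality.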

Generating images with Style-based RealCompo

You can use RealCompo to achieve stylized compositional generation by running:

python inference_layout.py --no_gpt --style 'coloring-pages' --user_prompt 'Coloring page of a car park in front of a house.' --object "['a car', 'a house']" --boundingbox "[[0.0, 0.5, 1.0, 0.9], [0.1, 0.0, 1.0, 0.6]]" --token_location "[5, 11]"

--style selects the style of the generated image.

This code provides two styles: 'coloring-pages' and 'cuteyukimix'.

You can find more stylized T2I backbones on Civitai.

Generating images with Keypoint-based RealCompo

python inference_keypoint.py --user_prompt 'Elsa and Anna, sparks of magic between them, princess dress, background with sparkles, black purple red color schemes.' --token_location "[1, 3]"

--user_prompt is the original prompt used to generate an image.

--token_location lists, for each object, the position in the prompt where that object appears (in the example, 'Elsa' is the 1st word and 'Anna' the 3rd).
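
Counting positions by hand is error-prone for long prompts, so a small helper can compute them. This is a sketch under an assumption: positions are taken to be 1-based indices of whitespace-separated words, which matches every example in this README. The script itself may count tokenizer tokens instead, in which case a word that splits into several tokens would shift later positions. The name `token_locations` is hypothetical.

```python
import re


def token_locations(prompt, keywords):
    """Return the 1-based word position of each keyword's first occurrence.

    Assumption: positions count whitespace-separated words starting at 1;
    punctuation attached to a word is ignored and matching is case-insensitive.
    """
    words = [re.sub(r"[^\w-]", "", w).lower() for w in prompt.split()]
    locations = []
    for keyword in keywords:
        key = keyword.lower()
        if key not in words:
            raise ValueError(f"{keyword!r} not found in prompt")
        locations.append(words.index(key) + 1)
    return locations


if __name__ == "__main__":
    prompt = (
        "Elsa and Anna, sparks of magic between them, princess dress, "
        "background with sparkles, black purple red color schemes."
    )
    print(token_locations(prompt, ["Elsa", "Anna"]))
```

Only the first occurrence of each keyword is reported, so a prompt that names the same object twice still needs a manual check.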

You can change the backbone of the T2I model to various SDXL-based models.

Citation

@article{zhang2024realcompo,
  title={RealCompo: Balancing Realism and Compositionality Improves Text-to-Image Diffusion Models},
  author={Zhang, Xinchen and Yang, Ling and Cai, Yaqi and Yu, Zhaochen and Wang, Kaini and Xie, Jiake and Tian, Ye and Xu, Minkai and Tang, Yong and Yang, Yujiu and Cui, Bin},
  journal={arXiv preprint arXiv:2402.12908},
  year={2024}
}

Acknowledgements

This repo uses code from GLIGEN and LLM-groundedDiffusion. Thanks to the authors for their wonderful work and codebases!