Modality Translation through Conditional Encoder-Decoder

Team project in the "Machine Learning for Visual Understanding" (M3224.000100, Spring 2023) course.

Hyunsoo Lee, Yoonsang Lee, Maria Pak, Jinri Kim

Seoul National University

Abstract

With the recent rise of multi-modal learning, many of the developed models are task-specific, and, as a consequence, lack generalizability when applied to other types of downstream tasks. One of the representative models that overcomes this issue of generalizability is CLIP, which attacks downstream tasks using cosine similarity metric. However, CLIP has shown relatively low cosine similarity between text and image vector representations. For this reason, we aim to develop a new approach that more accurately maps the hyperplanes of text and image embeddings, and thus, achieves a high-quality text-image modality translation. To this end, we propose a new conditional encoder-decoder model that maps a latent space of one modality given another modality as a condition. We observe that our model is a general method that can be used with various latent encoders and decoders, which are not limited to multi-modal models. Experiments show that conditional encoder-decoder achieves comparable results with the previous state-of-the-art on several downstream tasks.

Data Preprocessing

MS-COCO

Dataset can be downloaded from here.

Command for training set preprocessing:

python preproc_datasets.py --source=../DATA/mscoco_data/train2014/ \
                           --annotation=../DATA/mscoco_data/annotations/captions_train2014.json \
                           --dest=COCO2014_train_CLIP_ViTL14.zip --width=256 --height=256 \
                           --transform=center-crop --emb_dim 768

The files at directory ../DATA/mscoco_data/ is like:

../DATA/mscoco_data/
  ├── train2014
  │   ├── COCO_train2014_000000000009.jpg
  │   ├── COCO_train2014_0000000000025.jpg
  │   └── ...
  ├── val2014
  │   ├── COCO_val2014_0000000000042.jpg
  │   ├── COCO_val2014_0000000000073.jpg
  │   └── ...
  └── annotations
      ├── captions_train2014.json
      ├── captions_val2014.json
      └── ...

CC3M

Dataset can be downloaded from here.

Command for training set preprocessing:

python preproc_datasets_cc3m.py --source=../DATA/cc3m_train/ \
                                --dest=CC3M_train_CLIP_ViTL14.zip --width=256 --height=256 \
                                --transform=center-crop --emb_dim 768

There are total 996 files in ../DATA/cc3m_train/. Directory structure is like:

../DATA/cc3m_train/
  ├── 00000_stats.json
  ├── 00000.parquet
  ├── 00000.tar
  ├── ...
  └── 00331.tar

Conditional Encoder-Decoder

Training

Command for training conditional encoder-decoder using MS-COCO dataset:

python train_latent_ddim.py --train_data_path "../DATA/COCO2014_train_CLIP_ViTB32.zip" \
                            --val_data_path "../DATA/COCO2014_val_CLIP_ViTB32.zip" \
                            --epochs 100 --batch_size 256 --learning_rate 1e-4 \
                            --log_name "exp1" --target "txt" --use_default_init \
                            --x_dim 512 --condition_dim 512 --lambda_2 2.0 --cfg_prob 0.05 \
                            --checkpoint_path <YOUR CHECKPOINT PATH>

Command for training conditional encoder-decoder using CC3M dataset:

python train_latent_ddim_cc3m.py --train_data_path "../DATA/CC3M_processed/CC3M_train_CLIP_ViTL14" \
                                 --val_data_path "../DATA/CC3M_processed/CC3M_val_CLIP_ViTL14.zip" \
                                 --epochs 100 --batch_size 256 --learning_rate 1e-4 \
                                 --log_name "exp1" --target "txt" --use_default_init \
                                 --x_dim 768 --condition_dim 768 --lambda_2 2.0 --cfg_prob 0.05 \
                                 --checkpoint_path <YOUR CHECKPOINT PATH>

DDP plugin is implemented, however using a single gpu is highly recommended.

Evaluation

Command for evaluating trained conditional encoder-decoder:

python eval_latent_ddim.py --train_data_path "../DATA/COCO2014_train_CLIP_ViTB32.zip" \
                           --val_data_path "../DATA/COCO2014_val_CLIP_ViTB32.zip" \
                           --saved_checkpoint_path <YOUR_CHECKPOINT_PATH> \
                           --batch_size 2048 --target "txt" --use_default_init \
                           --x_dim 768 --condition_dim 768 --lambda_2 2.0 --cfg_prob 0.05

Inference model using text prompt (Generate image embedding)

Command for python :

python sample_z.py  --train_data_path "../DATA/COCO2014_train_CLIP_ViTB32.zip" \
                    --val_data_path "../DATA/COCO2014_val_CLIP_ViTB32.zip" \
                    --cfg_prob 0.05 --use_default_init \
                    --target "img" --checkpoint_path <YOUR_CHECKPOINT_PATH> \
                    --text_query "A white boat floating on a lake under mountains"

Downstream tasks

Text-to-Image Generation
Image Retrieval
Image Captioning
Image Classification

Requirements

See requirements.txt.

pip install -r requirements.txt

Paper

Check in here.

Presentation

TBD

Acknowledgments

This repository is implementation based on Diffusion Autoencoders.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
assets		assets
diffusion		diffusion
model		model
README.md		README.md
choices.py		choices.py
config.py		config.py
config_base.py		config_base.py
dataset.py		dataset.py
dataset_predict_img.py		dataset_predict_img.py
dataset_predict_txt.py		dataset_predict_txt.py
dataset_util.py		dataset_util.py
dist_utils.py		dist_utils.py
eval_latent_ddim.py		eval_latent_ddim.py
image_retrieval.py		image_retrieval.py
preproc_datasets.py		preproc_datasets.py
preproc_datasets_cc3m.py		preproc_datasets_cc3m.py
renderer.py		renderer.py
requirements.txt		requirements.txt
run.sh		run.sh
sample_z.py		sample_z.py
seed_fix.py		seed_fix.py
templates.py		templates.py
templates_latent.py		templates_latent.py
train_latent_ddim.py		train_latent_ddim.py
train_latent_ddim_cc3m.py		train_latent_ddim_cc3m.py
train_loop.py		train_loop.py
train_loop_cc3m.py		train_loop_cc3m.py

Hleephilip/MLVU-project

Folders and files

Latest commit

History

Repository files navigation

Modality Translation through Conditional Encoder-Decoder

Abstract

Data Preprocessing

MS-COCO

CC3M

Conditional Encoder-Decoder

Training

Evaluation

Inference model using text prompt (Generate image embedding)

Downstream tasks

Requirements

Paper

Presentation

Acknowledgments

About

Topics

Resources

Stars

Watchers

Forks

Languages