📖 Paper | 🤗 Demo | 🤖 ModelScope | Checkpoints | Datasets

ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.

ONE-PEACE combines a scaling-friendly architecture with modality-agnostic pretraining tasks, which gives it the potential to expand to unlimited modalities.

Online Demo

We provide an online demo on Hugging Face Spaces. In this demo, you can combine multiple modalities to retrieve related images, for example audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image.
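
One simple way such a multi-modal query could be formed is to sum the normalized embeddings of the query modalities, renormalize, and rank candidate images by dot product. The sketch below illustrates that idea with toy 3-dimensional vectors in plain Python. The fusion-by-summation rule, the helper names (`normalize`, `fuse`, `rank_images`), the file names, and the vectors are all illustrative assumptions, not the demo's confirmed implementation.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def fuse(*embeddings):
    """Assumed fusion rule: element-wise sum of the inputs, then renormalize."""
    summed = [sum(values) for values in zip(*embeddings)]
    return normalize(summed)

def rank_images(query, image_embeddings):
    """Return image names sorted by dot-product similarity to the query."""
    scores = {
        name: sum(q * v for q, v in zip(query, emb))
        for name, emb in image_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

# Toy stand-ins for embeddings that would come from the model.
audio_query = normalize([1.0, 0.2, 0.0])   # e.g. a barking sound
text_query = normalize([0.0, 1.0, 0.1])    # e.g. the word "beach"
gallery = {
    "dog_on_beach.jpg": normalize([0.7, 0.7, 0.0]),
    "dog_indoors.jpg": normalize([1.0, 0.0, 0.0]),
    "empty_beach.jpg": normalize([0.0, 0.0, 1.0]),
}

query = fuse(audio_query, text_query)
ranking = rank_images(query, gallery)
print(ranking)
```

With real data, the toy vectors would be replaced by features extracted with the embedding API described under Usage.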

News

2023.7.20: Released the visual grounding API, which you can use to locate objects in a picture.
2023.6.23: Released vision tasks fine-tuning scripts and checkpoints. See guidance for vision tasks for more details.
2023.6.04: Released the pretraining scripts. See guidance for pretraining for more details.
2023.5.30: Released the finetuned checkpoints and scripts for audio(-language) tasks.
2023.5.29: Released the finetuned checkpoints for vision-language tasks.
2023.5.27: 🔥 We have provided the multimodal retrieval demo on Hugging Face Spaces. Have fun!
2023.5.25: Released the multimodal embedding API, which enables quick extraction of image, audio, and text representations.
2023.5.23: Released the pretrained checkpoint, as well as finetuning & inference scripts for vision-language tasks.
2023.5.19: Released the paper and code. Pretrained & finetuned checkpoints, training & inference scripts, as well as demos will be released as soon as possible.

Models and Results

Model Card

We list the parameters and pretrained checkpoints of ONE-PEACE below. Note that ONE-PEACE can be disassembled into different branches to handle different tasks. We also provide the vision branch of ONE-PEACE, which can be used on its own for vision tasks.

Model	Ckpt	Params	Hidden size	Intermediate size	Attention heads	Layers
ONE-PEACE	Download	4B	1536	6144	24	40
ONE-PEACE (Vision Branch)	Download	1.5B	1536	6144	24	40
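
As a rough sanity check, the parameter counts in the table can be approximated from the hidden size, intermediate size, and layer count. The sketch below assumes that each Transformer layer holds four hidden×hidden attention projections plus gated feed-forward networks with three hidden×intermediate matrices each, with one FFN per layer in the vision branch and three modality-specific FFNs per layer in the full model. These structural details are assumptions made for illustration; embeddings, adapters, biases, and normalization layers are ignored.

```python
HIDDEN, INTERMEDIATE, LAYERS = 1536, 6144, 40  # values from the table above

def layer_params(num_ffns):
    """Approximate weight count of one layer (assumed layout, see lead-in)."""
    attention = 4 * HIDDEN * HIDDEN        # query, key, value, output projections
    gated_ffn = 3 * HIDDEN * INTERMEDIATE  # two input projections + one output projection
    return attention + num_ffns * gated_ffn

vision_branch = LAYERS * layer_params(num_ffns=1)
full_model = LAYERS * layer_params(num_ffns=3)

print(f"vision branch ~ {vision_branch / 1e9:.2f}B parameters")
print(f"full model    ~ {full_model / 1e9:.2f}B parameters")
```

Under these assumptions the estimates come to about 1.51B and 3.77B, in line with the 1.5B and 4B listed in the table.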

Results

Vision Tasks

Task	Image Classification	Semantic Segmentation	Object Detection (w/o Objects365)	Video Action Recognition
Dataset	ImageNet-1K	ADE20K	COCO	Kinetics-400
Split	val	val	val	val
Metric	Acc.	mIoU^ss / mIoU^ms	AP^box / AP^mask	Top-1 Acc. / Top-5 Acc.
ONE-PEACE	89.8	62.0 / 63.0	60.4 / 52.9	88.1 / 97.8

Audio Tasks

Task	Audio-Text Retrieval				Audio Classification			Audio Question Answering
Dataset	AudioCaps		Clotho		ESC-50	FSD50K	VGGSound (Audio-Visual)	AVQA
Split	test		evaluation		full	eval	test	val
Metric	T2A R@1	A2T R@1	T2A R@1	A2T R@1	Zero-shot Acc.	MAP	Acc.	Acc.
ONE-PEACE	42.5	51.0	22.4	27.1	91.8	69.7	68.2	92.2

Vision-Language Tasks

Task	Image-Text Retrieval (w/o ranking)				Visual Grounding			VQA	Visual Reasoning
Dataset	COCO		Flickr30K		RefCOCO	RefCOCO+	RefCOCOg	VQAv2	NLVR2
Split	test		test		val / testA / testB	val / testA / testB	val-u / test-u	test-dev / test-std	dev / test-P
Metric	I2T R@1	T2I R@1	I2T R@1	T2I R@1	Acc@0.5			Acc.	Acc.
ONE-PEACE	84.1	65.4	97.6	89.6	92.58 / 94.18 / 89.26	88.77 / 92.21 / 83.23	89.22 / 89.27	82.6 / 82.5	87.8 / 88.3

Requirements and Installation

3.6 <= Python <= 3.10
PyTorch >= 1.10.0 (1.13.1 recommended)
CUDA >= 10.2 (11.6 recommended)
Install required packages:

git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
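
If the installation fails, it can help to first confirm that the interpreter falls inside the supported Python range listed above. The following is a minimal sketch using only the standard library; the helper name `python_supported` is hypothetical and not part of the repository.

```python
import sys

MIN_PYTHON = (3, 6)   # lower bound from the requirements list
MAX_PYTHON = (3, 10)  # upper bound from the requirements list (any 3.10.x patch release)

def python_supported(version_info=None):
    """Return True if the (major, minor) version lies in the supported range."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return MIN_PYTHON <= major_minor <= MAX_PYTHON

if python_supported():
    print("Python version is within the supported range.")
else:
    print("Python {}.{} is outside the supported range 3.6 to 3.10.".format(*sys.version_info[:2]))
```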

For faster training, install the Apex library (optional):

git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./

Install the xFormers library to use memory-efficient attention (optional):

conda install xformers -c xformers

Install the FlashAttention library to use faster LayerNorm (optional):

git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .

Datasets and Checkpoints

See datasets.md and checkpoints.md.

Usage

API

We provide simple code snippets that show how to use the ONE-PEACE API.

Multi-modal Embedding

We use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:

import torch
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with a checkpoint path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")

# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["assets/dog.JPEG", "assets/elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["assets/cow.flac", "assets/dog.flac"])

with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)

    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T

print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
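
Because the extracted features are normalized, each entry of the similarity matrices is a cosine similarity, and the largest value in a row indicates the best-matching text for that image or audio clip. The sketch below shows one way to read off those matches from a matrix converted to nested lists (for example with `i2t_similarity.tolist()`). The helper name `best_matches` and the similarity values are made up for illustration and are not model outputs.

```python
def best_matches(similarity, labels):
    """For each row, return the label with the highest similarity and its score."""
    results = []
    for row in similarity:
        best_index = max(range(len(row)), key=row.__getitem__)
        results.append((labels[best_index], row[best_index]))
    return results

labels = ["cow", "dog", "elephant"]

# Hypothetical image-to-text similarities for a dog image and an elephant image.
i2t_similarity = [
    [0.10, 0.45, 0.12],
    [0.08, 0.11, 0.50],
]

matches = best_matches(i2t_similarity, labels)
for label, score in matches:
    print("best text: {} (similarity {:.2f})".format(label, score))
```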

Visual Grounding

We use ONE-PEACE to perform visual grounding on anime pictures:

import torch
import cv2
from one_peace.models import from_pretrained

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_Grounding",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process raw data
image_text_list = [
    ("assets/pokemons.jpg", "a blue turtle-like pokemon with round head"),
    ("assets/pokemons.jpg", "Bulbasaur"),
    ("assets/pokemons.jpg", "Charmander"),
    ("assets/pokemons.jpg", "Squirtle"),
    ("assets/one_piece.jpeg", "Brook"),
    ("assets/one_piece.jpeg", "Franky"),
    ("assets/one_piece.jpeg", "Monkey D. Luffy"),
    ("assets/one_piece.jpeg", "Nami"),
    ("assets/one_piece.jpeg", "Nico Robin"),
    ("assets/one_piece.jpeg", "Roronoa Zoro"),
    ("assets/one_piece.jpeg", "Tony Tony Chopper"),
    ("assets/one_piece.jpeg", "Usopp"),
    ("assets/one_piece.jpeg", "Vinsmoke Sanji"),
]
(src_images, image_widths, image_heights), src_tokens = model.process_image_text_pairs(
    image_text_list, return_image_sizes=True
)

with torch.no_grad():
    # extract features
    vl_features = model.extract_vl_features(src_images, src_tokens).sigmoid()
    # extract coords
    vl_features[:, ::2] *= image_widths.unsqueeze(1)
    vl_features[:, 1::2] *= image_heights.unsqueeze(1)
    coords = vl_features.cpu().tolist()

# display results
for i, image_text_pair in enumerate(image_text_list):
    image, text = image_text_pair
    img = cv2.imread(image)
    cv2.rectangle(
        img,
        (int(coords[i][0]), int(coords[i][1])),
        (int(coords[i][2]), int(coords[i][3])),
        (0, 255, 0),
        3
    )
    cv2.imshow(text, img)
    cv2.waitKey(3500)
    cv2.destroyAllWindows()
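
The coordinate step in the snippet above multiplies the even-indexed outputs by the image width and the odd-indexed outputs by the image height, turning a normalized box `[x1, y1, x2, y2]` in the range 0 to 1 into pixel coordinates. The following sketch reproduces that arithmetic for a single box in plain Python. The helper name `scale_box` and the example values are hypothetical.

```python
def scale_box(normalized_box, image_width, image_height):
    """Convert a normalized [x1, y1, x2, y2] box to integer pixel coordinates."""
    scaled = list(normalized_box)
    scaled[0::2] = [x * image_width for x in scaled[0::2]]   # x1, x2
    scaled[1::2] = [y * image_height for y in scaled[1::2]]  # y1, y2
    return [int(value) for value in scaled]

# Made-up normalized prediction on a 640x480 image.
box = scale_box([0.25, 0.5, 0.75, 1.0], image_width=640, image_height=480)
print(box)
```

The first two values give the top-left corner and the last two the bottom-right corner, which is the order `cv2.rectangle` receives them in the display loop.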

Audio Classification

We use ONE-PEACE to perform audio classification:

import torch
import json
from one_peace.models import from_pretrained

# load the mapping from class id to label, closing the file afterwards
with open("assets/vggsound_id2label.json") as f:
    id2label = json.load(f)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = from_pretrained(
    "ONE-PEACE_VGGSound",
    model_type="one_peace_classify",
    device=device,
    dtype="float32"
)

# process audio
audio_list = ["assets/cow.flac", "assets/dog.flac"]
src_audios, audio_padding_masks = model.process_audio(audio_list)

with torch.no_grad():
    # compute class logits for each audio clip
    audio_logits = model.extract_audio_features(src_audios, audio_padding_masks)
    print(audio_logits.size())
    predict_label_ids = audio_logits.argmax(1).cpu().tolist()

for audio, predict_label_id in zip(audio_list, predict_label_ids):
    predict_label = id2label[str(predict_label_id)]
    print('audio: {}, predict label: {}'.format(audio, predict_label))
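
The snippet above reports only the single best label per clip. To inspect several candidates with their confidences, the logits can be passed through a softmax and sorted. The sketch below shows this in plain Python for one clip, assuming the model output is a vector of raw logits (for example one row of `audio_logits.tolist()`). The helper name `top_k_predictions`, the label names, and the logit values are hypothetical.

```python
import math

def top_k_predictions(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs for one clip."""
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top_ids = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    return [(id2label[str(i)], probs[i]) for i in top_ids]

# Hypothetical label map and logits, keyed by string ids like the JSON file above.
example_id2label = {"0": "dog barking", "1": "cow lowing", "2": "raining"}
example_logits = [2.0, 0.5, -1.0]

predictions = top_k_predictions(example_logits, example_id2label, k=2)
for label, prob in predictions:
    print("{}: {:.3f}".format(label, prob))
```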

Training & Inference

If the API alone does not cover your needs, we provide comprehensive training and inference instructions for audio & multimodal tasks and for vision tasks.

Gallery

Visual Grounding (unseen domain)

Emergent Zero-shot Retrieval

Acknowledgement

Fairseq: A sequence modeling toolkit with flexible configuration and highly extensible code structure.
xFormers: A toolbox to accelerate research on Transformers.
FlashAttention: A repository that provides the official implementation of FlashAttention, which greatly speeds up multi-head attention.
Apex: A repository that provides useful model acceleration and memory optimization techniques.

Getting Involved

Feel free to submit GitHub issues or pull requests. Contributions to the project are welcome!

To contact us, do not hesitate to send an email to [email protected] or [email protected]!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :)

@article{wang2023one,
  title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
  author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
  journal={arXiv preprint arXiv:2305.11172},
  year={2023}
}
