

PaLM2-VAdapter

Implementation of "PaLM2-VAdapter:" from the multi-modal model paper: "PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter".

The model uses a perceiver resampler with a depth of 1 together with a tiny PaLM decoder to efficiently learn visual features from images and map them into the same embedding space as the large language model.
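Below is a minimal sketch of that idea (illustrative only, not this package's internals): a single cross-attention layer whose learnable latent queries resample image patch features into a fixed set of visual tokens that a small decoder can then consume. The class and argument names here are assumptions for demonstration.

import torch
import torch.nn as nn

class PerceiverResamplerLayer(nn.Module):
    """Depth-1 resampler: learnable latents cross-attend to image patch features."""

    def __init__(self, dim: int, num_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, num_patches, dim)
        b = image_feats.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, image_feats, image_feats)  # latents query the patches
        out = out + q
        return out + self.ff(out)  # (batch, num_latents, dim)

# Toy check: 196 patch embeddings resampled into 64 visual tokens
resampler = PerceiverResamplerLayer(dim=512)
patches = torch.randn(1, 196, 512)
visual_tokens = resampler(patches)
print(visual_tokens.shape)  # torch.Size([1, 64, 512])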

Install

$ pip install palm-vadapter

Usage

import torch
from palm_vadapter.main import PaLM2VAdapter

# Random text token IDs: batch of 1, sequence length 32
text = torch.randint(0, 1000, (1, 32), dtype=torch.long)

# Random image tensor: batch of 1, 3 channels, 224 x 224
img = torch.randn(1, 3, 224, 224)

# Initialize PaLM2VAdapter model
model = PaLM2VAdapter(
    tiny_dim=512,
    dim=512,
    num_tokens=10000,
    seq_length=32,
    depth=6,
    heads=8,
    image_size=224,
    patch_size=16,
)

# Forward pass through the model
out = model(text, img)

# Print the shape of the output
print(out.shape)

License

MIT

Citation

@misc{xiao2024palm2vadapter,
    title={PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter}, 
    author={Junfei Xiao and Zheng Xu and Alan Yuille and Shen Yan and Boyu Wang},
    year={2024},
    eprint={2402.10896},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

Todo

  • Add video processing for every frame (see the sketch below)
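One possible shape for this, reusing the existing single-image path frame by frame (a sketch only; the video tensor layout and the frame loop are assumptions, not part of the package yet):

import torch
from palm_vadapter.main import PaLM2VAdapter

model = PaLM2VAdapter(
    tiny_dim=512,
    dim=512,
    num_tokens=10000,
    seq_length=32,
    depth=6,
    heads=8,
    image_size=224,
    patch_size=16,
)

# Hypothetical video tensor: (batch, frames, channels, height, width)
video = torch.randn(1, 4, 3, 224, 224)
text = torch.randint(0, 1000, (1, 32), dtype=torch.long)

# Run every frame through the existing image path and stack the per-frame outputs
frame_outputs = [model(text, video[:, t]) for t in range(video.shape[1])]
out = torch.stack(frame_outputs, dim=1)
print(out.shape)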
