NaViT

My implementation of "Patch n’ Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution"

Paper: https://arxiv.org/abs/2307.06304

Appreciation

  • Lucidrains
  • Agorians

Install

pip install navit-torch

Usage

import torch
from navit.main import NaViT


n = NaViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,                 # transformer depth; assumed value, required by the constructor
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    token_dropout_prob = 0.1   # drop 10% of patch tokens during training
)

images = [
    [torch.randn(3, 256, 256), torch.randn(3, 128, 128)],
    [torch.randn(3, 256, 256), torch.randn(3, 256, 128)],
    [torch.randn(3, 64, 256)]
]

preds = n(images)
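
With num_classes = 1000 and five images across the three groups above, preds should contain one 1000-way logit vector per image. Because NaViT packs patches from images of different resolutions into a single sequence, each image contributes a number of tokens that depends on its own height and width. The sketch below is illustrative only (plain PyTorch, not the library API) and shows how the token counts for the first group work out with patch_size = 32.

import torch

# Illustrative sketch: per-image patch-token counts for patch_size = 32.
patch_size = 32
group = [torch.randn(3, 256, 256), torch.randn(3, 128, 128)]

tokens_per_image = [
    (img.shape[1] // patch_size) * (img.shape[2] // patch_size) for img in group
]
print(tokens_per_image)       # [64, 16]
print(sum(tokens_per_image))  # 80 tokens packed into one sequence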

Dataset Strategy

Here is a table of the key datasets and their metadata used for pretraining and evaluating NaViT:

| Dataset | Type | Size | Details | Source |
|---|---|---|---|---|
| JFT-4B | Image classification | 4 billion images | Private dataset from Google | [1] |
| WebLI | Image-text | 73M image-text pairs | Web-crawled dataset | [2] |
| ImageNet | Image classification | 1.3M images, 1000 classes | Standard benchmark | [3] |
| ImageNet-A | Image classification | 7,500 images | Out-of-distribution variant | [4] |
| ObjectNet | Image classification | 50K images, 313 classes | Out-of-distribution variant | [5] |
| LVIS | Object detection | 120K images, 1000 classes | Large vocabulary instance segmentation | [6] |
| ADE20K | Semantic segmentation | 20K images, 150 classes | Scene parsing dataset | [7] |
| Kinetics-400 | Video classification | 300K videos, 400 classes | Action recognition dataset | [8] |
| FairFace | Face attribute classification | 108K images, 9 attributes | Balanced dataset for facial analysis | [9] |
| CelebA | Face attribute classification | 200K images, 40 attributes | Face attributes dataset | [10] |

[1] Zhai et al. "Scaling Vision Transformers". 2022. https://arxiv.org/abs/2106.04560
[2] Chen et al. "PaLI". 2022. https://arxiv.org/abs/2209.06794
[3] Deng et al. "ImageNet". 2009. http://www.image-net.org/
[4] Hendrycks et al. "Natural Adversarial Examples". 2021. https://arxiv.org/abs/1907.07174
[5] Barbu et al. "ObjectNet". 2019. https://arxiv.org/abs/1612.03916
[6] Gupta et al. "LVIS". 2019. https://arxiv.org/abs/1908.03195
[7] Zhou et al. "ADE20K". 2017. https://arxiv.org/abs/1608.05442
[8] Kay et al. "Kinetics". 2017. https://arxiv.org/abs/1705.06950
[9] Kärkkäinen and Joo. "FairFace". 2019. https://arxiv.org/abs/1908.04913
[10] Liu et al. "CelebA". 2015. https://arxiv.org/abs/1410.5408

Todo

  • create an example training script (a rough sketch follows below)
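
Until an official script lands, here is a minimal training-loop sketch, assuming the NaViT forward pass accepts nested lists of variable-resolution images as in the Usage section. The dataset, optimizer settings, and depth value are placeholders, not recommendations.

import torch
from navit.main import NaViT

# Minimal, illustrative training-loop sketch (not an official script).
model = NaViT(
    image_size = 256,
    patch_size = 32,
    num_classes = 1000,
    dim = 1024,
    depth = 6,                 # assumed value
    heads = 16,
    mlp_dim = 2048,
    dropout = 0.1,
    emb_dropout = 0.1,
    token_dropout_prob = 0.1
)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = torch.nn.CrossEntropyLoss()

# Placeholder data: replace with a real dataloader yielding (images, labels),
# where `images` is a list of groups of variable-resolution image tensors.
images = [
    [torch.randn(3, 256, 256), torch.randn(3, 128, 128)],
    [torch.randn(3, 64, 256)],
]
labels = torch.randint(0, 1000, (3,))  # one label per image (3 images above)

model.train()
for step in range(10):                 # toy loop; replace with real epochs/batches
    optimizer.zero_grad()
    logits = model(images)             # expected shape: (num_images, num_classes)
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    print(f"step {step}: loss {loss.item():.4f}")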

License

MIT

Citations

@misc{2307.06304,
Author = {Mostafa Dehghani and Basil Mustafa and Josip Djolonga and Jonathan Heek and Matthias Minderer and Mathilde Caron and Andreas Steiner and Joan Puigcerver and Robert Geirhos and Ibrahim Alabdulmohsin and Avital Oliver and Piotr Padlewski and Alexey Gritsenko and Mario Lučić and Neil Houlsby},
Title = {Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution},
Year = {2023},
Eprint = {arXiv:2307.06304},
}