Block diagram representation of the model #290

Answered by lukas-blecher
jaytxrx asked this question in Q&A

Hi, sorry for the late reply, I didn't see the question.

The backbone is a fairly simple ResNet used as a feature extractor (basically a few conv layers with residual connections). The input image is fed into this CNN, which outputs a smaller feature map. That feature map is then split into patches and fed into the ViT. This hybrid architecture is described in Section 3.1 of the original ViT paper (https://arxiv.org/pdf/2010.11929.pdf).
The second image (https://production-media.paperswithcode.com/methods/Screen_Shot_2020-07-20_at_9.17.39_PM_ZHS2kmV.png) looks very similar to the architecture.
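
To make the data flow concrete, here is a minimal shape-tracing sketch in plain Python. It only follows tensor shapes through the pipeline described above (image, CNN backbone, feature map, patches, ViT tokens). The input size, the number of conv stages, and their channel/kernel/stride values are assumptions picked for illustration, not the values used in this repository, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    channels: int
    height: int
    width: int


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def backbone(shape: Shape, stages) -> Shape:
    """Trace a shape through conv stages given as (out_channels, kernel, stride, padding).

    Residual connections add tensors of equal shape, so they do not
    change the shapes traced here.
    """
    for out_channels, kernel, stride, padding in stages:
        shape = Shape(
            out_channels,
            conv_out(shape.height, kernel, stride, padding),
            conv_out(shape.width, kernel, stride, padding),
        )
    return shape


def patchify(shape: Shape, patch: int):
    """Return (number of tokens, values per token) for patch x patch patches."""
    if shape.height % patch or shape.width % patch:
        raise ValueError("feature map must be divisible by the patch size")
    tokens = (shape.height // patch) * (shape.width // patch)
    token_dim = shape.channels * patch * patch
    return tokens, token_dim


# Hypothetical grayscale input image, 64 px high and 256 px wide.
image = Shape(channels=1, height=64, width=256)

# Hypothetical backbone: three stride-2 conv stages (assumed values).
stages = [(64, 7, 2, 3), (128, 3, 2, 1), (256, 3, 2, 1)]

features = backbone(image, stages)            # the smaller feature map
tokens, token_dim = patchify(features, patch=1)

print(features)           # shape of the CNN output
print(tokens, token_dim)  # sequence length and per-token size fed to the ViT
```

With 1x1 patches, which the ViT paper mentions as a special case for the hybrid model, every spatial position of the feature map becomes one token. Each token is then linearly projected to the transformer width and combined with position embeddings before entering the encoder.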

Answer selected by jaytxrx