
add convnext encoder, pytorch transformer decoder #162

Open
rainyl wants to merge 1 commit into base: main
Conversation

@rainyl (Contributor) commented Jun 14, 2022

I got higher scores on a dataset I built myself, but there seems to be little improvement on the dataset you provided, so the main bottleneck may be the dataset itself. Anyway, I decided to open this PR to help make the project stronger.
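For reference, a minimal sketch of the wiring this PR describes, assuming timm's ConvNeXt as the backbone; all module names, sizes, and hyperparameters below are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn
import timm  # assumed available: pip install timm

class ConvNextLatexModel(nn.Module):
    """ConvNeXt encoder -> flattened feature map -> nn.TransformerDecoder."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=512):
        super().__init__()
        # ConvNeXt backbone returning intermediate feature maps
        self.encoder = timm.create_model("convnext_tiny", pretrained=True,
                                         features_only=True)
        enc_channels = self.encoder.feature_info.channels()[-1]
        self.proj = nn.Linear(enc_channels, d_model)   # match decoder width
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        feats = self.encoder(images)[-1]                       # (B, C, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))   # (B, H'*W', D)
        t = tgt_tokens.size(1)
        tgt = self.tok_emb(tgt_tokens) + self.pos_emb[:, :t]
        # causal mask so each token only attends to earlier ones
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=tgt.device), diagonal=1)
        return self.head(self.decoder(tgt, memory, tgt_mask=causal))
```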

@lukas-blecher (Owner)

Thank you very much!
I will look into it in the next week or two and report back to you.
In the meantime, may I ask what kind of data you generated?

@rainyl (Contributor, Author) commented Jun 15, 2022

In fact, I just re-checked the formulas and corrected them, along with applying some data augmentation methods. For example, removing spacing control commands like \hspace and \vspace, because the exact space is hard to measure.
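Roughly like this sketch, assuming a simple regex pass over the LaTeX labels (the pattern and function name are illustrative, not the actual pipeline code):

```python
import re

# matches \hspace{...}, \vspace{...} and their starred variants
SPACING_CMD = re.compile(r"\\[hv]space\*?\s*\{[^{}]*\}")

def strip_spacing(latex: str) -> str:
    """Drop spacing control commands whose exact size can't be measured."""
    return SPACING_CMD.sub("", latex)

print(strip_spacing(r"a \hspace{1em} b \vspace*{2pt} c"))
# -> 'a  b  c' (leftover double spaces can be collapsed in a later pass)
```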

@lukas-blecher (Owner)

I'm wondering how this even works without the proper positional embedding in the encoder (#130).
Maybe max_dimensions is large enough that most images don't actually need positional encoding.

@rainyl (Contributor, Author) commented Jun 24, 2022

The positional embedding is designed for ViT; a CNN's feature extraction doesn't work like a Transformer's (though they may be similar in theory in some respects).
In fact, the encoder is just a feature-extraction step; the difference is the extraction method.

@lukas-blecher (Owner)

I know what you mean. But my understanding is that adding the positional information will stabilize performance.
I haven't calculated it, but I believe the receptive field of the CNN is not large enough to cover the images completely.
In the end, you also flatten the feature map, which destroys any implicit positional information. At that point I would reintroduce this information through a positional embedding.
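Something like the following sketch: a learned row/column embedding summed onto the flattened feature map (a fixed 2D sinusoidal table would work just as well; all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class FlattenWithPos(nn.Module):
    """Flatten a CNN feature map and reintroduce 2D positional information."""

    def __init__(self, channels: int, max_h: int = 32, max_w: int = 64):
        super().__init__()
        # learned row and column embeddings, summed into a 2D position code
        self.row = nn.Parameter(torch.zeros(max_h, channels))
        self.col = nn.Parameter(torch.zeros(max_w, channels))
        nn.init.trunc_normal_(self.row, std=0.02)
        nn.init.trunc_normal_(self.col, std=0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        pos = self.row[:h, None, :] + self.col[None, :w, :]  # (h, w, c)
        x = feats.flatten(2).transpose(1, 2)                 # (b, h*w, c)
        return x + pos.reshape(1, h * w, c)
```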

@rainyl (Contributor, Author) commented Jun 30, 2022

Well, I agree that adding position information may help performance, but I have no time to test it; it would be nice if you had time to do so. However, according to this research, position information added to a CNN can either help or hurt performance.

It is true that the Transformer has a larger receptive field, but that doesn't mean its performance will be better. At least, ConvNeXt (with convolution kernel size 7) performs better than ViT and even than Swin Transformer, according to their paper (link here).


Actually, the reason for the Transformer's success is also controversial: some researchers think it is the success of the global receptive field (i.e. attention), while others think it is the success of the architecture (ConvNeXt's architecture was designed following the Transformer's), or even of the patch-based input (here is a paper that discusses this).

@lukas-blecher (Owner)

Thank you for the input and the papers. I will have a look.
I also started the experiment. Will keep you updated.

@rainyl (Contributor, Author) commented Jul 1, 2022

Great! I am also curious about it, but I just don't have much time :)
