
add convnext encoder, pytorch transformer decoder #162

Open
rainyl wants to merge 1 commit into base: main
Conversation

@rainyl (Contributor) commented Jun 14, 2022

I got higher scores on a dataset I built myself, but there seems to be little improvement on the dataset you provided, so the main bottleneck may be the dataset itself. Anyway, I decided to open this PR to help make the project stronger.
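For reference, a minimal sketch of the wiring this PR describes, assuming timm's ConvNeXt as the backbone; all module names, sizes, and hyperparameters below are illustrative, not the PR's actual code:

```python
import torch
import torch.nn as nn
import timm  # assumed available: pip install timm

class ConvNextLatexModel(nn.Module):
    """ConvNeXt encoder -> flattened feature map -> nn.TransformerDecoder."""

    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4, max_len=512):
        super().__init__()
        # ConvNeXt backbone returning intermediate feature maps
        self.encoder = timm.create_model("convnext_tiny", pretrained=True,
                                         features_only=True)
        enc_channels = self.encoder.feature_info.channels()[-1]
        self.proj = nn.Linear(enc_channels, d_model)   # match decoder width
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, images, tgt_tokens):
        feats = self.encoder(images)[-1]                       # (B, C, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))   # (B, H'*W', D)
        t = tgt_tokens.size(1)
        tgt = self.tok_emb(tgt_tokens) + self.pos_emb[:, :t]
        # causal mask so each token only attends to earlier ones
        causal = torch.triu(torch.full((t, t), float("-inf"),
                                       device=tgt.device), diagonal=1)
        return self.head(self.decoder(tgt, memory, tgt_mask=causal))
```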

@lukas-blecher (Owner)

Thank you very much!
I will look into it in the next week or two and report back to you.
In the meantime, may I ask what kind of data you generated?

@rainyl (Contributor, Author) commented Jun 15, 2022

In fact, I just re-checked the formulas and corrected them, along with applying some data augmentation methods. For example, removing spacing control commands like \hspace and \vspace, because the exact space is hard to measure.
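Roughly like this sketch, assuming a simple regex pass over the LaTeX labels (the pattern and function name are illustrative, not the actual pipeline code):

```python
import re

# matches \hspace{...}, \vspace{...} and their starred variants
SPACING_CMD = re.compile(r"\\[hv]space\*?\s*\{[^{}]*\}")

def strip_spacing(latex: str) -> str:
    """Drop spacing control commands whose exact size can't be measured."""
    return SPACING_CMD.sub("", latex)

print(strip_spacing(r"a \hspace{1em} b \vspace*{2pt} c"))
# -> 'a  b  c' (leftover double spaces can be collapsed in a later pass)
```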

@lukas-blecher (Owner)

I'm wondering how this even works without the proper positional embedding in the encoder (#130).
Maybe max_dimensions is large enough that most images don't actually need positional encoding.

@rainyl (Contributor, Author) commented Jun 24, 2022

The positional embedding is designed for ViT; a CNN's feature extraction doesn't work like a Transformer's (though they may be similar in theory in some respects).
In fact, the encoder is just a feature-extraction step; the difference is the extraction method.

@lukas-blecher (Owner)

I know what you mean. But my understanding is that adding the positional information will stabilize performance.
I haven't calculated it, but I believe the receptive field of the CNN is not large enough to cover the images completely.
In the end, you also flatten the feature map, which destroys any implicit positional information. At that point I would reintroduce this information through a positional embedding.
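Something like the following sketch: a learned row/column embedding summed onto the flattened feature map (a fixed 2D sinusoidal table would work just as well; all names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class FlattenWithPos(nn.Module):
    """Flatten a CNN feature map and reintroduce 2D positional information."""

    def __init__(self, channels: int, max_h: int = 32, max_w: int = 64):
        super().__init__()
        # learned row and column embeddings, summed into a 2D position code
        self.row = nn.Parameter(torch.zeros(max_h, channels))
        self.col = nn.Parameter(torch.zeros(max_w, channels))
        nn.init.trunc_normal_(self.row, std=0.02)
        nn.init.trunc_normal_(self.col, std=0.02)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        pos = self.row[:h, None, :] + self.col[None, :w, :]  # (h, w, c)
        x = feats.flatten(2).transpose(1, 2)                 # (b, h*w, c)
        return x + pos.reshape(1, h * w, c)
```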

@rainyl (Contributor, Author) commented Jun 30, 2022

Well, I agree that adding position information may help performance, but I have no time to test it; it would be nice if you had time to do so. However, according to this research, position information added to a CNN can either help or hurt performance.

It is true that the Transformer has a larger receptive field, but that doesn't mean its performance will be better. At least, ConvNeXt (with convolution kernel size 7) performs better than ViT and even than Swin Transformer, according to their paper (link here).


Actually, the reason for the Transformer's success is also controversial: some researchers think it is the success of the global receptive field (i.e. attention), while others think it is the success of the architecture (ConvNeXt's architecture was designed following the Transformer's), or even of the patch-based input (here is a paper that discusses this).

@lukas-blecher (Owner)

Thank you for the input and the papers. I will have a look.
I also started the experiment. Will keep you updated.

@rainyl (Contributor, Author) commented Jul 1, 2022

Great! I am also curious about it, but I just don't have much time :)
