Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect feature map extraction #138

Open
rmcavoy opened this issue Mar 29, 2020 · 7 comments
Open

Incorrect feature map extraction #138

rmcavoy opened this issue Mar 29, 2020 · 7 comments

Comments

@rmcavoy
Copy link

rmcavoy commented Mar 29, 2020

One of the large differences between this and the original implementation is this implementation follows the diagram in the original paper and extracts feature levels 3, 4, 5, 6 and 7 by changing level 5 and 7 to have a stride of 2. That diagram is essentially a lie as the original implementation uses a unmodified EfficientNet and extracts feature levels 3,5 and 7 and then uses a single 1x1 conv2d and a maxpool to generate a level "8" and a maxpool on the level 8 to extract a level "9".

@dkoguciuk
Copy link

@rmcavoy, could you please elaborate? What do you understand by levels 3, 5 and 7?

I've read the original implementation and they have both 1x1 conv2d and maxpool on both additional levels (P6 and P7) as far as I understand.

@rmcavoy
Copy link
Author

rmcavoy commented Mar 30, 2020

If you were to print out all the feature levels/blocks in the EfficientNet, there are 7 of them and we are extracting the 3rd, 5th and 7th EfficientNet feature level/block. This is also the terminology used in the paper.

Also the extra level P9 (P7 is a weird term because there is already an actual 7th level) does not have a 1x1 because the 1x1 there only applies if the beginning and end will have a different number of channels. P9 is derived from P8 which already has the target number of channels and therefore the 1x1 conv2d never applies because the number of channels is already correct.

@dkoguciuk
Copy link

dkoguciuk commented Mar 30, 2020

I see your point. In the EfficientNet paper, they call it stages (Table 1), so I believe you're referring to the 4th, 6th and 8th stages which are then treated as P3, P4, and P5 levels in the BiFPN network. But I still don't understand why do you think they lied in the paper? According to Table 1 from EfficinetNet:

Feature size dim after stage 4 (level P3 in the BiFPN): [input_size/8, input_size/2^3, 40]
Feature size dim after stage 6 (level P4 in the BiFPN): [input_size/16, input_size/2^4, 112]
Feature size dim after stage 8 (level P5 in the BiFPN): [input_size/32, input_size/2^5, 320]

I believe it's consistent between EfficientNet and EfficientDet paper and their implementation. Why do you think they lied anywhere?

@rmcavoy
Copy link
Author

rmcavoy commented Mar 30, 2020

Its not a lie per se. It is just exceedingly misleading to suggest you extracted feature 3-7 without adding in the extra equations that explain how you manufactured two of the levels from the other three you actually extracted. There's honestly a bunch of equations like 1x1 downsampling that they just left out of the paper.

@dkoguciuk
Copy link

Yes, totally agree.

And again yes: I've double-checked it and you're right: there is no 1x1 Conv2D between P6 and P7 - my bad!

@rwightman
Copy link

@rmcavoy yes, there are no convs in the feature map level rescalling unless there is a channel dim mismatch to bridge, so only for P6, not P7.

For the level naming, regardless of the impl here, I don't see any issues with the naming in paper/official impl... a 'feature level' shouldn't be confused or mixed with a 'stage' in the backbone model. Feature levels are dilineated by stride 2 stages in the backbone, so a stride 2 followed by a stride 1 stage is one feature level. Every 'level''s number reflects the feature map output stride reduction, 2^level is the reduction factor from the input image. A D1 models has a 640x640 input image so I know the feature map resolution at P6 is 640/64=10x10. In that sense it is more confusing to refer to the stages in the backbone as levels but keep them distinct.

@rwightman
Copy link

I do agree that the paper was lacking some detail to fully reproduce the model without making some wrong guesses. I suspect even the official impl may have changed a bit since the paper. The official repo has already released more than one set of checkpoints themselves, fixed some important bugs since going live. It still has some model/checkpoint impacting details I'd consider worth fixing.. like most of the convs having both a bias and a bn.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants