Transformer implementation seemingly not corresponding to paper #46

Open
patrickgadd opened this issue Sep 6, 2022 · 3 comments

@patrickgadd

Hi there,

First off, thank you for open-sourcing this very interesting work!

I'm looking at this repo alongside the paper "Context-Aware Learning to Rank with Self-Attention", and there seems to be a bug: the code doesn't correspond exactly to what's written in the paper:

https://github.com/allegro/allRank/blob/master/allrank/models/transformer.py#L105
Here you seemingly apply LayerNorm to the inputs of the MultiHeadAttention and fully-connected layers.
However, the paper states (in line with what I gather is general practice for Transformers) that LayerNorm is applied to the outputs of these layers.

It seems that this repo isn't actively worked on, but nonetheless I thought I'd let you know.

Thanks & best wishes,
Patrick

@patrickgadd
Author

I believe

    def forward(self, x, sublayer):
        return x + self.dropout(
            sublayer(self.norm(x)))

ought to be


    def forward(self, x, sublayer):
        return self.norm(x + self.dropout(
            sublayer(x)))

So a rather minor thing, but just FYI
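For anyone comparing the two placements side by side, here is a minimal, self-contained sketch of both variants; the class names, constructor arguments, and use of nn.LayerNorm/nn.Dropout are illustrative assumptions modelled on the snippets above, not the repo's exact API:

    import torch.nn as nn


    class PreNormResidual(nn.Module):
        """Current repo behaviour: LayerNorm on the sublayer *input* (pre-LN),
        then dropout and the residual connection."""

        def __init__(self, size: int, dropout: float = 0.1):
            super().__init__()
            self.norm = nn.LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            return x + self.dropout(sublayer(self.norm(x)))


    class PostNormResidual(nn.Module):
        """Paper wording: LayerNorm on the *output* of the residual connection
        (post-LN), as in the original Transformer."""

        def __init__(self, size: int, dropout: float = 0.1):
            super().__init__()
            self.norm = nn.LayerNorm(size)
            self.dropout = nn.Dropout(dropout)

        def forward(self, x, sublayer):
            return self.norm(x + self.dropout(sublayer(x)))

Both wrappers take sublayer as a callable (the self-attention or feed-forward block), so the only difference between them is where the norm sits relative to the residual addition.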

@PrzemekPobrotyn
Contributor

Hi,

Thank you for the issue and the PR. You're right that there is a discrepancy between our implementation and the paper in the placement of the layer norm, and your proposed fix would indeed resolve it. However, having consulted with the rest of the team working on the paper, we decided it's best to leave the code as is.

The reason is that even though the paper says the layer norm is applied as the very last step, to the output of the residual connection, the results reported in the paper were obtained with the implementation as in this repo, i.e. with the layer norm applied to the input of the attention/FFN sublayers, before dropout and the residual connection. Following the steps in our reproducibility guide, https://github.com/allegro/allRank/tree/master/reproducibility, one can obtain the exact same results as reported in the paper. If we changed the placement of the layer norm, the implementation would match what's written in the paper, but we would break the reproducibility of the results. Hence we decided against it.

Once again, thanks for pointing out the issue, and I hope you understand our rationale.

Best,
Przemek
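For readers wondering why this matters for reproducibility: the two placements are not numerically equivalent, so swapping them would change the model's outputs and hence the reported results. A quick self-contained check, with made-up dimensions and random weights:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    d_model = 8
    x = torch.randn(2, 5, d_model)          # (batch, list length, features)
    norm = nn.LayerNorm(d_model)
    sublayer = nn.Linear(d_model, d_model)  # stand-in for attention / FFN

    pre_ln = x + sublayer(norm(x))          # as implemented in the repo
    post_ln = norm(x + sublayer(x))         # as written in the paper

    print(torch.allclose(pre_ln, post_ln))  # False: the placements differ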

@patrickgadd
Author

That is perfectly reasonable and understandable.
Hopefully any other curious party who wonders about this will find the explanation here.

Thank you for taking the time to look into it and get back on this.

Best,
Patrick
