
Use fuse multi head att #417

Open

xiezipeng-ML wants to merge 14 commits into main

Conversation

xiezipeng-ML (Contributor)

No description provided.

xiezipeng-ML (Contributor Author) commented Nov 1, 2022

batch size = 4, acc step = 8, AMP enabled, activation checkpointing enabled

| 1n1g       | use_fuse_multi_head_att = False | use_fuse_multi_head_att = True |
| ---------- | ------------------------------- | ------------------------------ |
| Throughput | 151.70 samples/s                | 155.41 samples/s               |
| GPU Memory | 3147 MiB                        | 3129 MiB                       |

fuse_multi_head_att is used in both the self_att and cross_att of the encoder and the decoder.
I ran a quick test on machine 28; the improvement is limited, most likely because transpose is called too many times. In the next commit I plan to drop the if/else and use fuse_multi_head_att by default, then benchmark again.

@chengtbf @strint @ouyangyu @CPFLAME

Comment on lines 177 to 178
if self.multihead_attn_fusion:
    hidden_states = hidden_states.transpose(0, 1)
Contributor

Here, every time cross_attention runs, a transpose is needed.

Is there a way to do the transpose only once, outside?


The original fused multi-head attention does a single transpose at the very start of the network, so that the data inside every transformer layer stays in the transposed layout (one could transpose back at the loss stage, though Megatron apparently keeps the transposed layout even there). Skipping the per-layer transpose inside each layer is what actually delivers the speedup.
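
As illustration only, a minimal sketch of that idea, with hypothetical `embedding`/`layers` names (not the actual LiBai modules): transpose once after the embedding, keep [seq_len, batch, hidden] through every layer, and transpose back (if desired) only at the end.

```python
def transformer_forward(embedding, layers, tokens):
    # Embedding output: [batch_size, seq_len, hidden_size]
    hidden_states = embedding(tokens)
    # One transpose at network entry puts data in the fused-attention
    # layout [seq_len, batch_size, hidden_size].
    hidden_states = hidden_states.transpose(0, 1)
    for layer in layers:
        # Each layer consumes and produces [seq_len, batch_size, hidden_size],
        # so no per-layer transpose is needed inside attention.
        hidden_states = layer(hidden_states)
    # Optional: transpose back before the loss; Megatron reportedly keeps
    # the transposed layout even in the loss computation.
    return hidden_states.transpose(0, 1)
```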

chengtbf commented Nov 1, 2022

Reference code:

By default, what enters attention is already [sq, b, h]; batch size sits on dim 1.
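
A shape walkthrough only (hypothetical sizes, not the referenced code): with this convention, splitting the hidden dim into heads leaves dims 0 and 1 untouched.

```python
import oneflow as flow

sq, b, h, num_heads = 128, 4, 512, 8
hidden_states = flow.randn(sq, b, h)  # [sq, b, h]: batch size on dim 1

# The head split keeps seq_len and batch in place:
head_dim = h // num_heads
qkv = hidden_states.view(sq, b, num_heads, head_dim)
print(qkv.shape)  # expected: oneflow.Size([128, 4, 8, 64])
```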


You can refer to the earlier code from 程鹏 and 星宇:

xiezipeng-ML (Contributor Author)

  • Hacked this a bit: self_att and cross_att both use fuse_multi_head_att, and the attention layer now defaults to fuse_multi_head_att. Only 3 extra transposes are required in total: one on the output of encoder_embedding, one on the output of decoder_embedding, and one on the logits received by the loss (see the sketch after this list).
  • If the data were preprocessed directly into shape [seq_len, batch_size], the 3 transposes above could be dropped.
  • Verified with the unit test in this PR that the modified model aligns with huggingface: tests/model_utils/test_mt5_loader_2.py
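
A minimal sketch of where those three transposes would sit in the encoder-decoder forward; all module names here are hypothetical (assuming embeddings emit [batch_size, seq_len, hidden_size]):

```python
def mt5_forward(encoder_embedding, encoder, decoder_embedding, decoder,
                lm_head, encoder_input_ids, decoder_input_ids):
    # Transpose 1: encoder embedding output -> [seq_len, batch_size, hidden_size]
    enc_states = encoder_embedding(encoder_input_ids).transpose(0, 1)
    enc_states = encoder(enc_states)

    # Transpose 2: decoder embedding output -> [seq_len, batch_size, hidden_size]
    dec_states = decoder_embedding(decoder_input_ids).transpose(0, 1)
    # self_att and cross_att both run in the fused layout, so the layers
    # themselves need no extra transposes.
    dec_states = decoder(dec_states, enc_states)

    # Transpose 3: logits back to [batch_size, seq_len, vocab_size] for the loss.
    logits = lm_head(dec_states).transpose(0, 1)
    return logits
```

If the dataloader already produced [seq_len, batch_size] inputs, all three transposes would disappear, matching the second bullet above.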

@chengtbf @CPFLAME @strint @ouyangyu
