Diffs to upstream megatron as a basis for discussion towards TE integration #1185

Draft · wants to merge 3 commits into base: main

Conversation

@tf-nv (Contributor) commented Mar 13, 2024

Here are three commits:

  1. One with the full diff of GPT-NeoX's megatron folder against current upstream Megatron-LM. That's 256 files with ~60k lines; however, most of those files are either completely new or deleted.
  2. One with only the new files
  3. One with only the modified files (state as seen in GitHub's PR view, ~5k lines across 17 files)

This PR is not meant to be merged, but to study the feasibility of tracking upstream Megatron more closely.

Updating GPT-NeoX's megatron folder to upstream would:

  1. Avoid divergence from upstream, as opposed to introducing TE "manually"
  2. Most likely yield better performance for the same effort: TE has some more complex features, e.g. overlapping communication with GEMMs for TP workloads, and upstream Megatron is a vetted implementation.
  3. Be more effort than just replacing a few layers with TE and introducing autocasting (a rough sketch of that minimal route follows below).
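
To make point 3 concrete, here is roughly what the "replace a few layers and autocast" route looks like with TE's PyTorch API. The shapes and the placement are hypothetical, purely for illustration rather than actual GPT-NeoX call sites, and it assumes FP8-capable hardware:

```python
# Minimal sketch of the "replace a few layers + autocast" route.
# Shapes are made up; this is not an actual GPT-NeoX call site.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# A te.Linear standing in for a plain linear layer; parameters are kept in bf16.
linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16)
inp = torch.randn(32, 4096, dtype=torch.bfloat16, device="cuda")

# Delayed-scaling fp8 recipe: the GEMM runs in fp8, the stored weights stay bf16.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = linear(inp)
```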

The new or deleted files are basically the non-overlapping feature sets of upstream Megatron-LM and GPT-NeoX. So let's have a look at the modified files. The hard ones are:

  • megatron/checkpointing.py is basically a full replacement. I don't think this affects TE: as far as I can tell, the weights are all still bf16 and are only scaled and cast to fp8 for the GEMMs (see the sketch after this list).
  • Fused kernels: TE has e.g. a fused RoPE kernel, which we could use instead of the GPT-NeoX one. Not sure about the other kernels.
  • megatron/initialize.py is key to the parallelism configs and will be a hard nut to crack. Not sure how easily the DeepSpeed parallelism and TE can be integrated.
  • megatron/model/transformer.py This is the first part of the crux: basically a full replacement, plus GPT-NeoX supports some additional model architectures (e.g. Retro).
  • megatron/training.py This is the second part of the crux; the training loop is of course at the core of both libraries.
  • Utils / Tokenizers could be easier, as they only depend on the others.
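
To illustrate the checkpointing point: with TE modules, the stored parameters stay in bf16 and the fp8 cast happens only inside the GEMM under fp8_autocast, so the checkpoint contents should be largely unchanged apart from TE's extra fp8 scaling state. A minimal sketch, with hypothetical shapes and the caveat that exact state_dict keys may differ across TE versions:

```python
# Sketch: parameters in a TE module stay bf16; fp8 is only an execution-time cast.
# The state_dict therefore looks like a normal bf16 checkpoint, plus TE's fp8
# scaling metadata stored as extra state.
import torch
import transformer_engine.pytorch as te

linear = te.Linear(4096, 4096, params_dtype=torch.bfloat16)  # allocated on the GPU by default
print(linear.weight.dtype)                  # torch.bfloat16 -- this is what lands in the checkpoint
print(sorted(linear.state_dict().keys()))   # e.g. ['_extra_state', 'bias', 'weight']
```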

What do you think? Which way should we go?
