Diffs to upstream megatron as a basis for discussion towards TE integration #1185

Draft · wants to merge 3 commits into base: main

Conversation

@tf-nv (Contributor) commented Mar 13, 2024

Here are three commits:

  1. One with the full diff of GPT-NeoX's megatron folder against current upstream Megatron-LM. That's 256 files with ~60k lines; however, most of those files are either completely new or deleted.
  2. One with only the new files
  3. One with only the modified files (state as seen in GitHub's PR view, ~5k lines across 17 files)

This PR is not meant to be merged, but to study the feasibility of tracking upstream Megatron more closely.

Updating GPT-NeoX's megatron folder to upstream would:

  1. Avoid divergence from upstream, as opposed to introducing TE "manually"
  2. Most likely yield better performance for the same effort: TE has some more complex features, e.g. overlapping communication with GEMMs for TP workloads, and upstream Megatron is a vetted implementation.
  3. Be more effort than just replacing a few layers with TE and introducing autocasting (a rough sketch of that minimal route follows below).
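
To make point 3 concrete, here is roughly what the "replace a few layers and autocast" route looks like with TE's PyTorch API. The shapes and the placement are hypothetical, purely for illustration rather than actual GPT-NeoX call sites, and it assumes FP8-capable hardware:

```python
# Minimal sketch of the "replace a few layers + autocast" route.
# Shapes are made up; this is not an actual GPT-NeoX call site.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# A te.Linear standing in for a plain linear layer; parameters are kept in bf16.
linear = te.Linear(4096, 4096, bias=True, params_dtype=torch.bfloat16)
inp = torch.randn(32, 4096, dtype=torch.bfloat16, device="cuda")

# Delayed-scaling fp8 recipe: the GEMM runs in fp8, the stored weights stay bf16.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID,
                            amax_history_len=16,
                            amax_compute_algo="max")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = linear(inp)
```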

The new or deleted files are basically the non-overlapping feature sets of upstream Megatron-LM and GPT-NeoX. So let's have a look at the modified files. The hard ones are:

  • megatron/checkpointing.py is basically a full replacement. I don't think this affects TE: as far as I can tell, the weights are all still bf16 and are only scaled and cast to fp8 for the GEMMs (see the sketch after this list).
  • Fused kernels: TE has e.g. a fused RoPE kernel, which we could use instead of the GPT-NeoX one. Not sure about the other kernels.
  • megatron/initialize.py is key to the parallelism configs and will be a hard nut to crack. Not sure how easily the DeepSpeed parallelism and TE can be integrated.
  • megatron/model/transformer.py This is the first part of the crux: basically a full replacement, plus GPT-NeoX supports some additional model architectures (e.g. Retro).
  • megatron/training.py This is the second part of the crux; the training loop is of course at the core of both libraries.
  • Utils / Tokenizers could be easier, as they only depend on the others.
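
To illustrate the checkpointing point: with TE modules, the stored parameters stay in bf16 and the fp8 cast happens only inside the GEMM under fp8_autocast, so the checkpoint contents should be largely unchanged apart from TE's extra fp8 scaling state. A minimal sketch, with hypothetical shapes and the caveat that exact state_dict keys may differ across TE versions:

```python
# Sketch: parameters in a TE module stay bf16; fp8 is only an execution-time cast.
# The state_dict therefore looks like a normal bf16 checkpoint, plus TE's fp8
# scaling metadata stored as extra state.
import torch
import transformer_engine.pytorch as te

linear = te.Linear(4096, 4096, params_dtype=torch.bfloat16)  # allocated on the GPU by default
print(linear.weight.dtype)                  # torch.bfloat16 -- this is what lands in the checkpoint
print(sorted(linear.state_dict().keys()))   # e.g. ['_extra_state', 'bias', 'weight']
```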

What do you think? Which way should we go?
