llama2 training hangs when pp_size > 1 #9146
Labels: bug
Comments
@maanug-nv, could you look at this one?
@ericharper @maanug-nv
Describe the bug
I am following the guide to fine-tune the llama2-7B model on 2 nodes (H100).
My training hangs at the dataloader sanity check.
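(For context: by "sanity check" I mean the short validation pass that PyTorch Lightning presumably runs before the first training step, which NeMo exposes through the trainer config. If it helps with triage, I believe it can be skipped with an override along these lines; this is only a sketch using the standard trainer config key, and the script name below is a placeholder:)

```bash
# Sketch only: skip the pre-training validation sanity pass.
# num_sanity_val_steps is the standard PyTorch Lightning trainer option;
# the script name is a placeholder, not my actual launch script.
python megatron_gpt_finetuning.py trainer.num_sanity_val_steps=0
```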
Steps/Code to reproduce bug
Docker image: nvcr.io/nvidia/nemo:24.03.01.framework
Follow the guide to run llama2-7B.
Command I run on each node:
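Roughly along these lines (a hypothetical sketch: the script path, master address, and parallelism values below are placeholders, not my exact command):

```bash
# Hypothetical per-node launch; every path and value is a placeholder.
# TP=4 x PP=2 across 2 nodes x 8 GPUs = 16 H100s, matching the setup below.
torchrun --nproc_per_node=8 --nnodes=2 \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} --master_port=29500 \
    /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_finetuning.py \
    trainer.num_nodes=2 \
    trainer.devices=8 \
    model.tensor_model_parallel_size=4 \
    model.pipeline_model_parallel_size=2 \
    model.restore_from_path=/workspace/llama2-7b.nemo
```

The pp_size in the title corresponds to model.pipeline_model_parallel_size; the hang shows up when it is set above 1.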
Expected behavior
Training proceeds past the dataloader sanity check and fine-tuning starts normally when pp_size > 1.
Environment overview
docker pull & docker run commands used: docker pull nvcr.io/nvidia/nemo:24.03.01.framework
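For completeness, a typical docker run for this container on each node would look roughly like this (a sketch; the mount path and flags are assumptions, not the exact command used):

```bash
# Hypothetical docker run; the workspace mount is a placeholder.
# --network host and --ipc=host are the usual flags for multi-node NCCL traffic.
docker run --gpus all --ipc=host --network host \
    --ulimit memlock=-1 --ulimit stack=67108864 \
    -v /path/to/workspace:/workspace \
    -it nvcr.io/nvidia/nemo:24.03.01.framework bash
```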
Environment details
The NVIDIA docker image above is used, so no further environment details should be needed.
Additional context
GPU model: 16xH100
Please let me know if any other information is needed. Thank you!