feat: New Gen AI pattern - Llama2 Distributed Pre-training on Trn1 with RayTrain and KubeRay Operator #536

Merged
vara-bonthu merged 13 commits into main from reatrain-llama2-trn1 on May 26, 2024

Conversation

@vara-bonthu (Contributor) commented on May 19, 2024


What does this PR do?

🛑 Please open an issue first to discuss any significant work and flesh out details/direction - we would hate for your time to be wasted.
Consult the CONTRIBUTING guide for submitting pull requests.

- Adds a new pattern for Llama2 Distributed Pre-training on Trn1 with RayTrain and the KubeRay Operator (see the sketch below for the general shape of the RayTrain entrypoint).
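The blueprint's actual training script ships with the example in this PR; purely as an illustration of the RayTrain-on-KubeRay pattern described here, a minimal TorchTrainer entrypoint might look like the following. The worker count, epoch count, and the "neuron_cores" resource key are assumptions for the sketch, not values taken from the blueprint.

```python
# Hypothetical sketch of a RayTrain entrypoint; not the blueprint's actual code.
import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Each Ray worker runs this function. The real blueprint would build the
    # Llama2 model, load sharded data, and run the Neuron/XLA training loop here.
    for epoch in range(config["epochs"]):
        pass  # forward/backward pass on a Trainium device


if __name__ == "__main__":
    # When submitted as a RayJob, this connects to the KubeRay-managed cluster.
    ray.init()

    trainer = TorchTrainer(
        train_loop_per_worker,
        train_loop_config={"epochs": 1},
        # "neuron_cores" is assumed here as the Ray resource name for Trainium
        # cores; adjust num_workers and resources to match the trn1 cluster.
        scaling_config=ScalingConfig(
            num_workers=32,
            resources_per_worker={"neuron_cores": 1},
        ),
    )
    result = trainer.fit()
    print(result)
```

In the blueprint, a script along these lines would be packaged into the container image and launched by the KubeRay Operator across the Trn1 worker group, with Ray handling worker placement and fault tolerance.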

Motivation

- To provide a robust solution for distributed pre-training of Llama2 on AWS Trainium instances, leveraging RayTrain and the KubeRay Operator for efficient, scalable training workflows.

More

  • Yes, I have tested the PR using my local account setup (Provide any test evidence report under Additional Notes)
  • Mandatory for new blueprints. Yes, I have added an example to support my blueprint PR
  • Mandatory for new blueprints. Yes, I have updated the website/docs or website/blog section for this feature
  • Yes, I ran pre-commit run -a with this PR. Link for installing pre-commit locally

For Moderators

  • E2E test successfully completed before merge?

Additional Notes

@vara-bonthu (Contributor, Author) commented: @5cp please review the PR. Thanks

@vara-bonthu merged commit 5a2d1df into main on May 26, 2024
34 of 36 checks passed
@vara-bonthu deleted the reatrain-llama2-trn1 branch on May 26, 2024 at 04:19