Pull request #9161

Fix MoE EP rank when TP is set at the same time #1957

Re-run triggered May 13, 2024 19:48

gdengk:gaod/moe/fix_nemo_ep_rank

Status Failure

Total duration 1h 49m 42s

Artifacts –

cicd-main.yml

on: pull_request

cicd-cluster-clean

cicd-test-container-setup

L0_Unit_Tests_GPU

L0_Unit_Tests_CPU

L2_Community_LLM_Checkpoints_tests_Llama

L2_Community_LLM_Checkpoints_tests_StarCoder

L2_Community_LLM_Checkpoints_tests_Falcon

ASR_dev_run_Speech_to_Text

ASR_dev_run_Speech_to_Text_WPE_-_CitriNet

ASR_dev_run_Speech_Pre-training_-_CitriNet

ASR_dev_run_Speech_To_Text_Finetuning

ASR_dev_run_Speech_to_Text_WPE_-_Conformer

ASR_dev_run-part_two_Speech_to_Text_WPE_-_Squeezeformer

L2_Speech_to_Text_EMA

L2_Speaker_dev_run_Speaker_Recognition

L2_Speaker_dev_run_Speaker_Diarization

L2_Speaker_dev_run_Speech_to_Label

L2_Speaker_dev_run_Speaker_Diarization_with_ASR_Inference

L2_Speaker_dev_run_Clustering_Diarizer_Inference

L2_Speaker_dev_run_Neural_Diarizer_Inference

L2_Speaker_dev_run_Multispeaker_ASR_Data_Simulation

L2_ASR_Multi-dataloader_dev_run_Speech_to_Text_multi-dataloader

L2_ASR_Multi-dataloader_dev_run_Speech_to_Label_multi-dataloader

L2_ASR_Adapters_Linear_Adapters

L2_ASR_Adapters_RelPos_MHA_Adapters

L2_Speech_Transcription_Speech_to_Text_Transcribe

L2_Transducer_alignment_Running_pytest

L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Eng_CitriNet_with_wav

L2_Segmentation_Tool_Parallel_ctc_segmentation_test_L2_Ru_QN_with_mp3

L2_G2P_Models_G2P_Conformer_training_evaluation_and_inference

L2_G2P_Models_HeteronymClassificationModel_training_evaluation_and_inference

L2_Dialogue_Classification_Intent_and_slot_classification_using_SGDQA

L2_Dialogue_Classification_Intent_and_slot_classification_using_IntentSlotClassificationModel

L2_Dialogue_Classification_Intent_classification_using_ZeroShotIntentModel

L2_Dialogue_Classification_Design_Intent_classification_using_ZeroShotIntentModel

L2_Dialogue_Classification_Design_Intent_classification_using_ZeroShotIntentModel_BART_Classifier

L2_Dialogue_Classification_Design_Intent_classification_using_DialogueNearestNeighbourModel

L2_Dialogue_Generation_Dialogue_Answer_Extender_using_DialogueS2SGenerationModel

L2_Dialogue_Generation_Dialogue_SGD_Based_Answer_Extender_using_DialogueS2SGenerationModel

L2_COPY_Dialogue_Answer_Extender_using_DialogueGPTGenerationModel

L2_Duplex_Text_Normalization_with_Tarred_dataset

L2_BERT_Text_Classification_with_BERT_Test

L2_Parallel_BERT_Question-Answering_SQUAD_v1_1

L2_Parallel_BERT_Question-Answering_SQUAD_v2_0

L2_Parallel_BART_Question-Answering_SQUAD_v1_1

L2_Parallel_BART_Question-Answering_SQUAD_v2_0

L2_Parallel_GPT2_Question-Answering_SQUAD_v1_1

L2_Parallel_GPT2_Question-Answering_SQUAD_v2_0

L2_Intent_and_Slot_Classification_Tasks_Intent_and_Slot_Classification

L2_Intent_and_Slot_Classification_Tasks_Multi-Label_Intent_and_Slot_Classification

L2_Parallel_NLP_Examples2_NER_finetuning_from_pretrained_Test

L2_Parallel_NLP_Examples2_Punctuation_and_capitalization_finetuning_from_pretrained_test

L2_Parallel_NLP_Examples2_NER_with_TurkuNLP__bert-base-finnish-cased-v1

L2_Parallel_NLP_Examples2_Evaluation_script_for_Token_Classification

L2_Parallel_NLP_Examples2_Evaluation_script_for_Punctuation

L2_Parallel_NLP_Examples2_Punctuation_Capitalization_2GPUs_with_DistilBERT_Finetuning_on_other_data

Punctuation_Capitalization_tarred_dataset_create_and_use_tarred_dataset

Punctuation_Capitalization_Using_model-common_datasets_parameters-label_vocab_dir

Punctuation_Capitalization_inference_Restore_punctuation_and_capitalization_in_long_text

L2_Pretraining_BERT_pretraining_from_Text

L2_Pretraining_BERT_from_Preprocessed

L2_Entity_Linking_Self_Alignment_Pretraining_BERT

L2_NMT_Attention_is_All_You_Need_Training_NMT_Training_Post-LN

L2_NMT_Attention_is_All_You_Need_Training_NMT_Training_Pre-LN

L2_NMT_Attention_is_All_You_Need_Training_NMT_Multi-Validation

L2_NMT_Attention_is_All_You_Need_Inference

L2_NMT_Attention_is_All_You_Need_Finetuning

L2_NMT_Tarred_Dataset_Creation_Auto_Tarred_Dataset_Creation

L2_NMT_Tarred_Dataset_Creation_Script_Tarred_Dataset_Creation

L2_Megatron_NMT_Training_TP2

L2_Megatron_BART_Perceiver_MIM_Training_TP2

L2_Megatron_Bert_Pretraining_and_Resume_Training_with_Pipeline_Parallelism

L2_Megatron_Bert_Pretraining_and_Resume_Training

L2_Megatron_Core_Bert_Pretraining_and_Resume_Training

L2_Megatron_RETRO_Pretraining_and_Resume_Training

L2_Legacy_Megatron_RETRO_Pretraining_and_Resume_Training

L2_BioMegatron_Bert_NER_Task

L2_Megatron_GPT_Pretraining_and_Resume_Training_TP2

L2_Megatron_GPT_with_Rope_Pretraining_and_Resume_Training_TP2

L2_Megatron_GPT_with_ALiBi_Pretraining_and_Resume_Training_TP2

L2_Megatron_GPT_with_KERPLE_Pretraining_and_Resume_Training_TP2

L2_Megatron_GPT_Pretraining_and_Resume_Training_PP2

L2_Megatron_GPT_Finetuning_PP2

L2_Megatron_GPT_Finetuning_StarCoder_PP1

L2_Megatron_GPT_Embedding

L2_Megatron_GPT_PEFT_Lora_PP2

L2_Megatron_GPT_PEFT_Lora_TP2

L2_Megatron_GPT_Eval

L2_Megatron_GPT_Eval_PP2

L2_Megatron_GPT_SFT_Eval_inference_seq_len_greaterThan_training_seq_len

L2_Megatron_Change_Partitions_Reduce_TP_Num_Partitions_-2_to_1-_and_PP_Num_Partitions_-1_to_2

L2_Megatron_Change_Partitions_Increase_TP_Num_Partitions_-2_to_4-_and_PP_Num_Partitions_-1_to_2

L2_Megatron_T5_Pretraining_and_Resume_Training_TP2

L2_Megatron_T5_with_ALiBi_Pretraining_and_Resume_Training_TP2

L2_Megatron_T5_with_KERPLE_Pretraining_and_Resume_Training_TP2

L2_Megatron_T5_Pretraining_and_Resume_Training_PP2

L2_Megatron_T5_w_Mixture_of_Expert_Pretraining

L2_Megatron_UL2_Pretraining_and_Resume_Training_TP2

L2_Megatron_T5_Eval

L2_Megatron_BART_Pretraining_and_Resume_Training_TP2

L2_Megatron_BART_Pretraining_and_Resume_Training_PP2

L2_Megatron_T5_GLUE_RTE

L2_Megatron_T5_GLUE_XNLI

L2_Megatron_T5_PEFT_Lora_TP2

L2_Megatron_Mock_Data_Generation_MockGPTDataset

L2_Megatron_Mock_Data_Generation_MockT5Dataset

L2_TTS_Fast_dev_runs_1_Tacotron_2

L2_TTS_Fast_dev_runs_1_WaveGlow

L2_TTS_Fast_dev_runs_1_FastPitch

L2_TTS_Fast_dev_runs_1_Mixer-TTS

L2_TTS_Fast_dev_runs_1_Hifigan

Speech_Checkpoints_tests

L0_Setup_Test_Data_And_Models

L2_Community_LLM_Checkpoints_tests_Llama3

L2_PTQ_Llama2_Export_Only

OPTIONAL_ASR_dev_run_Speech_To_Text_HF_Finetuning

Annotations

2 errors and 2 warnings

Speech_Checkpoints_tests

The job running on runner azure-gpu-vm-runner4 has exceeded the maximum execution time of 10 minutes.

Speech_Checkpoints_tests

The operation was canceled.
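
The 10-minute limit in the annotation above is the kind of cap a job-level `timeout-minutes` setting enforces in a GitHub Actions workflow. A minimal sketch, assuming the limit is configured that way in cicd-main.yml: the `runs-on` label, the placeholder step, and the value 20 are hypothetical and not taken from the actual workflow. Raising the limit would only help if the job is slow rather than hung.

```yaml
jobs:
  Speech_Checkpoints_tests:
    # Hypothetical runner label; the real value is not shown on this page.
    runs-on: self-hosted
    # Job-level limit in minutes; the run above was canceled at 10.
    timeout-minutes: 20  # illustrative value, an assumption
    steps:
      - name: Run checkpoint tests  # placeholder step
        run: echo "placeholder for the actual test command"
```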

L2_Community_LLM_Checkpoints_tests_Llama3

Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20: actions/checkout@v2. For more information see: https://github.blog/changelog/2023-09-22-github-actions-transitioning-from-node-16-to-node-20/.

L2_Community_LLM_Checkpoints_tests_Llama3

The following actions uses node12 which is deprecated and will be forced to run on node16: actions/checkout@v2. For more info: https://github.blog/changelog/2023-06-13-github-actions-all-actions-will-run-on-node16-instead-of-node12-by-default/
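
Both warnings point at `actions/checkout@v2`. A hedged sketch of the kind of change that would likely clear them, assuming the L2_Community_LLM_Checkpoints_tests_Llama3 job checks out the repository with a plain `uses:` step. The surrounding job layout and runner label are illustrative only. `actions/checkout@v4` runs on Node.js 20.

```yaml
jobs:
  L2_Community_LLM_Checkpoints_tests_Llama3:
    runs-on: self-hosted  # hypothetical runner label
    steps:
      # v4 of the checkout action targets the Node.js 20 runtime,
      # replacing the deprecated v2 pin named in the warnings.
      - name: Checkout repository
        uses: actions/checkout@v4
```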