Training hangs during "Building trainer" #46

Open
coderchem opened this issue Jan 11, 2024 · 1 comment

Comments

@coderchem

Hi, I'm using the sample test set and trying to run through the README. But I found that training hangs and then times out:
[batch=23/3200]:
Train time/batch: 22
Train time/sample: 198
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 54
Train time/token: 811008
Train time/token_in_epoch: 221184
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/stackexchange_weight: 0.0200
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0250
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 36.8820
Train memory/current_active_mem: 36.8820
Train memory/current_inactive_mem: 0.1744
Train memory/current_reserved_mem: 55.9060
Train memory/peak_allocated_mem: 42.9380
Train memory/peak_active_mem: 42.9380
Train memory/peak_inactive_mem: 7.8742
Train memory/peak_reserved_mem: 55.9060
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0129
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0128
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0129
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0209
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 1.4801
Train loss/train/ce_loss: 1.4716
Train loss/train/lag_loss: 0.0085
Train metrics/train/LanguageCrossEntropy: 1.4716
Train metrics/train/Perplexity: 4.3561
Train metrics/train/cc_LanguageCrossEntropy: 1.1558
Train metrics/train/cc_count: 65
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 7
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
Train metrics/train/stackexchange_count: 3
Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 6
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
Train metrics/train/c4-rp_count: 111
Train throughput/batches_per_sec: 0.0914
Train throughput/samples_per_sec: 0.8223
Train throughput/device/batches_per_sec: 0.0305
Train throughput/device/samples_per_sec: 0.2741
Train throughput/tokens_per_sec: 3368.2385
Train throughput/device/tokens_per_sec: 1122.7462
Train throughput/flops_per_sec: 157886485043818.8125
Train throughput/device/flops_per_sec: 52628828347939.6016
Train throughput/device/mfu: 0.1687
Train time/train: 0.0709
Train time/val: 0.0000
Train time/total: 0.0709
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0688
Train lr-DecoupledAdamW/group2: -0.0688
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
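
(The Timeout(ms)=1800000 above is the 30-minute NCCL collective default. As a diagnostic only, this can be raised where the process group is initialized; in a Composer-based run the trainer does that for you, so the raw PyTorch call below is just a sketch to illustrate the knob, and the 2-hour value is an arbitrary choice. `timeout` is a standard `torch.distributed.init_process_group` argument.)

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the 30-minute default to 2 hours.
# This is a diagnostic stopgap only: a rank that has genuinely exhausted
# its data will still hang the all-gather; the failure just surfaces later.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```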

@Forival

Forival commented Jan 11, 2024


This is because the sample test set contains very little data: after 23 batches one of the domains is exhausted, training stalls on one of the GPUs, and the remaining ranks block in the collective until the NCCL watchdog times out. You need to process the raw RedPajama data to meet the data requirements. A rough estimate of per-domain demand is sketched below.
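
A back-of-the-envelope sketch (not from the repo) estimating how many samples each domain must supply, using only numbers from the log above. It treats the logged domain weights as constant, even though they are adjusted during training, and infers the global batch size of 9 from time/sample 198 ÷ time/batch 22:

```python
# Per-domain sample demand for the full run, from the logged weights.
weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045,
    "stackexchange": 0.02, "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}
total_batches = 3200
batch_size = 198 // 22  # 9 samples per global batch, inferred from the log

for domain, w in weights.items():
    need = w * total_batches * batch_size
    print(f"{domain}: ~{need:.0f} samples needed for the full run")
```

Even the smallest-weight domain (stackexchange, 0.02) needs roughly 576 samples over 3200 batches, far more than a toy sample set provides.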
