Training hangs during "Building trainer" #46

Open
coderchem opened this issue Jan 11, 2024 · 1 comment

Comments

@coderchem

Hi, I'm using the sample test set and trying to run through the README. But I found that training hangs and then times out:
[batch=23/3200]:
Train time/batch: 22
Train time/sample: 198
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 54
Train time/token: 811008
Train time/token_in_epoch: 221184
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/stackexchange_weight: 0.0200
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0250
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 36.8820
Train memory/current_active_mem: 36.8820
Train memory/current_inactive_mem: 0.1744
Train memory/current_reserved_mem: 55.9060
Train memory/peak_allocated_mem: 42.9380
Train memory/peak_active_mem: 42.9380
Train memory/peak_inactive_mem: 7.8742
Train memory/peak_reserved_mem: 55.9060
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0129
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0128
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0129
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0209
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 1.4801
Train loss/train/ce_loss: 1.4716
Train loss/train/lag_loss: 0.0085
Train metrics/train/LanguageCrossEntropy: 1.4716
Train metrics/train/Perplexity: 4.3561
Train metrics/train/cc_LanguageCrossEntropy: 1.1558
Train metrics/train/cc_count: 65
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 7
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
Train metrics/train/stackexchange_count: 3
Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 6
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
Train metrics/train/c4-rp_count: 111
Train throughput/batches_per_sec: 0.0914
Train throughput/samples_per_sec: 0.8223
Train throughput/device/batches_per_sec: 0.0305
Train throughput/device/samples_per_sec: 0.2741
Train throughput/tokens_per_sec: 3368.2385
Train throughput/device/tokens_per_sec: 1122.7462
Train throughput/flops_per_sec: 157886485043818.8125
Train throughput/device/flops_per_sec: 52628828347939.6016
Train throughput/device/mfu: 0.1687
Train time/train: 0.0709
Train time/val: 0.0000
Train time/total: 0.0709
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0688
Train lr-DecoupledAdamW/group2: -0.0688
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
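
(The Timeout(ms)=1800000 above is the 30-minute NCCL collective default. As a diagnostic only, this can be raised where the process group is initialized; in a Composer-based run the trainer does that for you, so the raw PyTorch call below is just a sketch to illustrate the knob, and the 2-hour value is an arbitrary choice. `timeout` is a standard `torch.distributed.init_process_group` argument.)

```python
import datetime
import torch.distributed as dist

# Raise the NCCL collective timeout from the 30-minute default to 2 hours.
# This is a diagnostic stopgap only: a rank that has genuinely exhausted
# its data will still hang the all-gather; the failure just surfaces later.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```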

@Forival

Forival commented Jan 11, 2024


This is because the sample test set contains very little data: after 23 batches one of the domains is exhausted, training stalls on one of the GPUs, and the remaining ranks block in the collective until the NCCL watchdog times out. You need to process the raw RedPajama data to meet the data requirements. A rough estimate of per-domain demand is sketched below.
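
A back-of-the-envelope sketch (not from the repo) estimating how many samples each domain must supply, using only numbers from the log above. It treats the logged domain weights as constant, even though they are adjusted during training, and infers the global batch size of 9 from time/sample 198 ÷ time/batch 22:

```python
# Per-domain sample demand for the full run, from the logged weights.
weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045,
    "stackexchange": 0.02, "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}
total_batches = 3200
batch_size = 198 // 22  # 9 samples per global batch, inferred from the log

for domain, w in weights.items():
    need = w * total_batches * batch_size
    print(f"{domain}: ~{need:.0f} samples needed for the full run")
```

Even the smallest-weight domain (stackexchange, 0.02) needs roughly 576 samples over 3200 batches, far more than a toy sample set provides.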
