[Question]Running the DCN on a single GPU leads to the illegal memory access #419

Open
dusir opened this issue Sep 14, 2023 · 1 comment
Assignees: JacoCheung
Labels: question (Further information is requested), stage::doing

Comments

dusir commented Sep 14, 2023

This template is for generic questions that a user may have in using HugeCTR.

Note: Before filing an issue, you may want to check out our compiled Q&A list first.

Environment:
Virtual machine;
Training on a single GPU;
Training code runs inside a container.

Key component versions:
kernel:5.4.119-19.0009.28
hugectr:23.06
Driver Version: 535.54.03 CUDA Version: 12.2
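
For context, below is a minimal sketch of what a single-GPU DCN script of this kind can look like, loosely following the public HugeCTR DCN Parquet sample. It is not the actual dcn_init.py: all paths, batch sizes, slot sizes and layer widths are placeholders, and the embedding training cache / HMEM parameter server setup implied by the log below is omitted.

```python
# Minimal single-GPU DCN sketch, loosely based on the public HugeCTR DCN Parquet sample.
# All paths, sizes and hyperparameters below are placeholders, not the values from dcn_init.py.
import hugectr
from mpi4py import MPI  # noqa: F401 -- initializes MPI, matching the MpiInitService log line

solver = hugectr.CreateSolver(max_eval_batches=300,
                              batchsize_eval=16384,
                              batchsize=16384,
                              lr=0.001,
                              vvgpu=[[0]],           # single GPU
                              repeat_dataset=True)
reader = hugectr.DataReaderParams(data_reader_type=hugectr.DataReaderType_t.Parquet,
                                  source=["./train/_file_list.txt"],
                                  eval_source="./val/_file_list.txt",
                                  check_type=hugectr.Check_t.Non,
                                  slot_size_array=[10000] * 26)  # placeholder cardinalities
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam,
                                    update_type=hugectr.Update_t.Global)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim=1, label_name="label",
                        dense_dim=13, dense_name="dense",
                        data_reader_sparse_param_array=[
                            hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=300,
    embedding_vec_size=16,
    combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=optimizer))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Reshape,
                             bottom_names=["sparse_embedding1"], top_names=["reshape1"],
                             leading_dim=416))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                             bottom_names=["reshape1", "dense"], top_names=["concat1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.MultiCross,
                             bottom_names=["concat1"], top_names=["multicross1"],
                             num_layers=6))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                             bottom_names=["concat1"], top_names=["fc1"],
                             num_output=1024))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU,
                             bottom_names=["fc1"], top_names=["relu1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                             bottom_names=["relu1", "multicross1"], top_names=["concat2"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                             bottom_names=["concat2"], top_names=["fc2"],
                             num_output=1))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
                             bottom_names=["fc2", "label"], top_names=["loss"]))
model.compile()   # the "Starting AUC NCCL warm-up" step and the crash happen during compile
model.summary()
model.fit(max_iter=2000, display=200, eval_interval=1000)
```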

Symptom:
root@5583dc65ca3a:/home/workspace/gq# python dcn_init.py
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][13:48:01.955][INFO][RK0][main]: Empty embedding, trained table will be stored in /root/dcn_test/
HugeCTR Version: 23.6
====================================================Model Init=====================================================
[HCTR][13:48:01.956][INFO][RK0][main]: Initialize model: dcn_test
[HCTR][13:48:01.956][INFO][RK0][main]: Global seed is 3137833461
[HCTR][13:48:02.048][INFO][RK0][main]: Device to NUMA mapping:
GPU 2 -> node 1
[HCTR][13:48:02.843][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:48:02.843][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 78.2526
[HCTR][13:48:02.843][INFO][RK0][main]: Start all2all warmup
[HCTR][13:48:02.845][INFO][RK0][main]: End all2all warmup
[HCTR][13:48:03.363][INFO][RK0][main]: Using All-reduce algorithm: NCCL
set_mempolicy: Operation not permitted
[HCTR][13:48:03.374][INFO][RK0][main]: Device 2: NVIDIA H800
[HCTR][13:48:03.386][INFO][RK0][main]: eval source /root/keyset_dir/eval.txt max_row_group_size 2565543
[HCTR][13:48:03.397][INFO][RK0][main]: train source /root/keyset_dir/0.txt max_row_group_size 2565543
[HCTR][13:48:03.408][INFO][RK0][main]: train source /root/keyset_dir/1.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: train source /root/keyset_dir/2.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for eval: 1
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
[HCTR][13:48:03.454][INFO][RK0][main]: max_vocabulary_size_per_gpu_=145817600
[HCTR][13:48:03.496][DEBUG][RK0][main]: [device 2] allocating 27.4671 GB, available 35.8756
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus
gpu sync_all_gpus resource_manager:12
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus gpu count 1
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus local gpu 0x3220dd0
[HCTR][13:48:03.496][INFO][RK0][main]: set device start... 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,device id is 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done.
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,stream ptr: 0x4926810
[HCTR][13:48:03.501][INFO][RK0][main]: synchronize done.
[HCTR][13:48:03.502][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][13:48:03.502][INFO][RK0][main]: Add Slice layer for tensor: reshape1, creating 2 copies
[HCTR][13:48:03.502][WARNING][RK0][main]: using multi-cross v1
[HCTR][13:48:03.507][WARNING][RK0][main]: using multi-cross v1
===================================================Model Compile===================================================
[HCTR][13:49:03.167][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][13:49:03.187][INFO][RK0][main]: gpu0 init embedding done
[HCTR][13:49:03.187][INFO][RK0][main]: Enable HMEM-Based Parameter Server
[HCTR][13:49:03.187][INFO][RK0][main]: /root/dcn_test/ not exist, create and train from scratch
[HCTR][13:49:15.625][DEBUG][RK0][main]: [device 2] allocating 1.0864 GB, available 26.7877
[HCTR][13:49:15.629][INFO][RK0][main]: Starting AUC NCCL warm-up
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(stream) at run_finalize_step (/home/HugeCTR/HugeCTR/src/metrics.cu:1814)
[5583dc65ca3a:00137] *** Process received signal ***
[5583dc65ca3a:00137] Signal: Aborted (6)
[5583dc65ca3a:00137] Signal code: (-6)
[5583dc65ca3a:00137] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc6ff3f6090]
[5583dc65ca3a:00137] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc6ff3f600b]
[5583dc65ca3a:00137] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc6ff3d5859]
[5583dc65ca3a:00137] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc6f9b53911]
[5583dc65ca3a:00137] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc6f9b5f38c]
[5583dc65ca3a:00137] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7fc6f9b5e369]
[5583dc65ca3a:00137] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fc6f9b5ed21]
[5583dc65ca3a:00137] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7fc6f9aaabef]
[5583dc65ca3a:00137] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fc6f9aab5aa]
[5583dc65ca3a:00137] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE23finalize_metric_per_gpuEi+0x397)[0x7fc6fa947ad7]
[5583dc65ca3a:00137] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xbd4b12)[0x7fc6fa947b12]
[5583dc65ca3a:00137] [11] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7fc6938698e6]
[5583dc65ca3a:00137] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE15finalize_metricEv+0x9b)[0x7fc6fa8faf4b]
[5583dc65ca3a:00137] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE7warm_upEm+0xb6)[0x7fc6fa8ff1b6]
[5583dc65ca3a:00137] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfEC2EiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x74a)[0x7fc6fa9326da]
[5583dc65ca3a:00137] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics6Metric6CreateENS0_4TypeEbiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x1b1)[0x7fc6fa8fabc1]
[5583dc65ca3a:00137] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model14create_metricsEv+0xc5)[0x7fc6faa513d5]
[5583dc65ca3a:00137] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model7compileEv+0x286)[0x7fc6faa55b36]
[5583dc65ca3a:00137] [18] /usr/local/hugectr/lib/hugectr.so(+0xce702)[0x7fc6fed3a702]
[5583dc65ca3a:00137] [19] /usr/local/hugectr/lib/hugectr.so(+0xd0f94)[0x7fc6fed3cf94]
[5583dc65ca3a:00137] [20] python(PyCFunction_Call+0x59)[0x5f6939]
[5583dc65ca3a:00137] [21] python(_PyObject_MakeTpCall+0x296)[0x5f7506]
[5583dc65ca3a:00137] [22] python[0x50b8d3]
[5583dc65ca3a:00137] [23] python(_PyEval_EvalFrameDefault+0x5796)[0x570556]
[5583dc65ca3a:00137] [24] python(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
[5583dc65ca3a:00137] [25] python(PyEval_EvalCode+0x27)[0x68e547]
[5583dc65ca3a:00137] [26] python[0x67dbf1]
[5583dc65ca3a:00137] [27] python[0x67dc6f]
[5583dc65ca3a:00137] [28] python[0x67dd11]
[5583dc65ca3a:00137] [29] python(PyRun_SimpleFileExFlags+0x197)[0x67fe37]
[5583dc65ca3a:00137] *** End of error message ***
Aborted (core dumped)

Could this memory error reported by the metrics code be a bug?

@dusir added the question (Further information is requested) label on Sep 14, 2023
@minseokl changed the title [Question] Running HugeCTR DCN on a single H800 GPU hits an illegal memory access → [Question]Running the DCN on a single H800 leads to the illegal memory access on Sep 18, 2023
@minseokl changed the title [Question]Running the DCN on a single H800 leads to the illegal memory access → [Question]Running the DCN on a single GPU leads to the illegal memory access on Sep 18, 2023
@minseokl (Collaborator) commented:

Hi @dusir, we can reproduce this issue on the A100 as well. It is unrelated to the H800; it is a bug in our AUC implementation that is triggered by a specific batch size. We are working on a fix. Thanks!
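
Until the fix is released, one thing worth trying (an assumption based on the comment above, not a confirmed workaround) is to change the batch sizes passed to `CreateSolver` so that the AUC warm-up no longer hits the problematic size, e.g.:

```python
# Hypothetical workaround sketch: vary the batch size, since the AUC warm-up bug
# reportedly depends on a specific batch size. The values below are placeholders
# and are not verified to avoid the issue.
solver = hugectr.CreateSolver(max_eval_batches=300,
                              batchsize_eval=8192,   # try a different eval batch size
                              batchsize=8192,        # try a different train batch size
                              lr=0.001,
                              vvgpu=[[0]],
                              repeat_dataset=True)
```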

@JacoCheung self-assigned this on Sep 18, 2023