[Question]Running the DCN on a single GPU leads to the illegal memory access #419

Open
dusir opened this issue Sep 14, 2023 · 1 comment
Assignees: JacoCheung
Labels: question (Further information is requested), stage::doing

Comments

dusir commented Sep 14, 2023

This template is for generic questions that a user may have in using HugeCTR.

Note: Before filing an issue, you may want to check out our compiled Q&A list first.

Environment:
Virtual machine;
Training on a single GPU;
Training code runs inside a container.

Key component versions:
kernel:5.4.119-19.0009.28
hugectr:23.06
Driver Version: 535.54.03 CUDA Version: 12.2
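
For context, below is a minimal sketch of what a single-GPU DCN script of this kind can look like, loosely following the public HugeCTR DCN Parquet sample. It is not the actual dcn_init.py: all paths, batch sizes, slot sizes and layer widths are placeholders, and the embedding training cache / HMEM parameter server setup implied by the log below is omitted.

```python
# Minimal single-GPU DCN sketch, loosely based on the public HugeCTR DCN Parquet sample.
# All paths, sizes and hyperparameters below are placeholders, not the values from dcn_init.py.
import hugectr
from mpi4py import MPI  # noqa: F401 -- initializes MPI, matching the MpiInitService log line

solver = hugectr.CreateSolver(max_eval_batches=300,
                              batchsize_eval=16384,
                              batchsize=16384,
                              lr=0.001,
                              vvgpu=[[0]],           # single GPU
                              repeat_dataset=True)
reader = hugectr.DataReaderParams(data_reader_type=hugectr.DataReaderType_t.Parquet,
                                  source=["./train/_file_list.txt"],
                                  eval_source="./val/_file_list.txt",
                                  check_type=hugectr.Check_t.Non,
                                  slot_size_array=[10000] * 26)  # placeholder cardinalities
optimizer = hugectr.CreateOptimizer(optimizer_type=hugectr.Optimizer_t.Adam,
                                    update_type=hugectr.Update_t.Global)
model = hugectr.Model(solver, reader, optimizer)
model.add(hugectr.Input(label_dim=1, label_name="label",
                        dense_dim=13, dense_name="dense",
                        data_reader_sparse_param_array=[
                            hugectr.DataReaderSparseParam("data1", 1, True, 26)]))
model.add(hugectr.SparseEmbedding(
    embedding_type=hugectr.Embedding_t.DistributedSlotSparseEmbeddingHash,
    workspace_size_per_gpu_in_mb=300,
    embedding_vec_size=16,
    combiner="sum",
    sparse_embedding_name="sparse_embedding1",
    bottom_name="data1",
    optimizer=optimizer))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Reshape,
                             bottom_names=["sparse_embedding1"], top_names=["reshape1"],
                             leading_dim=416))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                             bottom_names=["reshape1", "dense"], top_names=["concat1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.MultiCross,
                             bottom_names=["concat1"], top_names=["multicross1"],
                             num_layers=6))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                             bottom_names=["concat1"], top_names=["fc1"],
                             num_output=1024))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.ReLU,
                             bottom_names=["fc1"], top_names=["relu1"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.Concat,
                             bottom_names=["relu1", "multicross1"], top_names=["concat2"]))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.InnerProduct,
                             bottom_names=["concat2"], top_names=["fc2"],
                             num_output=1))
model.add(hugectr.DenseLayer(layer_type=hugectr.Layer_t.BinaryCrossEntropyLoss,
                             bottom_names=["fc2", "label"], top_names=["loss"]))
model.compile()   # the "Starting AUC NCCL warm-up" step and the crash happen during compile
model.summary()
model.fit(max_iter=2000, display=200, eval_interval=1000)
```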

Symptom:
root@5583dc65ca3a:/home/workspace/gq# python dcn_init.py
MpiInitService: MPI was already initialized by another (non-HugeCTR) mechanism.
[HCTR][13:48:01.955][INFO][RK0][main]: Empty embedding, trained table will be stored in /root/dcn_test/
HugeCTR Version: 23.6
====================================================Model Init=====================================================
[HCTR][13:48:01.956][INFO][RK0][main]: Initialize model: dcn_test
[HCTR][13:48:01.956][INFO][RK0][main]: Global seed is 3137833461
[HCTR][13:48:02.048][INFO][RK0][main]: Device to NUMA mapping:
GPU 2 -> node 1
[HCTR][13:48:02.843][WARNING][RK0][main]: Peer-to-peer access cannot be fully enabled.
[HCTR][13:48:02.843][DEBUG][RK0][main]: [device 2] allocating 0.0000 GB, available 78.2526
[HCTR][13:48:02.843][INFO][RK0][main]: Start all2all warmup
[HCTR][13:48:02.845][INFO][RK0][main]: End all2all warmup
[HCTR][13:48:03.363][INFO][RK0][main]: Using All-reduce algorithm: NCCL
set_mempolicy: Operation not permitted
[HCTR][13:48:03.374][INFO][RK0][main]: Device 2: NVIDIA H800
[HCTR][13:48:03.386][INFO][RK0][main]: eval source /root/keyset_dir/eval.txt max_row_group_size 2565543
[HCTR][13:48:03.397][INFO][RK0][main]: train source /root/keyset_dir/0.txt max_row_group_size 2565543
[HCTR][13:48:03.408][INFO][RK0][main]: train source /root/keyset_dir/1.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: train source /root/keyset_dir/2.txt max_row_group_size 2565543
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for train: 1
[HCTR][13:48:03.420][INFO][RK0][main]: num of DataReader workers for eval: 1
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
set_mempolicy: Operation not permitted
[HCTR][13:48:03.454][INFO][RK0][main]: max_vocabulary_size_per_gpu_=145817600
[HCTR][13:48:03.496][DEBUG][RK0][main]: [device 2] allocating 27.4671 GB, available 35.8756
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus
gpu sync_all_gpus resource_manager:12
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus gpu count 1
[HCTR][13:48:03.496][INFO][RK0][main]: gpu sync_all_gpus local gpu 0x3220dd0
[HCTR][13:48:03.496][INFO][RK0][main]: set device start... 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,device id is 2
[HCTR][13:48:03.496][INFO][RK0][main]: set device done.
[HCTR][13:48:03.496][INFO][RK0][main]: set device done,stream ptr: 0x4926810
[HCTR][13:48:03.501][INFO][RK0][main]: synchronize done.
[HCTR][13:48:03.502][INFO][RK0][main]: Graph analysis to resolve tensor dependency
[HCTR][13:48:03.502][INFO][RK0][main]: Add Slice layer for tensor: reshape1, creating 2 copies
[HCTR][13:48:03.502][WARNING][RK0][main]: using multi-cross v1
[HCTR][13:48:03.507][WARNING][RK0][main]: using multi-cross v1
===================================================Model Compile===================================================
[HCTR][13:49:03.167][INFO][RK0][main]: gpu0 start to init embedding
[HCTR][13:49:03.187][INFO][RK0][main]: gpu0 init embedding done
[HCTR][13:49:03.187][INFO][RK0][main]: Enable HMEM-Based Parameter Server
[HCTR][13:49:03.187][INFO][RK0][main]: /root/dcn_test/ not exist, create and train from scratch
[HCTR][13:49:15.625][DEBUG][RK0][main]: [device 2] allocating 1.0864 GB, available 26.7877
[HCTR][13:49:15.629][INFO][RK0][main]: Starting AUC NCCL warm-up
terminate called after throwing an instance of 'HugeCTR::core23::RuntimeError'
what(): Runtime error: an illegal memory access was encountered
cudaStreamSynchronize(stream) at run_finalize_step (/home/HugeCTR/HugeCTR/src/metrics.cu:1814)
[5583dc65ca3a:00137] *** Process received signal ***
[5583dc65ca3a:00137] Signal: Aborted (6)
[5583dc65ca3a:00137] Signal code: (-6)
[5583dc65ca3a:00137] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fc6ff3f6090]
[5583dc65ca3a:00137] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fc6ff3f600b]
[5583dc65ca3a:00137] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fc6ff3d5859]
[5583dc65ca3a:00137] [ 3] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x9e911)[0x7fc6f9b53911]
[5583dc65ca3a:00137] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xaa38c)[0x7fc6f9b5f38c]
[5583dc65ca3a:00137] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa9369)[0x7fc6f9b5e369]
[5583dc65ca3a:00137] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x2a1)[0x7fc6f9b5ed21]
[5583dc65ca3a:00137] [ 7] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x10bef)[0x7fc6f9aaabef]
[5583dc65ca3a:00137] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12a)[0x7fc6f9aab5aa]
[5583dc65ca3a:00137] [ 9] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE23finalize_metric_per_gpuEi+0x397)[0x7fc6fa947ad7]
[5583dc65ca3a:00137] [10] /usr/local/hugectr/lib/libhuge_ctr_shared.so(+0xbd4b12)[0x7fc6fa947b12]
[5583dc65ca3a:00137] [11] /usr/lib/x86_64-linux-gnu/libgomp.so.1(GOMP_parallel+0x46)[0x7fc6938698e6]
[5583dc65ca3a:00137] [12] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE15finalize_metricEv+0x9b)[0x7fc6fa8faf4b]
[5583dc65ca3a:00137] [13] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfE7warm_upEm+0xb6)[0x7fc6fa8ff1b6]
[5583dc65ca3a:00137] [14] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics3AUCIfEC2EiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x74a)[0x7fc6fa9326da]
[5583dc65ca3a:00137] [15] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR7metrics6Metric6CreateENS0_4TypeEbiiiRKSt10shared_ptrINS_15ResourceManagerEEb+0x1b1)[0x7fc6fa8fabc1]
[5583dc65ca3a:00137] [16] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model14create_metricsEv+0xc5)[0x7fc6faa513d5]
[5583dc65ca3a:00137] [17] /usr/local/hugectr/lib/libhuge_ctr_shared.so(_ZN7HugeCTR5Model7compileEv+0x286)[0x7fc6faa55b36]
[5583dc65ca3a:00137] [18] /usr/local/hugectr/lib/hugectr.so(+0xce702)[0x7fc6fed3a702]
[5583dc65ca3a:00137] [19] /usr/local/hugectr/lib/hugectr.so(+0xd0f94)[0x7fc6fed3cf94]
[5583dc65ca3a:00137] [20] python(PyCFunction_Call+0x59)[0x5f6939]
[5583dc65ca3a:00137] [21] python(_PyObject_MakeTpCall+0x296)[0x5f7506]
[5583dc65ca3a:00137] [22] python[0x50b8d3]
[5583dc65ca3a:00137] [23] python(_PyEval_EvalFrameDefault+0x5796)[0x570556]
[5583dc65ca3a:00137] [24] python(_PyEval_EvalCodeWithName+0x26a)[0x5697da]
[5583dc65ca3a:00137] [25] python(PyEval_EvalCode+0x27)[0x68e547]
[5583dc65ca3a:00137] [26] python[0x67dbf1]
[5583dc65ca3a:00137] [27] python[0x67dc6f]
[5583dc65ca3a:00137] [28] python[0x67dd11]
[5583dc65ca3a:00137] [29] python(PyRun_SimpleFileExFlags+0x197)[0x67fe37]
[5583dc65ca3a:00137] *** End of error message ***
Aborted (core dumped)

Could this memory error reported by the metrics code be a bug?

@dusir added the question (Further information is requested) label on Sep 14, 2023
@minseokl changed the title [Question] Running HugeCTR DCN on a single H800 GPU hits an illegal memory access → [Question]Running the DCN on a single H800 leads to the illegal memory access on Sep 18, 2023
@minseokl changed the title [Question]Running the DCN on a single H800 leads to the illegal memory access → [Question]Running the DCN on a single GPU leads to the illegal memory access on Sep 18, 2023
@minseokl (Collaborator) commented:

Hi @dusir, we can reproduce this issue on the A100 as well. It is unrelated to the H800; it is a bug in our AUC implementation that is triggered by a specific batch size. We are working on a fix. Thanks!
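
Until the fix is released, one thing worth trying (an assumption based on the comment above, not a confirmed workaround) is to change the batch sizes passed to `CreateSolver` so that the AUC warm-up no longer hits the problematic size, e.g.:

```python
# Hypothetical workaround sketch: vary the batch size, since the AUC warm-up bug
# reportedly depends on a specific batch size. The values below are placeholders
# and are not verified to avoid the issue.
solver = hugectr.CreateSolver(max_eval_batches=300,
                              batchsize_eval=8192,   # try a different eval batch size
                              batchsize=8192,        # try a different train batch size
                              lr=0.001,
                              vvgpu=[[0]],
                              repeat_dataset=True)
```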

@JacoCheung self-assigned this on Sep 18, 2023