multi-machine, multi-gpu sok core dump #838
Comments
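This is an automatic vacation reply from QQ Mail.
Hello, I am currently on vacation and cannot reply to your email in person. I will reply as soon as possible once my vacation ends.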
System information
Describe the current behavior
[1,2]:[n193-019-222:14623] [ 1] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(JVM_handle_linux_signal+0xb6)[0x7fb6a01cf826]
[1,2]:[n193-019-222:14623] [ 2] /opt/tiger/jdk/jdk1.8/jre/lib/amd64/server/libjvm.so(+0x921e13)[0x7fb6a01c5e13]
[1,2]:[n193-019-222:14623] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fb789fe2090]
[1,2]:[n193-019-222:14623] [ 4] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZNSt8__detail9_Map_baseIN4core6DeviceESt4pairIKS2_St10shared_ptrINS1_12IStorageImplEEESaIS8_ENS_10_Select1stESt8equal_toIS2_ESt4hashIS2_ENS_18_Mod_range_hashingENS_20_Default_ranged_hashENS_20_Prime_rehash_policyENS_17_Hashtable_traitsILb1ELb0ELb1EEELb1EEixERS4_+0x173)[0x7fb5a5025e43]
[1,2]:[n193-019-222:14623] [ 5] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libcore.so(_ZN4core10BufferImpl7reserveERKNS_5ShapeENS_6DeviceENS_8DataTypeEm+0x313)[0x7fb5a5025143]
[1,2]:[n193-019-222:14623] [ 6] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libembedding.so(_ZN9embedding33UniformModelParallelEmbeddingMetaC1ESt10shared_ptrIN4core19CoreResourceManagerEERKNS_24EmbeddingCollectionParamEm+0x2559)[0x7fb5a3627879]
[1,2]:[n193-019-222:14623] [ 7] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow23EmbeddingCollectionBaseIxxfE11update_metaESt10shared_ptrIN4core19CoreResourceManagerEEiRSt6vectorIiSaIiEE+0x131)[0x7fb5a30162e1]
[1,2]:[n193-019-222:14623] [ 8] /usr/lib/python3.8/site-packages/merlin_sok-1.1.4-py3.8-linux-x86_64.egg/sparse_operation_kit/lib/libsok_experiment.so(_ZN10tensorflow30LookupForwardEmbeddingVarGPUOpIxxfE7ComputeEPNS_15OpKernelContextE+0x891)[0x7fb5a303d9f1]
[1,2]:[n193-019-222:14623] [ 9] /usr/local/lib/python3.8/dist-packages/tensorflow_core/python/../libtensorflow_framework.so.1(_ZN10tensorflow13BaseGPUDevice7ComputeEPNS_8OpKernelEPNS_15OpKernelContextE+0xdc)[0x7fb6a1fa3bbc]
[1,2]:[n193-019-222:14623] [10] [n193-019-222:14623] [ 0] [1,4]:[n193-019-222:14625] *** Process received signal ***
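For anyone triaging the trace above, the frames can be pulled apart and the C++ symbols demangled with a few lines of Python. This is only a sketch: the helper names `parse_frames` and `demangle` are made up for illustration, the regular expression is written against the frame layout seen in this log, and demangling assumes the binutils `c++filt` tool is on `PATH`.

```python
import re
import shutil
import subprocess

# Matches frames such as "[ 5] /path/lib.so(_ZSymbol+0x313)[0x7fb5a5025143]".
# The mpirun tag "[1,2]:" and the "[host:pid]" prefix are skipped because
# neither is a bare frame number followed by an absolute path.
FRAME_RE = re.compile(
    r"\[\s*(?P<frame>\d+)\]\s+(?P<lib>/[^\s(]+)"
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]"
)


def parse_frames(log_text):
    """Return one dict per backtrace frame found in log_text."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frame = match.groupdict()
            frame["frame"] = int(frame["frame"])
            frames.append(frame)
    return frames


def demangle(symbol):
    """Demangle an Itanium-ABI symbol with c++filt; fall back to the raw name."""
    if not symbol or shutil.which("c++filt") is None:
        return symbol
    result = subprocess.run(
        ["c++filt", symbol], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or symbol


if __name__ == "__main__":
    import sys

    # Usage: python parse_trace.py < crash.log
    for f in parse_frames(sys.stdin.read()):
        print(f["frame"], f["lib"].rsplit("/", 1)[-1], demangle(f["symbol"]))
```

Frames 4 to 8 in this trace are in `libcore.so`, `libembedding.so` and `libsok_experiment.so`, so the demangled names of those frames are the most useful ones to include in a follow-up comment.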
Describe the expected behavior
Code to reproduce the issue
modelzoo/deepfm, with no code modifications:
mpirun -np 16 --map-by ppr:4:socket -bind-to socket --hostfile ./hostfile --allow-run-as-root --tag-output --report-bindings --mca pml ob1 --mca btl ^openib --mca btl_tcp_if_exclude lo,docker0,bond0 --wdir /home/tiger/deeprec -x NCCL_IB_DISABLE=0 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_HCA=mlx5 -x NCCL_DEBUG=INFO -x NCCL_IB_TIMEOUT=25 -x NCCL_IB_RETRY_CNT=7 -x NCCL_SOCKET_IFNAME=eth0 -x HOROVOD_MPI_THREADS_DISABLE=0 -x TF_GPU_CUPTI_FORCE_CONCURRENT_KERNEL=1 -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES -x NV_LIBCUBLAS_DEV_PACKAGE_NAME -x HTTPS_PROXY -x TOTAL_ORACLES -x NV_LIBCUBLAS_PACKAGE -x GLOG_log_dir -x NV_LIBNCCL_DEV_PACKAGE_VERSION -x YARN_APP_ID -x NM_LABEL -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_POD -x OOM_LISTEN_MODE -x SEC_TOKEN_PATH -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_PORT -x NVIDIA_PRODUCT_NAME -x PRIMUS_AM_RPC_PORT -x NV_LIBCUSPARSE_DEV_VERSION -x NUM_OF_PRIMUS_worker -x YARN_CONTAINER_RUNTIME_DOCKER_IMAGE -x NV_CUDNN_VERSION -x NV_LIBNPP_DEV_VERSION -x CUDA_VERSION -x PATH -x HTTP_PROXY -x NV_LIBNPP_DEV_PACKAGE -x API_SERVER_PORT -x NV_CUDNN_PACKAGE_NAME -x PRIMUS_ROLE_CATEGORY -x YARN_CLASS_ID -x LIBHDFS_OPTS -x ENV_DOCKER_CONTAINER_SECURITY_OPTION -x NV_LIBNCCL_DEV_PACKAGE_NAME -x ENABLE_OOM_LISTENER -x NM_PORT -x API_SERVER_HOST -x NCCL_VERSION -x NM_HTTP_PORT -x NV_LIBNCCL_PACKAGE_VERSION -x YARN_APP_PRIORITY -x YARN_APP_TYPE -x START_STATISTIC_STEP -x NVIDIA_DRIVER_CAPABILITIES -x TZ -x SHUFFLE_DISK_MANAGER_PORT -x YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS -x NM_AUX_SERVICE_mapreduce_shuffle -x SEC_KV_AUTH -x TF_SCRIPT -x CLASSPATH -x LOCAL_DIRS -x HADOOP_YARN_HOME -x NV_LIBCUBLAS_DEV_VERSION -x HADOOP_CONF_DIR -x NO_PROXY -x LIBRARY_PATH -x NV_LIBNPP_PACKAGE -x PRIMUS_EXECUTOR_UNIQUE_ID -x PRIMUS_AM_RPC_HOST -x NV_NVPROF_DEV_PACKAGE -x NV_NVML_DEV_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_MEMORY_MB -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_TPU_V3_BASE -x NV_CUDA_LIB_VERSION -x RUNTIME_IDC_NAME -x TF_CONFIG -x YARN_APP_TAGS
-x NV_LIBCUBLAS_DEV_PACKAGE -x LC_CTYPE -x NVARCH -x NV_CUDA_CUDART_DEV_VERSION -x NLSPATH -x ENV_DOCKER_CONTAINER_SHM_SIZE -x SHLVL -x TF_WORKSPACE -x JEMALLOC_PATH -x XFILESEARCHPATH -x SPARK_3_SHUFFLE_SERVICE_PORT -x NV_LIBCUBLAS_PACKAGE_NAME -x NM_HOST -x PRIMUS_SUBMIT_TIMESTAMP -x STOP_STATISTIC_STEP -x PYTHONPATH -x NV_LIBNCCL_PACKAGE_NAME -x YARN_QUEUE_ID -x ENV_DOCKER_CONTAINER_DEVICE -x ROLES_LIST -x YARN_USER -x LOAD_SERVICE_PSM -x YARN_CONTAINER_RESOURCE_PREFIX_YARN_IO_GPU -x PRIMUS_EXECUTOR_UNIQID -x NV_NVPROF_VERSION -x JAVA_HOME -x NVIDIA_REQUIRE_CUDA -x YARN_CONTAINER_RUNTIME_TYPE -x SPARK_SHUFFLE_SERVICE_PORT -x ENV_DOCKER_CONTAINER_CAP_ADD -x MALLOC_ARENA_MAX -x SSD_MANAGER_PORT -x YARN_QUEUE_NAME -x NV_NVTX_VERSION -x YODEL_MODE -x NV_CUDA_CUDART_VERSION -x BYTED_HOST_IPV6 -x NV_CUDA_COMPAT_PACKAGE -x LD_LIBRARY_PATH -x HADOOP_TOKEN_FILE_LOCATION -x LOG_DIRS -x APPLICATION_ID -x HOME -x NV_LIBCUSPARSE_VERSION -x HADOOP_COMMON_HOME -x HADOOP_HDFS_HOME -x OLDPWD -x NV_LIBNCCL_PACKAGE -x MEM_USAGE_STRATEGY -x PWD -x NV_LIBCUBLAS_VERSION -x ENV_DOCKER_CONTAINER_ULIMIT -x LOGNAME -x NV_CUDNN_PACKAGE -x PRIMUS_STAGING_DIR -x NV_LIBNCCL_DEV_PACKAGE -x NVIDIA_VISIBLE_DEVICES -x NV_LIBNPP_VERSION -x YARN_CONTAINER_RESOURCE_PREFIX_VCORES_MILLI -x HADOOP_HOME -x CORE_DUMP_PROC_NAME -x NV_CUDNN_PACKAGE_DEV -x USER python3 train.py --output_dir=hdfs://harunava/user/xxx/deeprec_v10 --data_location=hdfs://harunava/user/xxx/criteo_small --protocol=grpc --smartstaged=false --batch_size=2048 --steps=30000 --ev=true --ev_elimination=l2 --ev_filter=counter --op_fusion=true --input_layer_partitioner=0 --dense_layer_partitioner=16 --group_embedding=collective --workqueue=true --parquet_dataset=false
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.