
Memory leak during DINO training. #322

Open
lolikonloli opened this issue Dec 6, 2023 · 1 comment

Comments

@lolikonloli

Device info


sys.platform linux
Python 3.10.0 (default, Mar 3 2022, 09:58:08) [GCC 7.5.0]
numpy 1.22.4
detectron2 0.6 @/home/lolikonloli/code/detection/package/detrex/detectron2/detectron2
Compiler GCC 11.4
CUDA compiler CUDA 11.8
detectron2 arch flags 7.5
DETECTRON2_ENV_MODULE
PyTorch 2.0.1+cu118 @/home/lolikonloli/anaconda3/envs/pl_det/lib/python3.10/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1 NVIDIA GeForce RTX 2080 Ti (arch=7.5)
Driver version 535.104.05
CUDA_HOME /usr/local/cuda-11.8
Pillow 9.3.0
torchvision 0.15.2+cu118 @/home/lolikonloli/anaconda3/envs/pl_det/lib/python3.10/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20221221
iopath 0.1.9
cv2 4.8.0

PyTorch built with:

  • GCC 9.3
  • C++ Version: 201703
  • Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  • Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
  • OpenMP 201511 (a.k.a. OpenMP 4.5)
  • LAPACK is enabled (usually provided by MKL)
  • NNPACK is enabled
  • CPU capability usage: AVX2
  • CUDA Runtime 11.8
  • NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
  • CuDNN 8.7
  • Magma 2.6.1
  • Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF

Description

Memory continuously increases during DINO training with two RTX 2080 Ti GPUs until the process is killed by the system.

@rentainhe
Collaborator

Hello, this is expected. Because of multi-scale training and the denoising queries, the model's memory usage is not very stable, and it can take more than 12 GB on a 2080 Ti. You can try fp16 training or lower the total_batch_size to work around this, or add activation checkpointing to reduce the memory usage of the whole model (see the sketch below).
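For reference, these options can be set through the LazyConfig system that detrex inherits from detectron2. The sketch below is a minimal, unofficial example: the config path and the exact key names (`train.amp.enabled`, `dataloader.train.total_batch_size`) are assumptions based on the standard detectron2/detrex config layout, so verify them against your own config file.

```python
# A minimal sketch, not an official detrex recipe: memory-saving overrides for DINO training.
# Assumptions: the config path below matches your detrex checkout, and the keys
# train.amp.enabled / dataloader.train.total_batch_size follow the detectron2 LazyConfig
# layout that detrex builds on.
from detectron2.config import LazyConfig

cfg = LazyConfig.load("projects/dino/configs/dino_r50_4scale_12ep.py")  # assumed path

# 1) fp16 training via automatic mixed precision.
cfg.train.amp.enabled = True

# 2) Smaller total batch size, split across the two 2080 Ti GPUs.
cfg.dataloader.train.total_batch_size = 2

# 3) Activation (gradient) checkpointing trades extra compute for lower memory. Where the
#    switch lives depends on the model code (e.g. a use_checkpoint-style flag on the
#    backbone/transformer, or wrapping blocks with torch.utils.checkpoint), so check what
#    your config exposes before relying on it.
```

The same overrides can usually also be passed as `key=value` arguments on the training script's command line (e.g. `train.amp.enabled=True dataloader.train.total_batch_size=2`) instead of editing the config file.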
