GPU memory does not seem to be freed between iterations or rounds in the benchmark #215

Open
howin98 opened this issue Mar 29, 2022 · 1 comment


howin98 commented Mar 29, 2022

Whether I use:

def test_alexnet_batch_size1(benchmark):
    benchmark.pedantic(run_alexnet_batch_size1, rounds=50)

or

def test_alexnet_batch_size1(benchmark):
    benchmark.pedantic(run_alexnet_batch_size1, iterations=50)

the output is:
platform linux -- Python 3.7.10, pytest-7.1.0, pluggy-1.0.0 -- /opt/conda/bin/python3
benchmark: 3.4.1 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision
plugins: benchmark-3.4.1, forked-1.4.0, xdist-2.5.0
collecting ... collected 5 items

flow_vision/benchmark/test_alexnet.py::test_alexnet_batch_size16 loaded library: /usr/lib/x86_64-linux-gnu/libibverbs.so.1
W20220329 05:37:29.127727 139 cuda_allocator.cpp:282] OOM error is detected, process will exit. And it will start to reset CUDA device for releasing device memory.
F20220329 05:37:30.156129 139 cuda_allocator.cpp:285] Error! : Out of memory when allocate size : 150994944.
The total_memory_bytes allocated by this CudaAllocator is : 4907335680
*** Check failure stack trace: ***
@ 0x7fdf4caaa2ea (unknown)
@ 0x7fdf4caaa5d2 (unknown)
@ 0x7fdf4caa9e57 (unknown)
@ 0x7fdf4caac9c9 (unknown)
@ 0x7fdf46ed1c3a oneflow::vm::CudaAllocator::Allocate()
@ 0x7fdf46edfb2d oneflow::vm::ThreadSafeAllocator::Allocate()
@ 0x7fdf44875120 oneflow::vm::EagerBlobObject::TryAllocateBlobBodyMemory()
@ 0x7fdf4487cd5f oneflow::vm::LocalCallOpKernelUtil::AllocateOutputBlobsMemory()
@ 0x7fdf4487d6bf oneflow::vm::LocalCallOpKernelUtil::Compute()
@ 0x7fdf4487c68b oneflow::vm::LocalCallOpKernelInstructionType::ComputeInFuseMode()
@ 0x7fdf46ed7ce6 oneflow::vm::FuseInstructionType<>::Compute()
@ 0x7fdf46ed680a oneflow::vm::CudaStreamType::Compute()
@ 0x7fdf46eebd94 oneflow::vm::VirtualMachineEngine::DispatchInstruction()
@ 0x7fdf46eec94f oneflow::vm::VirtualMachineEngine::DispatchAndPrescheduleInstructions()
@ 0x7fdf46ef1d18 oneflow::vm::VirtualMachineEngine::Schedule()
@ 0x7fdf46ee2a10 oneflow::VirtualMachine::ScheduleLoop()
@ 0x7fdf4f4bd82f (unknown)
@ 0x7fdf8a2bc6db start_thread
@ 0x7fdf89fe561f clone
Fatal Python error: Aborted

Thread 0x00007fdf8a6ed740 (most recent call first):
File "/opt/conda/lib/python3.7/site-packages/oneflow/framework/tensor.py", line 985 in _numpy
File "/home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision/benchmark/test_alexnet.py", line 22 in run_alexnet_batch_size16
File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 97 in runner
File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 222 in _raw_pedantic
File "/opt/conda/lib/python3.7/site-packages/pytest_benchmark/fixture.py", line 140 in pedantic
File "/home/ci-user/runners/provision/_work/get-oneflow/get-oneflow/flow_vision/benchmark/test_alexnet.py", line 27 in test_alexnet_batch_size16
File "/opt/conda/lib/python3.7/site-packages/_pytest/python.py", line 192 in pytest_pyfunc_call
File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in call
File "/opt/conda/lib/python3.7/site-packages/_pytest/python.py", line 1761 in runtest
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 166 in pytest_runtest_call
File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in call
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 259 in <lambda>
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 338 in from_call
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 259 in call_runtest_hook
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 219 in call_and_report
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 130 in runtestprotocol
File "/opt/conda/lib/python3.7/site-packages/_pytest/runner.py", line 111 in pytest_runtest_protocol
File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in call
File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 347 in pytest_runtestloop
File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in call
File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 322 in _main
File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 268 in wrap_session
File "/opt/conda/lib/python3.7/site-packages/_pytest/main.py", line 315 in pytest_cmdline_main
File "/opt/conda/lib/python3.7/site-packages/pluggy/_callers.py", line 39 in _multicall
File "/opt/conda/lib/python3.7/site-packages/pluggy/_manager.py", line 80 in _hookexec
File "/opt/conda/lib/python3.7/site-packages/pluggy/_hooks.py", line 265 in call
File "/opt/conda/lib/python3.7/site-packages/_pytest/config/__init__.py", line 165 in main
File "/opt/conda/lib/python3.7/site-packages/_pytest/config/__init__.py", line 187 in console_main
File "/opt/conda/lib/python3.7/site-packages/pytest/__main__.py", line 5 in <module>
File "/opt/conda/lib/python3.7/runpy.py", line 85 in _run_code
File "/opt/conda/lib/python3.7/runpy.py", line 193 in _run_module_as_main
Error: Error: The process '/usr/bin/docker' failed with exit code 134


howin98 commented Mar 29, 2022

Thanks a lot for any suggestions.
