
Error running evaluation #9

Open
jarvishou829 opened this issue Aug 30, 2023 · 3 comments

Comments


jarvishou829 commented Aug 30, 2023


warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[                                                  ] 4/6019, 0.4 task/s, elapsed: 9s, ETA: 14005s/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
[>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>] 6084/6019, 21.4 task/s, elapsed: 284s, ETA:    -2sWARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795432 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795433 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 795434 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 3 (pid: 795435) of binary: /opt/conda/envs/fbocc/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/envs/fbocc/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 317, in run_module
    run_module_as_main(options.target, alter_argv=True)
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 238, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/.vscode-server/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/fbocc/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
tools_mm/test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-08-30_01:11:31
  host      : fd-taxi-f-houjiawei-1691719963348-985c0990-3609275187
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 795435)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 795435
============================================================

The evaluation ran 6084 samples rather than 6019 and then exited unexpectedly.
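For context: the count overshoot by itself is usually harmless. Distributed testing typically pads the dataset so every rank processes the same number of samples, and mmcv-style progress bars advance in multiples of the world size, so the displayed total can exceed the real dataset size. The part that matters is exitcode -9: a negative exit code means the worker was killed by a signal (here Signal 9, SIGKILL, as the log itself states), and on Linux that usually points to the kernel OOM killer rather than a Python crash. A minimal, self-contained sketch (not code from this repo) showing how a negative return code decodes to a signal name:

```python
import signal
import subprocess

# Python's subprocess convention (which the per-worker "exitcode" in the torch
# elastic report also reflects): a negative return code means the child was
# terminated by a signal, so -9 decodes to SIGKILL.
def describe_exit(returncode: int) -> str:
    if returncode < 0:
        return f"killed by signal {signal.Signals(-returncode).name}"
    return f"exited normally with status {returncode}"

if __name__ == "__main__":
    proc = subprocess.run(["python", "-c", "print('hello from child')"])
    print(describe_exit(proc.returncode))   # exited normally with status 0
    print(describe_exit(-9))                # killed by signal SIGKILL
```

Checking the kernel log (e.g. dmesg) for oom-killer entries around the crash time would help confirm whether the host ran out of memory.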

@jarvishou829
Author

It seems to be a memory overflow issue. Memory usage increases abnormally when the evaluation is almost finished.
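If the codebase follows the usual mmcv multi-GPU test pattern, the per-rank predictions are gathered onto one process at the end of the loop, which would explain a memory spike right before the run finishes; that is an assumption about this repo, not something verified here. One way to confirm the overflow is to log host memory in the background while the evaluation runs. A minimal sketch using psutil (the helper name start_memory_logger is made up, and psutil may need to be installed separately):

```python
import threading
import time

import psutil  # third-party; assumed installed (pip install psutil)

def start_memory_logger(interval_s: float = 5.0) -> threading.Thread:
    """Print process RSS and host memory usage every `interval_s` seconds."""
    def _loop() -> None:
        proc = psutil.Process()
        while True:
            vm = psutil.virtual_memory()
            rss_gib = proc.memory_info().rss / 1024 ** 3
            print(f"[mem] RSS {rss_gib:.1f} GiB | "
                  f"host {vm.percent:.0f}% of {vm.total / 1024 ** 3:.0f} GiB used",
                  flush=True)
            time.sleep(interval_s)

    # daemon thread: it dies with the main process and never blocks shutdown
    t = threading.Thread(target=_loop, daemon=True)
    t.start()
    return t

if __name__ == "__main__":
    start_memory_logger(interval_s=2.0)
    time.sleep(6)  # stand-in for the evaluation loop
```

Calling something like this near the top of tools_mm/test.py should make it obvious whether host RAM climbs toward the limit once result collection starts.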

@lubinBoooos

You can try running the evaluation on a single GPU.

@tanatomoe

Hi, I get the same error. I've read that changing the batch size can fix it, but unfortunately I can't figure out how to do that. If you try it and it works, could you tell me how? It would be greatly appreciated.
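For what it's worth, the stack trace looks like an MMCV/MMDetection-style setup, where the test-time batch size usually comes from samples_per_gpu in the data dict of the config file; that is an assumption about this repo's config layout, not something confirmed in this thread. A sketch of the kind of override one might try in the config passed to tools_mm/test.py:

```python
# Hypothetical snippet for the evaluation config (MMCV-style Python config).
# samples_per_gpu is the per-GPU batch size; workers_per_gpu is the number of
# dataloader worker processes, which also contribute to host-memory usage.
data = dict(
    samples_per_gpu=1,
    workers_per_gpu=1,
)
```

If host memory is the real culprit, lowering workers_per_gpu can help as much as the batch size, since each dataloader worker is a separate process with its own memory footprint.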
