Got ChkStopThr and IntMsgThr after the training finished #7468

Open
ease-zh opened this issue Apr 24, 2024 · 6 comments
Labels
a:cli Area: Client c:core Component: Core s:nexus-fix Stage: will be fixed with the new sdk backend

Comments

@ease-zh

ease-zh commented Apr 24, 2024

After training finished, wandb logged "Run history", "Run summary" and "Find logs at ...", and then exceptions were raised in two threads: ChkStopThr and IntMsgThr. I am on Ubuntu and had already set "WANDB_START_METHOD=thread" as suggested in #3223; unfortunately, it did not help.
Below is the console log:

wandb: / 2.057 MB of 12.459 MB uploaded (0.003 MB deduped)
wandb: Run history:
wandb: Loss/Step Loss █▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Loss/Train Loss █▆▄▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb: Process/Epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇████
wandb: Process/LR ▂▅█▆▄▁▄▄▄▃▃▂▁▁▃▃▃▃▂▂▂▂▂▂▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂
wandb: Process/Step ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:
wandb: Run summary:
wandb: Loss/Step Loss 5.90382
wandb: Loss/Train Loss 6.65646
wandb: Process/Epoch 59
wandb: Process/LR 0.00097
wandb: Process/Step 606960
wandb:
wandb: 🚀 View run 240422_1425_GPU0_lr0.01-vpl-coslr at: http://localhost:8080/zhangyi/Face%20Recognition/runs/2q5cauhb/workspace
wandb: Synced 7 W&B file(s), 0 media file(s), 3 artifact file(s) and 1 other file(s)
wandb: Find logs at: ./wandb/run-20240422_142504-2q5cauhb/logs
Exception in thread IntMsgThr:
Traceback (most recent call last):
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 300, in check_internal_messages
    self._loop_check_status(
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
                   ^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface.py", line 844, in deliver_internal_messages
    return self._deliver_internal_messages(internal_message)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 516, in _deliver_internal_messages
    return self._deliver_record(record)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
           ^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe

Exception in thread ChkStopThr:
Traceback (most recent call last):
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 286, in check_stop_status
    self._loop_check_status(
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/wandb_run.py", line 224, in _loop_check_status
    local_handle = request()
                   ^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface.py", line 828, in deliver_stop_status
    return self._deliver_stop_status(status)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 494, in _deliver_stop_status
    return self._deliver_record(record)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_shared.py", line 459, in _deliver_record
    handle = mailbox._deliver_record(record, interface=self)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/mailbox.py", line 455, in _deliver_record
    interface._publish(record)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/interface/interface_sock.py", line 51, in _publish
    self._sock_client.send_record_publish(record)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 221, in send_record_publish
    self.send_server_request(server_req)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 155, in send_server_request
    self._send_message(msg)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 152, in _send_message
    self._sendall_with_error_handle(header + data)
  File "/data_111/miniconda3/envs/insightface/lib/python3.11/site-packages/wandb/sdk/lib/sock_client.py", line 130, in _sendall_with_error_handle
    sent = self._sock.send(data)
           ^^^^^^^^^^^^^^^^^^^^^
BrokenPipeError: [Errno 32] Broken pipe

And the debug.log.
The debug-internal.log is too large to upload here; please tell me which part would be useful for diagnosing the problem.

Also, the problem occurs only intermittently, even when I run the same code.
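
For readers unfamiliar with the last line of the traceback: `BrokenPipeError: [Errno 32]` is what a socket `send` typically raises once the other end of the connection has already closed. The sketch below reproduces only that generic mechanism with the Python standard library. It is not wandb's code, the names `poll_peer` and `demo` are hypothetical, and it says nothing about why the peer went away in this particular report.

```python
import socket
import threading


def poll_peer(sock, errors):
    """Hypothetical polling step: try to send one request to the peer."""
    try:
        sock.send(b"status?")
    except (BrokenPipeError, ConnectionResetError) as exc:
        # The peer has already closed its end of the connection.
        errors.append(exc)


def demo():
    # A connected socket pair stands in for "client <-> background service".
    client, service = socket.socketpair()
    service.close()  # the "service" end goes away first

    errors = []
    worker = threading.Thread(
        target=poll_peer, args=(client, errors), name="PollThr"
    )
    worker.start()
    worker.join()
    client.close()
    return errors


if __name__ == "__main__":
    for exc in demo():
        print(type(exc).__name__, exc)
```

In this toy version the exception is caught and recorded; in the log above it is uncaught, which is why Python prints "Exception in thread ..." for each thread.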

@kptkin kptkin added a:cli Area: Client c:core Component: Core s:nexus-fix Stage: will be fixed with the new sdk backend labels Apr 24, 2024
@luisbergua
Contributor

Hey @ease-zh, thanks for flagging this! Would you mind sending the debug-internal.log to [email protected]? I'll take a look at it

@luisbergua
Contributor

Hi @ease-zh, thanks for sharing the debug-internal.log! I took a look at it but didn't see any error messages, mind checking if you see any on your local file and sharing those here if so?
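
When a log file is too large to read or upload in full, one option is to scan it for a few telltale substrings and share only the matching lines. The helper below is a minimal sketch of that idea. The function name `find_suspicious_lines` and the keyword list are assumptions chosen for illustration, and nothing here relies on the actual layout of debug-internal.log beyond it being line-oriented text.

```python
DEFAULT_KEYWORDS = ("ERROR", "Traceback", "BrokenPipe")


def find_suspicious_lines(lines, keywords=DEFAULT_KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword.

    `lines` can be any iterable of strings, for example an open file,
    so the log is streamed rather than loaded into memory at once.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(keyword in line for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_log.py path/to/debug-internal.log
    with open(sys.argv[1], "r", encoding="utf-8", errors="replace") as handle:
        for number, text in find_suspicious_lines(handle):
            print(f"{number}: {text}")
```

An empty result only means none of the chosen substrings appear; it does not rule out a problem logged under different wording.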

@fdsig
Member

fdsig commented May 3, 2024

Hey @ease-zh, following up here for @luisbergua: did you get a chance to locate the debug-internal.log? Feel free to also send it to me at [email protected].

Looking forward to hearing back.

@ease-zh
Author

ease-zh commented May 6, 2024

@luisbergua No, there are no other error messages; everything else looks fine.

@luisbergua
Contributor

Hi @ease-zh, thanks for confirming this! The error might then be on the server side, since it seems you're running wandb against a local server. Have you been able to execute runs successfully in the past using that server? Also, could you please reproduce the error and, right afterwards, pull the Debug Bundle of the instance and share it with us?

@ease-zh
Author

ease-zh commented May 8, 2024

@luisbergua Yes, most of the time runs execute successfully. The errors occurred frequently only around the time I opened this issue, and recent runs have completed normally.
Oh, I just updated wandb; maybe that solved the problem?
Anyway, the next time I hit the error, I will share the debug bundle.
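
Since the behaviour may depend on the installed wandb version and on the start-method setting mentioned at the top of the thread, it can help to record both alongside any future reproduction. The snippet below is a small sketch using only the standard library; `installed_version` and `describe_environment` are hypothetical helper names, and the set of fields is an assumption about what is useful to report.

```python
import os
import platform
import sys
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def describe_environment():
    """Collect a few facts worth pasting into a bug report."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "wandb": installed_version("wandb"),
        # None means the variable is not set in this process.
        "WANDB_START_METHOD": os.environ.get("WANDB_START_METHOD"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

Running it in the same environment as the training job, before and after an upgrade, gives a concrete record of what changed.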

4 participants