OSError: [Errno 98] Address already in use #1113
Comments
Can you retry it with
Can you give me a tested version of megatron-lm? When I use megatron-lm-main, I get some errors.
You can test it with the repo https://github.com/workingloong/Megatron-LM-CKPT forked from Megatron-LM in 2024.02.
dlrover version: v0.3.5
megatron version: main
I encountered an error when using flash checkpoint in megatron:
Exception in thread checkpoint-saver:
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 422, in _saver
saver: AsyncCheckpointSaver = class_def(**class_meta.kwargs)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 386, in __init__
self._event_queue = SharedQueue(name=qname, create=True)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 369, in __init__
super().__init__(name, create)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 188, in __init__
self._init_socket()
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 210, in _init_socket
self._server = _create_socket_server(self._socket_file)
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/common/multi_process.py", line 71, in _create_socket_server
server.bind(path)
OSError: [Errno 98] Address already in use
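For anyone trying to reproduce this outside of training: the traceback ends in `server.bind(path)` on a socket file path, so a minimal sketch with a plain Unix domain socket may help. It assumes the socket is an `AF_UNIX` one and that the path is still occupied, either by a leftover file from an earlier run or by another process; neither is confirmed for this report. `bind_unix_socket` and `demo` are hypothetical names for illustration, not dlrover functions.

```python
import errno
import os
import socket
import tempfile


def bind_unix_socket(path, remove_stale=False):
    """Bind an AF_UNIX stream socket to `path` (illustrative helper, not dlrover's API)."""
    if remove_stale and os.path.exists(path):
        # Assumption: nothing else is still listening on this path.
        os.unlink(path)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(path)
    except OSError:
        server.close()
        raise
    server.listen(1)
    return server


def demo():
    """Return the errno raised when binding to a path whose socket file still exists."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "demo.sock")

    first = bind_unix_socket(path)
    first.close()  # closing the socket does not remove the file on disk

    try:
        bind_unix_socket(path)
        second_errno = None
    except OSError as exc:
        second_errno = exc.errno  # EADDRINUSE, reported as Errno 98 on Linux

    # Removing the stale file first lets the bind succeed.
    third = bind_unix_socket(path, remove_stale=True)
    third.close()
    os.unlink(path)
    os.rmdir(tmpdir)
    return second_errno
```

Unlinking the file is only safe when no live process owns it; if a previous saver process is still running, stopping that process is the cleaner fix.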
Exception ignored in: <function AsyncCheckpointSaver.__del__ at 0x7efed6fbb490>
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 402, in __del__
[2024-05-10 07:57:02,115] [INFO] [ckpt_saver.py:429:_factory] Start the checkpoint saver factory.
self.close()
File "/usr/local/lib/python3.10/dist-packages/dlrover/python/elastic_agent/torch/ckpt_saver.py", line 494, in close
if not self._event_queue.empty():
AttributeError: 'MegatronCheckpointSaver' object has no attribute '_event_queue'
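The second traceback looks like a follow-on effect of the first: `__init__` raised before `_event_queue` was assigned, and Python still ran `__del__` on the half-built object. A minimal sketch of that pattern and one possible guard is below. The class names are made up for illustration and this is not how dlrover implements or fixes it.

```python
class UnguardedSaver:
    """Mimics the failing pattern: close() assumes __init__ finished."""

    def __init__(self, fail=False):
        if fail:
            raise OSError("simulated bind failure")
        self._event_queue = []

    def close(self):
        # Raises AttributeError when __init__ failed before the assignment.
        return len(self._event_queue) == 0


class GuardedSaver:
    """Same shape, but close() tolerates a partially initialized object."""

    def __init__(self, fail=False):
        if fail:
            raise OSError("simulated bind failure")
        self._event_queue = []

    def close(self):
        queue = getattr(self, "_event_queue", None)
        if queue is None:
            return "nothing to close"
        queue.clear()
        return "closed"

    def __del__(self):
        # Called even if __init__ raised, because the object already exists.
        self.close()


def demo():
    """Call close() on objects whose __init__ never ran."""
    # object.__new__ creates the instance without running __init__, which is
    # the state an object is in when __init__ raises early.
    broken = UnguardedSaver.__new__(UnguardedSaver)
    try:
        broken.close()
        unguarded = "no error"
    except AttributeError:
        unguarded = "AttributeError"

    partial = GuardedSaver.__new__(GuardedSaver)
    return unguarded, partial.close()
```

With a guard like this the `Exception ignored in ... __del__` message would go away, but the underlying `OSError` from the bind would still need to be resolved.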