Describe the issue:
I created the experiment with nnictl create --config xx --p xxxx.
After a while I ran nnictl experiment --all to check on it and found that it had stopped; dispatcher.log shows the error below. However, the corresponding trial process is still running on the GPU.
The last time I used NNI this error did not occur, so I am not sure what caused it.
Environment:
NNI version: 2.10.1
Training service (local|remote|pai|aml|etc): local
nnimanager.log:
[2024-04-12 18:48:34] INFO (main) Start NNI manager
[2024-04-12 18:48:34] INFO (NNIDataStore) Datastore initialization done
[2024-04-12 18:48:34] INFO (RestServer) Starting REST server at port 8080, URL prefix: "/"
[2024-04-12 18:48:34] INFO (RestServer) REST server started.
[2024-04-12 18:48:35] INFO (NNIManager) Starting experiment: b7edpl94
[2024-04-12 18:48:35] INFO (NNIManager) Setup training service...
[2024-04-12 18:48:35] INFO (LocalTrainingService) Construct local machine training service.
[2024-04-12 18:48:35] INFO (NNIManager) Setup tuner...
[2024-04-12 18:48:35] INFO (NNIManager) Change NNIManager status from: INITIALIZED to: RUNNING
[2024-04-12 18:48:36] INFO (NNIManager) Add event listeners
[2024-04-12 18:48:36] INFO (LocalTrainingService) Run local machine training service.
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: ID,
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}
[2024-04-12 18:48:36] INFO (NNIManager) NNIManager received command from dispatcher: TR, {"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}
[2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 0,
hyperParameters: {
value: '{"parameter_id": 0, "parameter_source": "algorithm", "parameters": {"lr": 0.0002, "beta1": 0.0001, "beta2": 0.999, "lambda_e": 5e-05}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-04-12 18:48:41] INFO (NNIManager) submitTrialJob: form: {
sequenceId: 1,
hyperParameters: {
value: '{"parameter_id": 1, "parameter_source": "algorithm", "parameters": {"lr": 0.001, "beta1": 1e-05, "beta2": 0.9, "lambda_e": 5e-05}, "parameter_index": 0}',
index: 0
},
placementConstraint: { type: 'None', gpus: [] }
}
[2024-04-12 18:48:51] INFO (NNIManager) Trial job ZlXeN status changed from WAITING to RUNNING
[2024-04-12 18:48:51] INFO (NNIManager) Trial job Rh0Pn status changed from WAITING to RUNNING
[2024-04-12 18:49:42] ERROR (tuner_command_channel.WebSocketChannel) Error: Error: tuner_command_channel: Tuner closed connection
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at WebSocket.emit (node:events:538:35)
at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at Socket.emit (node:events:526:28)
at TCP.<anonymous> (node:net:687:12)
dispatcher.log:
[2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) Note: NumExpr detected 64 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
[2024-04-12 18:48:35] INFO (numexpr.utils/MainThread) NumExpr defaulting to 8 threads.
[2024-04-12 18:48:36] INFO (nni.tuner.tpe/MainThread) Using random seed 1314744945
[2024-04-12 18:48:36] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher started
[2024-04-12 18:49:19] ERROR (nni.runtime.msg_dispatcher_base/Thread-2) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
Traceback (most recent call last):
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 108, in command_queue_worker
self.process_command(command, data)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher_base.py", line 154, in process_command
command_handlers[command](data)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/runtime/msg_dispatcher.py", line 144, in handle_report_metric_data
data['value'] = load(data['value'])
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 443, in load
return json_tricks.loads(string, obj_pairs_hooks=hooks, **json_tricks_kwargs)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 259, in loads
return _strip_loads(string, hook, True, **jsonkwargs)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/nonp.py", line 266, in _strip_loads
return json_loads(string, object_pairs_hook=object_pairs_hook, **jsonkwargs)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/__init__.py", line 370, in loads
return cls(**kw).decode(s)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/json/decoder.py", line 353, in raw_decode
obj, end = self.scan_once(s, idx)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/decoders.py", line 46, in __call__
map = hook(map, properties=self.properties)
File "/home/yiran/.local/lib/python3.8/site-packages/json_tricks/utils.py", line 66, in wrapper
return encoder(*args, **{k: v for k, v in kwargs.items() if k in names})
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 877, in _json_tricks_any_object_decode
return _wrapped_cloudpickle_loads(b)
File "/home/yiran/.local/lib/python3.8/site-packages/nni/common/serializer.py", line 883, in _wrapped_cloudpickle_loads
return cloudpickle.loads(b)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/storage.py", line 161, in _load_from_bytes
return torch.load(io.BytesIO(b))
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 151, in _cuda_deserialize
device = validate_cuda_device(location)
File "/home/yiran/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/serialization.py", line 135, in validate_cuda_device
raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
[2024-04-12 18:49:40] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher exiting...
[2024-04-12 18:49:42] INFO (nni.runtime.msg_dispatcher_base/MainThread) Dispatcher terminiated
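The dispatcher traceback points at the metric payload: the trial apparently reported a value that still contained a CUDA torch.Tensor, NNI cloudpickled it, and the tuner process (which has no GPU visibility) then failed inside torch.load. A minimal sketch of a workaround, assuming the trial reports metrics somewhere in k+1_gan.py, is to flatten every metric to a plain Python number before calling nni.report_intermediate_result / nni.report_final_result. The helper below is hypothetical, not part of NNI:

```python
def to_plain_metric(value):
    """Convert tensors / numpy scalars inside a metric to plain floats.

    torch.Tensor and numpy scalar types both expose .item(); converting up
    front means the dispatcher never has to unpickle CUDA storage.
    """
    if hasattr(value, "item"):      # torch.Tensor, numpy scalar, ...
        return float(value.item())
    if isinstance(value, dict):     # NNI also accepts dict metrics
        return {k: to_plain_metric(v) for k, v in value.items()}
    return float(value)

# In the trial script this would be used as, e.g.:
#   nni.report_final_result(to_plain_metric(best_acc))
```

With the metric reduced to plain floats, the dispatcher only has to decode ordinary JSON and never touches torch's CUDA deserialization path.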
Configuration:
Experiment config (remember to remove secrets!):
trialCommand: CUDA_VISIBLE_DEVICES=0 python k+1_gan.py
trialConcurrency: 2
maxTrialNumber: 1000
maxExperimentDuration: 200h
experimentWorkingDirectory: "/home/yiran/codes/Knowledge-Enriched-DMI/nni-experiment"
tuner:
  name: TPE
  classArgs:
    optimize_mode: maximize
trainingService:
  platform: local
Search space:
{
  "lr": {"_type": "choice", "_value": [0.00005, 0.0001, 0.0002, 0.0005, 0.001]},
  "beta1": {"_type": "choice", "_value": [0.001, 0.0001, 0.00001]},
  "beta2": {"_type": "choice", "_value": [0.9, 0.999]},
  "lambda_e": {"_type": "choice", "_value": [0.00005]}
}
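For context, the trial script reads one sample from this search space via nni.get_next_parameter(). The sketch below only validates the JSON above and shows the shape of a sampled configuration; sample_first_choice is a hypothetical stand-in for the TPE tuner's actual sampling:

```python
import json

# Search space copied from above.
SEARCH_SPACE = '''
{
  "lr": {"_type": "choice", "_value": [0.00005, 0.0001, 0.0002, 0.0005, 0.001]},
  "beta1": {"_type": "choice", "_value": [0.001, 0.0001, 0.00001]},
  "beta2": {"_type": "choice", "_value": [0.9, 0.999]},
  "lambda_e": {"_type": "choice", "_value": [0.00005]}
}
'''

def sample_first_choice(space):
    # Stand-in for the tuner: pick the first "_value" entry of every parameter.
    return {name: spec["_value"][0] for name, spec in space.items()}

space = json.loads(SEARCH_SPACE)
params = sample_first_choice(space)
# In k+1_gan.py the real call would be: params = nni.get_next_parameter()
```

The hyperparameter dicts in the TR commands logged above (e.g. {"lr": 0.0002, "beta1": 0.0001, ...}) are exactly such samples.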
nnictl stdout and stderr:
Experiment b7edpl94 start: 2024-04-12 18:48:34.614673
node:events:504
throw er; // Unhandled 'error' event
^
Error: tuner_command_channel: Tuner closed connection
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at WebSocket.emit (node:events:538:35)
at WebSocket.emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at Socket.socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at Socket.emit (node:events:526:28)
at TCP.<anonymous> (node:net:687:12)
Emitted 'error' event at:
at WebSocketChannelImpl.handleError (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:135:22)
at WebSocket.handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:14)
at WebSocket.emit (node:events:538:35)
[... lines matching original stack trace ...]
at TCP.<anonymous> (node:net:687:12)
Thrown at:
at handleWsClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/core/tuner_command_channel/websocket_channel.js:83:26)
at emit (node:events:538:35)
at emitClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:246:10)
at socketOnClose (/home/yiran/.local/lib/python3.8/site-packages/nni_node/node_modules/express-ws/node_modules/ws/lib/websocket.js:1127:15)
at emit (node:events:526:28)
at node:net:687:12
How to reproduce it?: