(pid=91343) [2023-11-30 19:53:26,142 E 91343 91820] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(pid=98374) E1130 19:53:52.486123357 99023 chttp2_transport.cc:2761] keepalive_ping_end state error: 0 (expect: 1)
2023-11-30 19:53:53,141 WARNING worker.py:2074 -- The node with node id: 19a548a9a5a2ece8fe9589ae0846a32901d76bf544f5e286201bb770 and address: 10.168.0.2 and node name: 10.168.0.2 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
File "/home/francescovirga/llmperf/llmperf.py", line 480, in <module>
query_results = endpoint_evaluation(endpoint_config, sample_lines)
File "/home/francescovirga/llmperf/llmperf.py", line 270, in endpoint_evaluation
results = ray.get(futures)
File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
return func(*args, **kwargs)
File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(pid=98046) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. [repeated 74x across cluster]
(validate pid=68909) [2023-11-30 19:53:53,594 E 68909 69441] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 205x across cluster]
(pid=85196) E1130 19:53:51.788929082 85763 chttp2_transport.cc:2761] keepalive_ping_end state error: 0 (expect: 1) [repeated 666x across cluster]
I'm finding that this tool often crashes when running tests at concurrency levels >500.
Here's the command I'm running:
python llmperf.py -f openai -r 2000 -c 2000 -m "mistralai/Mistral-7B-v0.1"
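In case it helps with reproducing this, here is a minimal sketch for sweeping the concurrency level to see where the crash starts. It reuses only the flags from the command above and assumes `llmperf.py` is in the current directory. The helper names (`build_command`, `sweep`) and the example levels are hypothetical, and setting `-r` equal to `-c` simply mirrors my command. It also assumes a crashed run exits with a non-zero status, which should hold when the run ends in an uncaught traceback.

```python
import subprocess
import sys

MODEL = "mistralai/Mistral-7B-v0.1"


def build_command(concurrency, script="llmperf.py"):
    """Build the llmperf invocation for one concurrency level (hypothetical helper)."""
    return [
        sys.executable, script,
        "-f", "openai",
        "-r", str(concurrency),   # number of requests, set equal to concurrency as in the report
        "-c", str(concurrency),   # concurrency level under test
        "-m", MODEL,
    ]


def sweep(levels, run=subprocess.run):
    """Run each level in order; return the first level whose process exits non-zero, else None."""
    for level in levels:
        result = run(build_command(level))
        if result.returncode != 0:
            return level
    return None

# Example (not executed here): sweep([250, 500, 1000, 2000])
```

Running the levels in ascending order should narrow down roughly where the failures begin, without having to start at 2000 each time.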
Here's the error log I'm seeing: