Crashing with a large number of concurrent users #17

Open
francescov1 opened this issue Nov 30, 2023 · 0 comments
I'm finding that this tool often crashes when running tests at concurrency levels >500.

Here's the command I'm running: python llmperf.py -f openai -r 2000 -c 2000 -m "mistralai/Mistral-7B-v0.1"

Here's the error log I'm seeing:

(pid=91343) [2023-11-30 19:53:26,142 E 91343 91820] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate.
(pid=98374) E1130 19:53:52.486123357   99023 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1)
2023-11-30 19:53:53,141 WARNING worker.py:2074 -- The node with node id: 19a548a9a5a2ece8fe9589ae0846a32901d76bf544f5e286201bb770 and address: 10.168.0.2 and node name: 10.168.0.2 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a        (1) raylet crashes unexpectedly (OOM, preempted node, etc.)
        (2) raylet has lagging heartbeats due to slow network or busy workload.
Traceback (most recent call last):
  File "/home/francescovirga/llmperf/llmperf.py", line 480, in <module>
    query_results = endpoint_evaluation(endpoint_config, sample_lines)
  File "/home/francescovirga/llmperf/llmperf.py", line 270, in endpoint_evaluation
    results = ray.get(futures)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/francescovirga/.local/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(pid=98046) None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used. [repeated 74x across cluster]
(validate pid=68909) [2023-11-30 19:53:53,594 E 68909 69441] gcs_rpc_client.h:552: Failed to connect to GCS within 60 seconds. GCS may have been killed. It's either GCS is terminated by `ray stop` or is killed unexpectedly. If it is killed unexpectedly, see the log file gcs_server.out. https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure. The program will terminate. [repeated 205x across cluster]
(pid=85196) E1130 19:53:51.788929082   85763 chttp2_transport.cc:2761]   keepalive_ping_end state error: 0 (expect: 1) [repeated 666x across cluster]
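For reference, one mitigation I'm considering is throttling how many Ray tasks are in flight at once instead of submitting all 2000 and calling ray.get on the full futures list (the failure surfaces at `results = ray.get(futures)`, and the log hints the raylet died from OOM). This is only a rough sketch, not the actual llmperf code — `query_endpoint` and the `max_in_flight` value are placeholders:

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def query_endpoint(prompt):
    # Placeholder for the per-request logic inside endpoint_evaluation.
    return {"prompt": prompt, "ok": True}

def run_throttled(prompts, max_in_flight=256):
    """Submit tasks but keep at most `max_in_flight` pending at any time."""
    pending = []
    results = []
    for prompt in prompts:
        pending.append(query_endpoint.remote(prompt))
        if len(pending) >= max_in_flight:
            # Block until at least one task finishes before submitting more.
            done, pending = ray.wait(pending, num_returns=1)
            results.extend(ray.get(done))
    # Drain whatever is still outstanding.
    results.extend(ray.get(pending))
    return results

if __name__ == "__main__":
    print(len(run_throttled([f"prompt {i}" for i in range(2000)])))
```

The idea is that capping in-flight tasks keeps memory pressure on the single node bounded, which seems consistent with the "(1) raylet crashes unexpectedly (OOM, preempted node, etc.)" hint in the warning above. I haven't confirmed this fixes the GCS timeout, though.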