
Unhandled exceptions of restful api lead to server hang #1424

Open
frostyplanet opened this issue May 3, 2024 · 0 comments

frostyplanet (Contributor) commented May 3, 2024

Describe the bug

When the RESTful API is accessed with a wrong model_uid and the resulting error is raised enough times (benchmarked with > 100 concurrency), the server ends up completely deadlocked.
The problem is likely an unhandled exception interacting with locks, either in the RESTful API or in xoscar.
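A minimal, hypothetical illustration of the suspected failure mode (this is not xoscar's actual code): if a lock is acquired manually and an exception fires before the matching release, the lock is leaked and every later caller blocks forever.

```python
import asyncio

async def demo():
    lock = asyncio.Lock()

    async def handler_buggy(model_uid):
        # Acquired manually with no try/finally: if the lookup below
        # raises, lock.release() is never reached.
        await lock.acquire()
        if model_uid != "qwen1.5-7":
            raise ValueError(f"Model not found in the model list, uid: {model_uid}")
        lock.release()

    # One failing request, like a benchmark call with a wrong uid.
    try:
        await handler_buggy("qwen1.5-7-1")
    except ValueError:
        pass

    # The failed call never released the lock, so the next caller hangs;
    # we use a short timeout to detect that instead of hanging the demo.
    try:
        await asyncio.wait_for(lock.acquire(), timeout=0.1)
        return False  # lock was free: no deadlock
    except asyncio.TimeoutError:
        return True   # lock still held: deadlocked

deadlocked = asyncio.run(demo())
print("deadlocked:", deadlocked)  # → deadlocked: True
```

Under high concurrency, hundreds of requests queue up behind the leaked lock, which would match the observed server-wide hang.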

To Reproduce

  1. Python version: 3.10.12
  2. Versions of crucial packages:
    xoscar: 0.3.0
    torch: 2.2.2
    vllm: 0.4.1
    transformers: 4.40.1
  3. The version of xinference:
    xinference: commit 7c974be
  4. hardware environment: reproduced on multiple deployments, 4090x8 and A40x8
  5. Steps to reproduce

a) env XINFERENCE_MODEL_SRC=modelscope xinference-local
b) xinference login --username administrator --password administrator
c) launch the model with model_uid "qwen1.5-7":

 xinference launch -u qwen1.5-7 -n qwen1.5-chat -s 7 --max_model_len 8192 --dtype half -f gptq -q Int4 --n-gpu 1 

d) benchmark the model with the wrong model_uid "qwen1.5-7-1":

 env HF_ENDPOINT=https://hf-mirror.com python benchmark/benchmark_serving.py --dataset ~/dataset/ShareGPT_V3_unfiltered_cleaned_split.json --tokenizer qwen/qwen1.5-7B-chat-gptq-int4 --model-uid qwen1.5-7-1 --num-prompts 400 

This raises many exceptions like:

Traceback (most recent call last):                                                      
  File "/home/clouduser/inference/xinference/api/restful_api.py", line 1322, in create_chat_completion            
    model = await (await self._get_supervisor_ref()).get_model(model_uid)
  File "/home/clouduser/anaconda3/lib/python3.11/site-packages/xoscar/backends/context.py", line 227, in send
    return self._process_result_message(result)
  File "/home/clouduser/anaconda3/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/clouduser/anaconda3/lib/python3.11/site-packages/xoscar/backends/pool.py", line 659, in send
    result = await self._run_coro(message.message_id, coro)
  File "/home/clouduser/anaconda3/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/clouduser/anaconda3/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore                        
  File "xoscar/core.pyx", line 558, in __on_receive__                                   
    raise ex                                                                            
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__         
    async with self._lock: 
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__  
    with debug_async_timeout('actor_lock_timeout', 
    ^^^^^^^^^^^^^^^^^ 
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result 
    ^^^^^^^^^^^^^^^^^                                                                   
  File "/home/clouduser/inference/xinference/core/utils.py", line 45, in wrapped
    ret = await func(*args, **kwargs) 
    ^^^^^^^^^^^^^^^^^ 
  File "/home/clouduser/inference/xinference/core/supervisor.py", line 989, in get_model              
    raise ValueError(f"Model not found in the model list, uid: {model_uid}")
    ^^^^^^^^^^^^^^^^^                                                                   
ValueError: [address=127.0.0.1:27897, pid=212329] Model not found in the model list, uid: qwen1.5-7-1

e) All commands hang afterwards, for example:

   xinference terminate --model-uid qwen1.5-7

Expected behavior

The server should not hang after running the benchmark script with an invalid model_uid.
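For contrast, a sketch of the safe pattern, assuming the hang comes from a lock left held by a failing request: acquiring via `async with` (or try/finally) guarantees release even when the handler body raises, so repeated invalid-uid errors cannot leave the lock held. (Note the traceback above already shows `async with self._lock:` in `_BaseActor.__on_receive__`, which does release on exception, so the leaked lock, if that is the cause, would have to be acquired elsewhere without such protection.)

```python
import asyncio

async def demo_fixed():
    lock = asyncio.Lock()

    async def handler_safe(model_uid):
        # async with releases the lock in __aexit__ even if the body raises.
        async with lock:
            if model_uid != "qwen1.5-7":
                raise ValueError(f"Model not found in the model list, uid: {model_uid}")

    # Many failing requests, as in the benchmark with a wrong uid.
    for _ in range(200):
        try:
            await handler_safe("qwen1.5-7-1")
        except ValueError:
            pass

    return lock.locked()

still_locked = asyncio.run(demo_fixed())
print("lock still held:", still_locked)  # → lock still held: False
```

With this pattern, the 400-prompt benchmark against a wrong model_uid would produce 400 clean errors instead of a server-wide hang.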

@XprobeBot XprobeBot added the gpu label May 3, 2024
@XprobeBot XprobeBot added this to the v0.11.0 milestone May 3, 2024
@XprobeBot XprobeBot modified the milestones: v0.11.0, v0.11.1, v0.11.2 May 11, 2024
@XprobeBot XprobeBot modified the milestones: v0.11.2, v0.11.3 May 24, 2024