
[BUG] Connection broken: InvalidChunkLength error when clients call the service with multiple concurrent requests after enabling vllm acceleration #3885

Closed · sweetautumn opened this issue on Apr 26, 2024 · 3 comments
Labels: bug (Something isn't working)

@sweetautumn commented:
Problem Description
After starting the service with vllm acceleration enabled:
Issue 1: If the very first call consists of multiple concurrent requests, the calls fail because the model has not finished loading; if the first call is a single request, it returns a result normally.
Issue 2: Under concurrent calls, the following error occasionally appears: requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

Steps to Reproduce
1. Enable vllm acceleration:
FSCHAT_MODEL_WORKERS = {
    # Default settings shared by all models; can be overridden in each model's own entry.
    "default": {
        "host": DEFAULT_BIND_HOST,
        "port": 30002,
        "device": LLM_DEVICE,
        # False or 'vllm': the inference acceleration framework to use. If vllm runs into
        # HuggingFace connectivity problems, see doc/FAQ.
        # vllm support for some models is still immature, so it is disabled by default.
        "infer_turbo": 'vllm',

        "max_parallel_loading_workers": 3,
        "enforce_eager": False,
        "max_context_len_to_capture": 2048,
        "max_model_len": 2048,

        # Parameters needed for multi-GPU loading in model_worker
        # "gpus": None,  # GPUs to use, given as a string such as "0,1"; if this has no effect, set CUDA_VISIBLE_DEVICES="0,1" instead
        # "num_gpus": 1,  # number of GPUs to use
        # "max_gpu_memory": "20GiB",  # maximum GPU memory used per GPU

        # Less commonly used model_worker parameters; configure as needed
        # "load_8bit": False,  # enable 8-bit quantization
        # "cpu_offloading": None,
        # "gptq_ckpt": None,
        # "gptq_wbits": 16,
        # "gptq_groupsize": -1,
        # "gptq_act_order": False,
        # "awq_ckpt": None,
        # "awq_wbits": 16,
        # "awq_groupsize": -1,
        # "model_names": LLM_MODELS,
        # "conv_template": None,
        # "limit_worker_concurrency": 5,
        # "stream_interval": 2,
        # "no_register": False,
        # "embed_in_truncate": False,

        # vllm_worker parameters below; vllm requires a GPU and has only been tested on Linux

        # tokenizer = model_path  # add here if the tokenizer differs from model_path
        'tokenizer_mode': 'auto',
        'trust_remote_code': True,
        'download_dir': None,
        'load_format': 'auto',
        'dtype': 'auto',
        'seed': 0,
        'worker_use_ray': False,
        'pipeline_parallel_size': 1,
        'tensor_parallel_size': 1,
        'block_size': 16,
        'swap_space': 4,  # GiB
        'gpu_memory_utilization': 0.80,
        'max_num_batched_tokens': 2560,
        'max_num_seqs': 256,
        'disable_log_stats': False,
        'conv_template': None,
        'limit_worker_concurrency': 3,
        'no_register': False,
        'num_gpus': 1,
        'engine_use_ray': False,
        'disable_log_requests': False,
    },
}

2. Start the service:
python startup.py -a

3. Call the service concurrently from Python client code (a sketch of such a client follows below).
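For reference, a minimal sketch of this kind of concurrent client (not the reporter's original script). The endpoint path, port, and payload fields (/chat/chat, 7861, query/stream) are assumptions about the v0.2.x API server and may need adjusting for your deployment:

import concurrent.futures

import requests

API_URL = "http://127.0.0.1:7861/chat/chat"  # assumed API server address and route

def ask(question: str) -> str:
    # Stream the chunked response; this is where ChunkedEncodingError surfaces.
    resp = requests.post(
        API_URL,
        json={"query": question, "stream": True},  # assumed request schema
        stream=True,
        timeout=300,
    )
    resp.raise_for_status()
    return "\n".join(line for line in resp.iter_lines(decode_unicode=True) if line)

if __name__ == "__main__":
    questions = [f"test question {i}" for i in range(8)]
    # Fire several requests at once to reproduce the concurrency issue.
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
        for question, answer in zip(questions, pool.map(ask, questions)):
            print(question, "->", answer[:80])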

Expected Result
The generated answer is returned normally.

Actual Result
With concurrent requests, the service sometimes completes all of them normally; at other times some requests succeed while others fail with:
requests.exceptions.ChunkedEncodingError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))

Environment Information

  • langchain-ChatGLM version / commit: v0.2.10
  • Deployed with Docker (yes/no): no
  • LLM used (ChatGLM2-6B / Qwen-7B, etc.): ChatGLM3-6B
  • Embedding model used (moka-ai/m3e-base, etc.): bge-m3
  • Vector store type (faiss / milvus / pg_vector, etc.): faiss
  • Operating system and version:
  • Python version:
  • Other relevant environment information:

Additional Information
(none provided)

@sweetautumn added the bug label on Apr 26, 2024

dosubot bot commented Apr 26, 2024

To address the issues you're experiencing with Langchain-ChatGLM, particularly after enabling VLLM acceleration, consider the following steps:

  1. Model Loading: Ensure the model is fully loaded before it starts serving requests. Implement a health check that confirms the model's readiness. This can help avoid initial call failures due to the model not being loaded.

  2. Concurrency Settings: The InvalidChunkLength errors suggest a need to adjust how concurrent requests are handled. Experiment with the limit_worker_concurrency parameter to find an optimal setting for your workload.

  3. VLLM Configuration: Since enabling VLLM acceleration introduces the issue, closely review and adjust VLLM-related configurations such as tensor_parallel_size, block_size, swap_space, and gpu_memory_utilization.

  4. System Resources: Monitor and ensure your system resources (CPU, GPU, memory) are not overwhelmed by the load, as this can lead to errors.

  5. Error Handling and Retries: Implement error handling and retry mechanisms for the intermittent InvalidChunkLength errors to improve application stability (see the sketch below).

  6. Software Updates: Consider updating to the latest version of Langchain-ChatGLM if available, as it may contain fixes and improvements for your issues.

These steps are derived from analyzing the provided configuration and the issues you're facing. Adjusting these areas should help mitigate the problems with initial call failures and intermittent InvalidChunkLength errors.
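As a concrete starting point for steps 1 and 5, here is a minimal client-side sketch of a readiness probe and a retry wrapper. It assumes the hypothetical ask() helper from the reproduction sketch above; the probe question, retry counts, and back-off values are illustrative, not project defaults:

import time

import requests

def wait_until_ready(probe_question: str = "ping", timeout_s: float = 300.0) -> None:
    # Poll with a single lightweight request until the model worker answers,
    # so the first burst of concurrent traffic does not hit an unloaded model.
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            ask(probe_question)  # ask() as defined in the earlier sketch
            return
        except requests.RequestException:
            if time.monotonic() > deadline:
                raise
            time.sleep(5)

def ask_with_retry(question: str, attempts: int = 3, backoff_s: float = 2.0) -> str:
    # Retry on broken chunked responses and connection errors with simple back-off.
    for attempt in range(1, attempts + 1):
        try:
            return ask(question)
        except (requests.exceptions.ChunkedEncodingError,
                requests.exceptions.ConnectionError):
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)

Note that this only works around the symptom on the client side; server-side settings such as limit_worker_concurrency and max_num_seqs in the configuration above still determine how many simultaneous streams the worker can serve.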


@Zephyr69 commented:

Same here for issue 1: only the very first call has this problem.

@zRzRzRzRzRzRzR self-assigned this on May 5, 2024

@zRzRzRzRzRzRzR (Collaborator) commented:

This project does not implement concurrency; you will need to handle concurrent use on your own.
