
[LLM serving] Fix timeout setting bug #2398

Closed
wants to merge 75 commits into from
Changes from all commits
75 commits
9e7ef07
add codee
Sep 25, 2023
4aa21bd
add copyright
jiangjiajun Sep 25, 2023
6fd06f7
fix some bugs
jiangjiajun Sep 25, 2023
9f569a7
Update prefix_utils.py
jiangjiajun Sep 25, 2023
bdf2748
Update triton_model.py
jiangjiajun Sep 25, 2023
52aaffb
Update triton_model.py
jiangjiajun Sep 25, 2023
0199cac
fix tokenizer
jiangjiajun Sep 25, 2023
fa151a7
Add check for prefix len
jiangjiajun Sep 25, 2023
800c6a9
Create README.md
jiangjiajun Sep 26, 2023
91ea8bb
Create test_client.py
jiangjiajun Sep 26, 2023
5e8221e
Update task.py
jiangjiajun Sep 26, 2023
9897924
add debug log and fix ptuning
jiangjiajun Sep 26, 2023
8d1e691
update version
jiangjiajun Sep 26, 2023
388eb9b
Update triton_model.py
jiangjiajun Oct 7, 2023
68c15b6
Update README.md
jiangjiajun Oct 8, 2023
30a3beb
Support chatglm-6b (#2223)
jiangjiajun Oct 10, 2023
b96a92b
Support bloom (#2232)
jiangjiajun Oct 11, 2023
80bb8ed
Support multicards (#2234)
jiangjiajun Oct 11, 2023
986b233
[LLM] Add prefix for chatglm (#2233)
rainyfly Oct 12, 2023
9fa04c3
Update engine.py
jiangjiajun Oct 12, 2023
e6a7d4e
[LLM] Fix P-Tuning difference (#2240)
jiangjiajun Oct 13, 2023
51d8697
[LLM] Support prefix for bloom (#2237)
rainyfly Oct 16, 2023
73c1507
Support bloom prefix (#2245)
rainyfly Oct 17, 2023
528e976
[LLM] Fix serving (#2246)
jiangjiajun Oct 18, 2023
1cbbaee
fix chatglm
jiangjiajun Oct 18, 2023
2f2c824
Update config.py
jiangjiajun Oct 18, 2023
66a4897
[LLM] Support bloom prefix (#2248)
rainyfly Oct 19, 2023
4d956d3
[LLM] Add simple client
jiangjiajun Oct 19, 2023
a5a261b
add requirements
jiangjiajun Oct 19, 2023
4c21588
[LLM] Support dynamic batching for chatglm (#2251)
jiangjiajun Oct 20, 2023
8ff24d6
[LLM] Support dybatch for bloom (#2255)
jiangjiajun Oct 20, 2023
3a4f8a9
remove +1 for chatglm
jiangjiajun Oct 20, 2023
e5da0f1
Update setup.py
jiangjiajun Oct 20, 2023
6da9555
Add check for prefix and compatible with lite
jiangjiajun Oct 24, 2023
10eefcb
add requires
jiangjiajun Oct 24, 2023
fb0f276
Support gpt
jiangjiajun Oct 24, 2023
7193337
Fix triton model problem
jiangjiajun Oct 25, 2023
70f8469
Update version
jiangjiajun Oct 25, 2023
b116e3e
Add some tools
jiangjiajun Oct 26, 2023
b86524e
test
Nov 6, 2023
2e6bc1a
Update triton_model.py
jiangjiajun Nov 7, 2023
cabebc3
Update setup.py
jiangjiajun Nov 7, 2023
cdc0ff2
Update README.md
jiangjiajun Nov 7, 2023
7c23864
test FastDeploy
Nov 7, 2023
cca470f
Merge branch 'llm' into llm
karagg Nov 7, 2023
fb5d5c5
test
Nov 8, 2023
2d2274c
[LLM] Add ci test scripts (#2272)
karagg Nov 9, 2023
a55837e
delete run.sh
Nov 14, 2023
f9c8581
Merge branch 'PaddlePaddle:llm' into llm
karagg Nov 14, 2023
1f76abf
delete run.sh
Nov 14, 2023
9c6b2de
update run.sh
Nov 14, 2023
ceb49a4
update run.sh ci.py
Nov 14, 2023
9499199
update ci.py
Nov 15, 2023
8bf70a1
update ci.py
Nov 15, 2023
6e15209
[LLM]update ci test script (#2285)
karagg Nov 15, 2023
be12232
debug
Nov 15, 2023
f884c1a
debug
Nov 15, 2023
57e7608
Merge pull request #2286 from karagg/llm
Zeref996 Nov 15, 2023
7b80d70
debug
Nov 15, 2023
bb68a7e
Merge pull request #2288 from karagg/llm
Zeref996 Nov 16, 2023
6cb1474
debug
Nov 16, 2023
71652e3
Merge pull request #2289 from karagg/llm
Zeref996 Nov 16, 2023
261e519
update run.sh
Nov 17, 2023
836d21f
add comment
Nov 20, 2023
87f53ea
do not merge
Nov 20, 2023
66c4563
Rename test_max_batch_size.sh to test_max_batch_size.py
jiangjiajun Nov 23, 2023
79e6a1e
update
Dec 4, 2023
3376284
Merge pull request #2291 from karagg/llm
Zeref996 Dec 5, 2023
fda8c37
Improve robustness for llm (#2321)
rainyfly Dec 14, 2023
cc89731
detail log for llm (#2325)
rainyfly Dec 14, 2023
67ca253
Fix a bug for llm serving (#2326)
rainyfly Dec 14, 2023
7bddc67
Add warning for server hangs (#2333)
rainyfly Dec 27, 2023
c18abc6
Add fastapi support (#2371)
rainyfly Feb 27, 2024
a843a3c
Add fastapi support (#2383)
rainyfly Feb 28, 2024
6b127d5
Fix timeout setting bug
rainyfly Mar 6, 2024
69 changes: 69 additions & 0 deletions llm/README.md
@@ -0,0 +1,69 @@
# Environment Setup

- Step 1. Install the develop version of PaddlePaddle
- Step 2. Install PaddleNLP from source
- Step 3. Enter PaddleNLP/csrc in the source tree and run `python3 setup_cuda.py install --user` to install the custom operators


## Export the Model
```
cd PaddleNLP/llm
python export_model.py \
    --model_name_or_path meta-llama/Llama-2-7b-chat \
    --output_path ./inference \
    --dtype float16
```

## Test the Model Locally

```
wget https://bj.bcebos.com/paddle2onnx/third_libs/inputs_63.jsonl
mkdir res
```
The test script is shown below; prediction results will be saved in the `res` directory.
```
import fastdeploy_llm as fdlm
import copy

config = fdlm.Config("chatglm-6b")
config.max_batch_size = 1
config.mp_num = 1
config.max_dec_len = 1024
config.max_seq_len = 1024
config.decode_strategy = "sampling"
config.stop_threshold = 2
config.disable_dynamic_batching = 1
config.max_queue_num = 512
config.is_ptuning = 0

# Each line of inputs_63.jsonl is a dict literal whose prompt is stored under "src"
inputs = list()
with open("inputs_63.jsonl", "r") as f:
    for line in f:
        data = eval(line.strip())
        prompt = data["src"]
        inputs.append((prompt, data))

model = fdlm.ServingModel(config)


# Called once per generated token; appends the token tuple to res/<task_id>
def call_back(call_back_task, token_tuple, index, is_last_token, sender=None):
    with open("res/{}".format(call_back_task.task_id), "a+") as f:
        f.write("{}\n".format(token_tuple))


for i, ipt in enumerate(inputs):
    task = fdlm.Task()
    task.text = ipt[0]
    task.max_dec_len = 1024
    task.min_dec_len = 1
    task.penalty_score = 1.0
    task.temperature = 1.0
    task.topp = 0.0
    task.frequency_score = 0.0
    task.eos_token_id = 2
    task.presence_score = 0.0
    task.task_id = i
    task.call_back_func = call_back
    model.add_request(task)

model.start()
# Stop accepting new requests; once all queued requests are processed, everything exits on its own
model.stop()
```
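Each task's output is written as one token tuple per line under `res/<task_id>`. As a follow-up, here is a small sketch (not part of this PR) for inspecting those files after the run completes:
```
import os

# Count the generated tokens recorded for each task by the callback above
for name in sorted(os.listdir("res")):
    with open(os.path.join("res", name)) as f:
        num_tokens = sum(1 for line in f if line.strip())
    print("task {}: {} tokens".format(name, num_tokens))
```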
20 changes: 20 additions & 0 deletions llm/fastdeploy_llm/__init__.py
@@ -0,0 +1,20 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from .model import Model
from .serving.serving_model import ServingModel
from .task import Task, BatchTask
from .config import Config
from . import utils
from .client import GrpcClient
1 change: 1 addition & 0 deletions llm/fastdeploy_llm/client/__init__.py
@@ -0,0 +1 @@
from .grpc_client import GrpcClient
191 changes: 191 additions & 0 deletions llm/fastdeploy_llm/client/grpc_client.py
@@ -0,0 +1,191 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import queue
import json
import sys
from functools import partial
import os
import time
import numpy as np
import subprocess
from fastdeploy_llm.utils.logging_util import logger
# tritonclient is an optional dependency; it is only needed when a request is
# actually issued, so an import failure is tolerated here.
try:
    import tritonclient.grpc as grpcclient
    from tritonclient.utils import *
except ImportError:
    pass


class UserData:
    """Holds the streamed responses pushed back by the Triton callback."""

    def __init__(self):
        self._completed_requests = queue.Queue()


def callback(user_data, result, error):
    # Triton stream callback: enqueue either the result or the error object
    if error:
        user_data._completed_requests.put(error)
    else:
        user_data._completed_requests.put(result)


class GrpcClient:
    def __init__(self,
                 url: str,
                 model_name: str,
                 model_version: str="1",
                 timeout: int=1000000,
                 openai_port: int=None):
        """
        Args:
            url (`str`): inference server grpc url
            model_name (`str`): name of the deployed model on the server
            model_version (`str`): default "1"
            timeout (`int`): inference timeout in seconds
            openai_port (`int`)
        """
        self._model_name = model_name
        self._model_version = model_version
        self.timeout = timeout
        self.url = url

    def generate(self,
                 prompt: str,
                 request_id: str="0",
                 top_p: float=0.0,
                 temperature: float=1.0,
                 max_dec_len: int=1024,
                 min_dec_len: int=2,
                 penalty_score: float=1.0,
                 frequency_score: float=0.99,
                 eos_token_id: int=2,
                 presence_score: float=0.0,
                 stream: bool=False):
        import tritonclient.grpc as grpcclient

        req_dict = {
            "text": prompt,
            "topp": top_p,
            "temperature": temperature,
            "max_dec_len": max_dec_len,
            "min_dec_len": min_dec_len,
            "penalty_score": penalty_score,
            "frequency_score": frequency_score,
            "eos_token_id": eos_token_id,
            "model_test": "test",
            "presence_score": presence_score
        }

        inputs = [
            grpcclient.InferInput("IN", [1], np_to_triton_dtype(np.object_))
        ]
        outputs = [grpcclient.InferRequestedOutput("OUT")]

        in_data = np.array([json.dumps(req_dict)], dtype=np.object_)

        user_data = UserData()
        with grpcclient.InferenceServerClient(
                url=self.url, verbose=False) as triton_client:
            triton_client.start_stream(callback=partial(callback, user_data))
            inputs[0].set_data_from_numpy(in_data)
            triton_client.async_stream_infer(
                model_name=self._model_name,
                inputs=inputs,
                request_id=request_id,
                outputs=outputs)
            response = dict()
            response["token_ids"] = list()
            response["token_strs"] = list()
            response["input"] = req_dict
            # Collect streamed chunks until the server marks the last one
            while True:
                data_item = user_data._completed_requests.get(
                    timeout=self.timeout)
                if type(data_item) == InferenceServerException:
                    logger.error(
                        "Error happened while generating, status={}, msg={}".
                        format(data_item.status(), data_item.message()))
                    response["error_info"] = (data_item.status(),
                                              data_item.message())
                    break
                else:
                    results = data_item.as_numpy("OUT")[0]
                    data = json.loads(results)
                    response["token_ids"] += data["token_ids"]
                    response["token_strs"].append(data["result"])
                    if data.get("is_end", False):
                        break
        return response

    def async_generate(self,
                       prompt: str,
                       request_id: str="0",
                       top_p: float=0.0,
                       temperature: float=1.0,
                       max_dec_len: int=1024,
                       min_dec_len: int=2,
                       penalty_score: float=1.0,
                       frequency_score: float=0.99,
                       eos_token_id: int=2,
                       presence_score: float=0.0,
                       stream: bool=False):
        import tritonclient.grpc as grpcclient

        req_dict = {
            "text": prompt,
            "topp": top_p,
            "temperature": temperature,
            "max_dec_len": max_dec_len,
            "min_dec_len": min_dec_len,
            "penalty_score": penalty_score,
            "frequency_score": frequency_score,
            "eos_token_id": eos_token_id,
            "model_test": "test",
            "presence_score": presence_score
        }

        inputs = [
            grpcclient.InferInput("IN", [1], np_to_triton_dtype(np.object_))
        ]
        outputs = [grpcclient.InferRequestedOutput("OUT")]

        in_data = np.array([json.dumps(req_dict)], dtype=np.object_)

        user_data = UserData()
        with grpcclient.InferenceServerClient(
                url=self.url, verbose=False) as triton_client:
            triton_client.start_stream(callback=partial(callback, user_data))
            inputs[0].set_data_from_numpy(in_data)
            triton_client.async_stream_infer(
                model_name=self._model_name,
                inputs=inputs,
                request_id=request_id,
                outputs=outputs)
            # Yield each streamed chunk as soon as it arrives
            while True:
                data_item = user_data._completed_requests.get(
                    timeout=self.timeout)
                if type(data_item) == InferenceServerException:
                    logger.error(
                        "Error happened while generating, status={}, msg={}".
                        format(data_item.status(), data_item.message()))
                    break
                else:
                    results = data_item.as_numpy("OUT")[0]
                    data = json.loads(results)
                    yield data
                    if data.get("is_end", False):
                        break
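For reference, a minimal usage sketch of the `GrpcClient` added in this PR (not part of the diff); the server address and model name below are placeholders for a running Triton inference server deployment:
```
from fastdeploy_llm.client import GrpcClient

# Placeholder address and model name; point these at your own deployment
client = GrpcClient(url="127.0.0.1:8001", model_name="model")

# Blocking call: collects all streamed chunks into a single response dict
response = client.generate(prompt="Hello, who are you?", request_id="demo-0")
if "error_info" in response:
    print("request failed:", response["error_info"])
else:
    print("".join(response["token_strs"]))

# Streaming call: async_generate yields each chunk as it arrives,
# finishing when the server sets "is_end" in the payload
for chunk in client.async_generate(prompt="Hello, who are you?", request_id="demo-1"):
    print(chunk.get("result", ""), end="", flush=True)
```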