Add KServe document. (#984)
Signed-off-by: Syulin7 <[email protected]>
Syulin7 committed Aug 29, 2023
1 parent 2029700 commit 14fa45c
Showing 4 changed files with 290 additions and 0 deletions.
5 changes: 5 additions & 0 deletions docs/serving/index.md
@@ -40,3 +40,8 @@ If you want to use arena to manage serving jobs, this guide is for you. we have

* I want to [submit a nvidia triton serving job which use gpus](triton/serving.md).
* I want to [update a nvidia triton serving job after deployed](triton/update-serving.md).

## KServe Job Guide

* I want to [submit a KServe job with a supported serving runtime](kserve/sklearn.md).
* I want to [submit a KServe job with a custom serving runtime](kserve/custom.md).
142 changes: 142 additions & 0 deletions docs/serving/kserve/custom.md
@@ -0,0 +1,142 @@
# KServe job with custom serving runtime

This guide walks through the steps to deploy and serve a custom serving runtime with KServe.

1\. Setup

Follow the [KServe Guide](https://kserve.github.io/website/master/admin/serverless/serverless/) to install KServe.
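
To quickly verify the installation (a sketch; the namespace depends on how KServe was installed):

$ kubectl get pods -n kserve
$ kubectl get crd inferenceservices.serving.kserve.io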

2\. Submit your serving job to KServe

First, create a PVC named 'training-data', and then download the 'bloom-560m' model from HuggingFace into the PVC.
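
A minimal sketch of the PVC, assuming a default StorageClass is available (the size is illustrative; the claim name must match the STORAGE_URI used below):

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
EOF

You can then populate the PVC, for example with a temporary pod that mounts it and downloads bigscience/bloom-560m from HuggingFace into a bloom-560m directory at the PVC root. KServe mounts the PVC at /mnt/models inside the predictor pod, which is why the command below refers to /mnt/models/bloom-560m.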

Deploy an InferenceService with a predictor that loads the BLOOM model with text-generation-inference.

$ arena serve kserve \
--name=bloom-560m \
--image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
--gpus=1 \
--cpu=12 \
--memory=50Gi \
--port=8080 \
--env=STORAGE_URI=pvc://training-data \
"text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"

inferenceservice.serving.kserve.io/bloom-560m created
INFO[0010] The Job bloom-560m has been submitted successfully
INFO[0010] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status

3\. Check the status of the KServe job

$ arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ADDRESS PORTS
bloom-560m KServe 00001 1 1 http://bloom-560m.default-group.example.com :80 1

$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00001
Desired: 1
Available: 1
Age: 7m
Address: http://bloom-560m.default.example.com
Port: :80
GPU: 1

LatestRevision: bloom-560m-predictor-00001
LatestPrecent: 100

Instances:
NAME STATUS AGE READY RESTARTS GPU NODE
---- ------ --- ----- -------- --- ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8 Running 7m 2/2 0 1 192.168.5.241

4\. Perform inference

You can curl the ingress gateway external IP, using the Host header to route the request to the service.
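
If INGRESS_HOST is not set yet, one common way to look it up (assuming Istio is used as the ingress gateway) is:

$ export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')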

$ curl -H "Host: bloom-560m.default.example.com" http://${INGRESS_HOST}:80/generate \
-X POST \
-d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
-H 'Content-Type: application/json'

{"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}

5\. Update the InferenceService with the canary rollout strategy

Add the canaryTrafficPercent field to the predictor component and update the command to use the new model path /mnt/models/bloom-560m-v2.

$ arena serve update kserve \
--name bloom-560m \
--canary-traffic-percent=10 \
"text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8036"

After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00002
Desired: 2
Available: 2
Age: 26m
Address: http://bloom-560m.default.example.com
Port: :80

LatestRevision: bloom-560m-predictor-00002
LatestPrecent: 10
PrevRevision: bloom-560m-predictor-00001
PrevPrecent: 90

Instances:
NAME STATUS AGE READY RESTARTS GPU NODE
---- ------ --- ----- -------- --- ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8 Running 19m 2/2 0 1 192.168.5.241
bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx Running 2m 2/2 0 1 192.168.5.239
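
You can also inspect the traffic split on the underlying InferenceService directly (a sketch; the columns shown depend on your KServe version):

$ kubectl get inferenceservice bloom-560m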

6\. Promote the canary model

If the canary model is healthy/passes your tests, you can set canary-traffic-percent to 100.

$ arena serve update kserve \
--name bloom-560m \
--canary-traffic-percent=100

Now all traffic goes to revision 2 for the new model. The pods for revision 1 automatically scale down to 0 as they are no longer receiving traffic.

$ arena serve get bloom-560m
Name: bloom-560m
Namespace: default
Type: KServe
Version: 00002
Desired: 2
Available: 2
Age: 26m
Address: http://bloom-560m.default.example.com
Port: :80

LatestRevision: bloom-560m-predictor-00002
LatestPrecent: 100

Instances:
NAME STATUS AGE READY RESTARTS GPU NODE
---- ------ --- ----- -------- --- ----
bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8 Terminating 22m 1/2 0 0 192.168.5.241
bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx Running 5m 2/2 0 1 192.168.5.239

7\. Delete the KServe job

$ arena serve delete bloom-560m

140 changes: 140 additions & 0 deletions docs/serving/kserve/sklearn.md
@@ -0,0 +1,140 @@
# KServe job with supported serving runtime

This guide walks through the steps to deploy and serve a supported serving runtime with KServe.

1\. Setup

Follow the [KServe Guide](https://kserve.github.io/website/master/admin/serverless/serverless/) to install KServe.

2\. Submit your serving job to KServe

Deploy an InferenceService with a predictor that loads a scikit-learn model.

$ arena serve kserve \
--name=sklearn-iris \
--model-format=sklearn \
--storage-uri=gs://kfserving-examples/models/sklearn/1.0/model

inferenceservice.serving.kserve.io/sklearn-iris created
INFO[0009] The Job sklearn-iris has been submitted successfully
INFO[0009] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status
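
For reference, the resulting resource is roughly equivalent to an InferenceService like the following (a sketch, not necessarily the exact manifest arena generates):

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model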

3\. Check the status of the KServe job

$ arena serve list
NAME TYPE VERSION DESIRED AVAILABLE ADDRESS PORTS
sklearn-iris KServe 00001 1 1 http://sklearn-iris.default.example.com :80

$ arena serve get sklearn-iris
Name: sklearn-iris
Namespace: default
Type: KServe
Version: 00001
Desired: 1
Available: 1
Age: 3m
Address: http://sklearn-iris.default.example.com
Port: :80

LatestRevision: sklearn-iris-predictor-00001
LatestPrecent: 100

Instances:
NAME STATUS AGE READY RESTARTS NODE
---- ------ --- ----- -------- ----
sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84 Running 3m 2/2 0 192.168.5.239

4\. Perform inference

First, prepare your inference input request inside a file:

$ cat <<EOF > "./iris-input.json"
{
"instances": [
[6.8, 2.8, 4.8, 1.4],
[6.0, 3.4, 4.5, 1.6]
]
}
EOF

You can curl the ingress gateway external IP, using the Host header to route the request to the service.

$ curl -H "Host: sklearn-iris.default.example.com" http://${INGRESS_HOST}:80/v1/models/sklearn-iris:predict -d @./iris-input.json
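
The response should look similar to the following (the exact predictions depend on the model):

{"predictions": [1, 1]}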

5\. Update the InferenceService with the canary rollout strategy

Add the canaryTrafficPercent field to the predictor component and update the storageUri to use a new/updated model.

$ arena serve update kserve \
--name sklearn-iris \
--canary-traffic-percent=10 \
--storage-uri=gs://kfserving-examples/models/sklearn/1.0/model-2

After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

$ arena serve get sklearn-iris
Name: sklearn-iris
Namespace: default
Type: KServe
Version: 00002
Desired: 2
Available: 2
Age: 26m
Address: http://sklearn-iris.default.example.com
Port: :80

LatestRevision: sklearn-iris-predictor-00002
LatestPrecent: 10
PrevRevision: sklearn-iris-predictor-00001
PrevPrecent: 90

Instances:
NAME STATUS AGE READY RESTARTS NODE
---- ------ --- ----- -------- ----
sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84 Running 25m 2/2 0 192.168.5.239
sklearn-iris-predictor-00002-deployment-7f677b9fd6-2dtpg Running 3m 2/2 0 192.168.5.241

6\. Promote the canary model

If the canary model is healthy/passes your tests, you can set canary-traffic-percent to 100.

$ arena serve update kserve \
--name sklearn-iris \
--canary-traffic-percent=100

Now all traffic goes to revision 2 for the new model. The pods for revision 1 automatically scale down to 0 as they are no longer receiving traffic.

$ arena serve get sklearn-iris
Name: sklearn-iris
Namespace: default
Type: KServe
Version: 00002
Desired: 1
Available: 1
Age: 32m
Address: http://sklearn-iris.default.example.com
Port: :80

LatestRevision: sklearn-iris-predictor-00002
LatestPrecent: 100

Instances:
NAME STATUS AGE READY RESTARTS NODE
---- ------ --- ----- -------- ----
sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84 Terminating 31m 1/2 0 192.168.5.239
sklearn-iris-predictor-00002-deployment-7f677b9fd6-2dtpg Running 9m 2/2 0 192.168.5.241

7\. Delete the KServe job

$ arena serve delete sklearn-iris

3 changes: 3 additions & 0 deletions pkg/serving/delete.go
@@ -19,6 +19,9 @@ func DeleteServingJob(namespace, name, version string, jobType types.ServingJobT
return err
}
nameWithVersion := fmt.Sprintf("%v-%v", job.Name(), job.Version())
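// KServe resources are created with the job name only (no version suffix), so delete them by the plain job name.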
if job.Type() == types.KServeJob {
nameWithVersion = job.Name()
}
servingType := string(job.Type())
err = workflow.DeleteJob(nameWithVersion, namespace, servingType)
if err != nil {
