diff --git a/docs/serving/index.md b/docs/serving/index.md
index d269f4f00..07b9ed83d 100644
--- a/docs/serving/index.md
+++ b/docs/serving/index.md
@@ -40,3 +40,8 @@ If you want to use arena to manage serving jobs, this guide is for you. we have
 * I want to [submit a nvidia triton serving job which use gpus](triton/serving.md).
 * I want to [update a nvidia triton serving job after deployed](triton/update-serving.md).
+
+## KServe Job Guide
+
+* I want to [submit a KServe job with a supported serving runtime](kserve/sklearn.md).
+* I want to [submit a KServe job with a custom serving runtime](kserve/custom.md).
diff --git a/docs/serving/kserve/custom.md b/docs/serving/kserve/custom.md
new file mode 100644
index 000000000..7bfdd0a61
--- /dev/null
+++ b/docs/serving/kserve/custom.md
@@ -0,0 +1,142 @@
# KServe job with a custom serving runtime

This guide walks through the steps to deploy and serve a custom serving runtime with KServe.

1\. Setup

Follow the [KServe Guide](https://kserve.github.io/website/master/admin/serverless/serverless/) to install KServe.

2\. Submit your serving job to KServe

First create a PVC named 'training-data' and download the 'bloom-560m' model from Hugging Face into it.

Then deploy an InferenceService with a predictor that loads the bloom model with text-generation-inference:

    $ arena serve kserve \
        --name=bloom-560m \
        --image=ghcr.io/huggingface/text-generation-inference:1.0.2 \
        --gpus=1 \
        --cpu=12 \
        --memory=50Gi \
        --port=8080 \
        --env=STORAGE_URI=pvc://training-data \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m --num-shard 1 -p 8080"

    inferenceservice.serving.kserve.io/bloom-560m created
    INFO[0010] The Job bloom-560m has been submitted successfully
    INFO[0010] You can run `arena serve get bloom-560m --type kserve -n default` to check the job status

3\. Check the status of the KServe job

    $ arena serve list
    NAME        TYPE    VERSION  DESIRED  AVAILABLE  ADDRESS                                 PORTS  GPU
    bloom-560m  KServe  00001    1        1          http://bloom-560m.default.example.com   :80    1

    $ arena serve get bloom-560m
    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00001
    Desired:    1
    Available:  1
    Age:        7m
    Address:    http://bloom-560m.default.example.com
    Port:       :80
    GPU:        1

    LatestRevision:     bloom-560m-predictor-00001
    LatestPrecent:      100

    Instances:
      NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Running  7m   2/2    0         1    192.168.5.241

4\. Perform inference

You can curl the ingress gateway's external IP, passing the Host header:

    $ curl -H "Host: bloom-560m.default.example.com" http://${INGRESS_HOST}:80/generate \
        -X POST \
        -d '{"inputs":"What is Deep Learning?","parameters":{"max_new_tokens":17}}' \
        -H 'Content-Type: application/json'

    {"generated_text":" Deep Learning is a new type of machine learning that is used to solve complex problems."}

5\. Update the InferenceService with the canary rollout strategy

Add the canaryTrafficPercent field to the predictor component and update the command to use the new model path /mnt/models/bloom-560m-v2:

    $ arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=10 \
        "text-generation-launcher --disable-custom-kernels --model-id /mnt/models/bloom-560m-v2 --num-shard 1 -p 8080"

After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.
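You can also verify the split against the underlying InferenceService object. The check below is a minimal sketch, assuming `kubectl` access to the cluster and the `default` namespace; on recent KServe versions, the `PREV` and `LATEST` printer columns show the traffic percentages of the previous and latest revisions:

    $ kubectl get inferenceservice bloom-560m -n default

arena reports the same split: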
    $ arena serve get bloom-560m
    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        26m
    Address:    http://bloom-560m.default.example.com
    Port:       :80

    LatestRevision:     bloom-560m-predictor-00002
    LatestPrecent:      10
    PrevRevision:       bloom-560m-predictor-00001
    PrevPrecent:        90

    Instances:
      NAME                                                    STATUS   AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------   ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Running  19m  2/2    0         1    192.168.5.241
      bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx  Running  2m   2/2    0         1    192.168.5.239

6\. Promote the canary model

If the canary model is healthy and passes your tests, you can set canary-traffic-percent to 100:

    $ arena serve update kserve \
        --name bloom-560m \
        --canary-traffic-percent=100

Now all traffic goes to revision 2, which serves the new model. The pods for revision 1 automatically scale down to 0, since they no longer receive traffic:

    $ arena serve get bloom-560m
    Name:       bloom-560m
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        26m
    Address:    http://bloom-560m.default.example.com
    Port:       :80

    LatestRevision:     bloom-560m-predictor-00002
    LatestPrecent:      100

    Instances:
      NAME                                                    STATUS       AGE  READY  RESTARTS  GPU  NODE
      ----                                                    ------       ---  -----  --------  ---  ----
      bloom-560m-predictor-00001-deployment-56b8bdbf87-sg8v8  Terminating  22m  1/2    0         0    192.168.5.241
      bloom-560m-predictor-00002-deployment-84dbb64cc4-647wx  Running      5m   2/2    0         1    192.168.5.239

7\. Delete the KServe job

    $ arena serve delete bloom-560m
diff --git a/docs/serving/kserve/sklearn.md b/docs/serving/kserve/sklearn.md
new file mode 100644
index 000000000..450b48ce9
--- /dev/null
+++ b/docs/serving/kserve/sklearn.md
@@ -0,0 +1,140 @@
# KServe job with a supported serving runtime

This guide walks through the steps to deploy and serve a supported serving runtime with KServe.

1\. Setup

Follow the [KServe Guide](https://kserve.github.io/website/master/admin/serverless/serverless/) to install KServe.

2\. Submit your serving job to KServe

Deploy an InferenceService with a predictor that loads a scikit-learn model:

    $ arena serve kserve \
        --name=sklearn-iris \
        --model-format=sklearn \
        --storage-uri=gs://kfserving-examples/models/sklearn/1.0/model

    inferenceservice.serving.kserve.io/sklearn-iris created
    INFO[0009] The Job sklearn-iris has been submitted successfully
    INFO[0009] You can run `arena serve get sklearn-iris --type kserve -n default` to check the job status

3\. Check the status of the KServe job

    $ arena serve list
    NAME          TYPE    VERSION  DESIRED  AVAILABLE  ADDRESS                                   PORTS
    sklearn-iris  KServe  00001    1        1          http://sklearn-iris.default.example.com  :80

    $ arena serve get sklearn-iris
    Name:       sklearn-iris
    Namespace:  default
    Type:       KServe
    Version:    00001
    Desired:    1
    Available:  1
    Age:        3m
    Address:    http://sklearn-iris.default.example.com
    Port:       :80

    LatestRevision:     sklearn-iris-predictor-00001
    LatestPrecent:      100

    Instances:
      NAME                                                      STATUS   AGE  READY  RESTARTS  NODE
      ----                                                      ------   ---  -----  --------  ----
      sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84  Running  3m   2/2    0         192.168.5.239

4\. Perform inference

First, prepare your inference input request inside a file:

    $ cat <<EOF > "./iris-input.json"
    {
      "instances": [
        [6.8, 2.8, 4.8, 1.4],
        [6.0, 3.4, 4.5, 1.6]
      ]
    }
    EOF

You can then curl the ingress gateway's external IP, passing the Host header.
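If `INGRESS_HOST` is not already set in your shell, you can resolve it from the ingress gateway service. This is a sketch that assumes the Istio ingress gateway from the serverless install referenced above, exposed through a LoadBalancer service; adjust the namespace and service name to your ingress setup:

    $ export INGRESS_HOST=$(kubectl -n istio-system get service istio-ingressgateway -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

With `INGRESS_HOST` set, send the prediction request: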
    $ curl -H "Host: sklearn-iris.default.example.com" http://${INGRESS_HOST}:80/v1/models/sklearn-iris:predict -d @./iris-input.json

5\. Update the InferenceService with the canary rollout strategy

Add the canaryTrafficPercent field to the predictor component and update the storageUri to point at the new model:

    $ arena serve update kserve \
        --name sklearn-iris \
        --canary-traffic-percent=10 \
        --storage-uri=gs://kfserving-examples/models/sklearn/1.0/model-2

After rolling out the canary model, traffic is split between the latest ready revision 2 and the previously rolled out revision 1.

    $ arena serve get sklearn-iris
    Name:       sklearn-iris
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    2
    Available:  2
    Age:        26m
    Address:    http://sklearn-iris.default.example.com
    Port:       :80

    LatestRevision:     sklearn-iris-predictor-00002
    LatestPrecent:      10
    PrevRevision:       sklearn-iris-predictor-00001
    PrevPrecent:        90

    Instances:
      NAME                                                      STATUS   AGE  READY  RESTARTS  NODE
      ----                                                      ------   ---  -----  --------  ----
      sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84  Running  25m  2/2    0         192.168.5.239
      sklearn-iris-predictor-00002-deployment-7f677b9fd6-2dtpg  Running  3m   2/2    0         192.168.5.241

6\. Promote the canary model

If the canary model is healthy and passes your tests, you can set canary-traffic-percent to 100:

    $ arena serve update kserve \
        --name sklearn-iris \
        --canary-traffic-percent=100

Now all traffic goes to revision 2, which serves the new model. The pods for revision 1 automatically scale down to 0, since they no longer receive traffic:

    $ arena serve get sklearn-iris
    Name:       sklearn-iris
    Namespace:  default
    Type:       KServe
    Version:    00002
    Desired:    1
    Available:  1
    Age:        32m
    Address:    http://sklearn-iris.default.example.com
    Port:       :80

    LatestRevision:     sklearn-iris-predictor-00002
    LatestPrecent:      100

    Instances:
      NAME                                                      STATUS       AGE  READY  RESTARTS  NODE
      ----                                                      ------       ---  -----  --------  ----
      sklearn-iris-predictor-00001-deployment-7b4677c6b7-8cr84  Terminating  31m  1/2    0         192.168.5.239
      sklearn-iris-predictor-00002-deployment-7f677b9fd6-2dtpg  Running      9m   2/2    0         192.168.5.241

7\. Delete the KServe job

    $ arena serve delete sklearn-iris
diff --git a/pkg/serving/delete.go b/pkg/serving/delete.go
index 658600eb8..c0a7ef47a 100644
--- a/pkg/serving/delete.go
+++ b/pkg/serving/delete.go
@@ -19,6 +19,9 @@ func DeleteServingJob(namespace, name, version string, jobType types.ServingJobT
 		return err
 	}
 	nameWithVersion := fmt.Sprintf("%v-%v", job.Name(), job.Version())
+	if job.Type() == types.KServeJob {
+		nameWithVersion = job.Name()
+	}
 	servingType := string(job.Type())
 	err = workflow.DeleteJob(nameWithVersion, namespace, servingType)
 	if err != nil {
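The behavior this hunk adds can be read as a small naming helper. The sketch below is illustrative only, not code from the patch: `servingJob` stands in for arena's actual job type, and the `"KServe"` string stands in for `types.KServeJob`.

    package serving

    import "fmt"

    // servingJob is a stand-in for the job interface visible in delete.go;
    // only the methods used by the patch are assumed here.
    type servingJob interface {
    	Name() string
    	Version() string
    	Type() string
    }

    // deleteTargetName mirrors the rule in the hunk above: a KServe job is
    // deleted by its bare name (the docs show "bloom-560m" keeping one name
    // from Version 00001 to 00002, since KServe tracks revisions itself),
    // while other serving types keep the "<name>-<version>" convention.
    func deleteTargetName(job servingJob) string {
    	if job.Type() == "KServe" {
    		return job.Name() // e.g. "sklearn-iris"
    	}
    	return fmt.Sprintf("%v-%v", job.Name(), job.Version())
    }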