This repo is for load testing the FastChat OpenAI-compatible REST API. Before giving users access to the API, we should have some understanding of the load it can handle.
The Python script fastchat-api-load-testing.py
is executed via a Kubernetes indexed job,
which divides the input chat_prompts
among several pods running in parallel to simulate the desired load.
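In an indexed job, Kubernetes injects the pod's index via the JOB_COMPLETION_INDEX environment variable, which each pod can use to pick its own slice of the prompts. A minimal sketch of that sharding logic (the prompt list and the TOTAL_PODS variable here are illustrative assumptions, not names from the actual script):

```python
import os

def shard(prompts, index, total):
    """Return the slice of prompts assigned to pod `index` of `total` pods."""
    return [p for i, p in enumerate(prompts) if i % total == index]

# Kubernetes sets JOB_COMPLETION_INDEX in each pod of an indexed job.
index = int(os.environ.get("JOB_COMPLETION_INDEX", "0"))
# TOTAL_PODS is a hypothetical variable standing in for the job's completion count.
total = int(os.environ.get("TOTAL_PODS", "4"))

prompts = [f"prompt-{i}" for i in range(10)]
my_prompts = shard(prompts, index, total)
```

With 4 pods, pod 0 would handle prompts 0, 4, and 8; pod 1 would handle 1, 5, and 9; and so on.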
Create a file env.yaml
based on the provided env-template.yaml,
replacing the value of api_key
with your FastChat API key.
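The resulting file might look something like the following sketch (only the api_key field is named in this README; check env-template.yaml for the full set of keys):

```yaml
# env.yaml -- illustrative; copy env-template.yaml and edit rather than
# writing this from scratch.
api_key: "your-fastchat-api-key"
```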
Then create a secret in your namespace from the env.yaml
file like so:
kubectl -n [your-namespace] create secret generic fastchat-api --from-file=env=env.yaml
If you have not already, clone this repo to your computer.
You can then run the indexed job in your namespace from the root directory of this repo:
kubectl -n [your-namespace] apply -f manifest.yaml
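Once the job's pods are running, each one issues OpenAI-style chat completion requests against the FastChat endpoint. A minimal sketch of building one such request with the standard library (the base URL, environment variable names, and model name are assumptions; the request is constructed here but not sent):

```python
import json
import os
from urllib import request

# Assumed names -- point these at your actual FastChat deployment.
API_BASE = os.environ.get("FASTCHAT_API_BASE", "https://fastchat.example.org/v1")
API_KEY = os.environ.get("FASTCHAT_API_KEY", "changeme")

def build_chat_request(prompt, model="vicuna-13b"):
    """Build (but do not send) an OpenAI-compatible chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{API_BASE}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
    )
```

Sending the request (e.g. with `urllib.request.urlopen`) and timing the response is the core of what a load-test iteration measures.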
GPU usage by the FastChat deployment can be monitored via the Nautilus Grafana dashboard for GPU usage by namespace, filtering by the namespace "sdsu-llm". The FastChat API server can also be monitored via its logs:
kubectl -n sdsu-llm logs -f [pod] -c fastchat-api