psutil 5.9.6 seems to be throwing ZombieProcess when retrieving the mms process #132
Comments
I am having the same issue with sagemaker-inference 1.10.1 and multi-model-server 1.1.11.
Same problem with |
Try updating the Python version as well; I updated the Ubuntu version of my Docker image. @andre-marcos-perez
Hey, installing:
RUN pip3 install --upgrade pip && \
    pip3 install multi-model-server==1.1.8 && \
    pip3 install psutil==5.9.5 && \
    pip3 install sagemaker-inference==1.7.1
Likely solved by #133
Describe the bug
We use a custom image for our SageMaker endpoint, and on Friday, Oct 20, 2023, we experienced instability in our endpoint after re-deploying. It seems that the latest version of psutil, 5.9.6, throws ZombieProcess more frequently, causing the server to restart. As a result, the endpoint occasionally returns non-200 responses when predictions are requested.
The change in psutil may be this fix on their end, which affects what they recognize as a ZombieProcess:
giampaolo/psutil#2288
We were able to resolve our issue by rolling back to psutil 5.9.5. So, I'm unsure if sagemaker-inference should pin the version of psutil in your package or if the fix needs to be done here:
https://github.com/aws/sagemaker-inference-toolkit/blob/master/src/sagemaker_inference/model_server.py#L276
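For illustration only, here is a minimal sketch of how a process lookup could tolerate zombie processes instead of letting the exception propagate. It is not the toolkit's actual code: the function name `find_processes` and the `token` parameter are hypothetical, and it assumes the lookup scans `psutil.process_iter()` and inspects each process's `cmdline()`.

```python
import psutil


def find_processes(token, processes=None):
    """Return processes whose command line contains `token`.

    Hypothetical helper: processes that cannot be inspected (zombies,
    already-exited processes, or ones we lack permission for) are skipped.
    """
    if processes is None:
        processes = psutil.process_iter()

    matches = []
    for process in processes:
        try:
            cmdline = process.cmdline()
        except (psutil.ZombieProcess, psutil.NoSuchProcess, psutil.AccessDenied):
            # The process cannot be the live server we are looking for,
            # so ignore it rather than failing the whole lookup.
            continue
        if token in cmdline:
            matches.append(process)
    return matches
```

Calling `find_processes("some.server.Namespace")` (a made-up token) would then return only the live processes whose argument list contains that exact string. Whether skipping zombies is the right behavior for the model server is a judgment for the maintainers.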
To reproduce
Create a custom SageMaker endpoint image with psutil 5.9.6 and deploy it.
Expected behavior
The model endpoint is stable and consistently returns successful predictions and the ZombieProcess exception is not being raised frequently.
Screenshots or logs
Here is a traceback we are seeing:
System information
Additional context
n/a