Can't pass workers_per_resource to the bentoml container #901

Open
hahmad2008 opened this issue Feb 12, 2024 · 2 comments


hahmad2008 commented Feb 12, 2024

Describe the bug

I have a machine with two GPUs. I ran the model with the openllm start command and everything went well:
CUDA_VISIBLE_DEVICES=0,1 TRANSFORMERS_OFFLINE=1 openllm start mistral --model-id mymodel --dtype float16 --gpu-memory-utilization 0.95 --workers-per-resource 0.5

  • In this case two processes appear, one on each of the two GPUs: one for the service and another for the Ray instance.

When I run the start command without --gpu-memory-utilization 0.95 --workers-per-resource 0.5, only one GPU runs the service and a CUDA out-of-memory error occurs.
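(My working assumption, which I have not confirmed in the code, is that --workers-per-resource 0.5 schedules 0.5 workers per GPU, so a single vLLM worker claims 1 / 0.5 = 2 GPUs and the model is sharded across both cards through Ray, while the default of one worker per GPU tries to fit the whole model on a single card, which would explain the out-of-memory error.)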

However, when I build the image and follow the steps below to create the container, running the Docker image raises a CUDA out-of-memory error, just like the second case without these args: --gpu-memory-utilization 0.95 --workers-per-resource 0.5 (see the configuration sketch after the steps below).

Steps:

  • openllm build mymodel --backend vllm --serialization safetensors
  • bentoml containerize mymodel-service:12345 --opt progress=plain
  • docker run --rm --gpus all -p 3000:3000 -it mymodel-service:12345
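For completeness, the following is the kind of configuration I would expect to be able to hand to the container. It assumes that the generated image honours BentoML's BENTOML_CONFIG environment variable and that runners.workers_per_resource is a valid key in that configuration file; I have not verified either assumption, the mount path is arbitrary, and I don't know of an equivalent knob for --gpu-memory-utilization:

    # bentoml_configuration.yaml -- assumed schema, not verified
    runners:
      workers_per_resource: 0.5

    # mount the file and point BENTOML_CONFIG at it when starting the container
    docker run --rm --gpus all -p 3000:3000 \
      -v $(pwd)/bentoml_configuration.yaml:/home/bentoml/configuration.yaml \
      -e BENTOML_CONFIG=/home/bentoml/configuration.yaml \
      -it mymodel-service:12345

If there is a supported way to get workers_per_resource into the containerized service, that is really what I'm asking for here.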

To reproduce

No response

Logs

No response

Environment

$ bentoml -v
bentoml, version 1.1.11

$ openllm -v
openllm, 0.4.45.dev2 (compiled: False)
Python (CPython) 3.11.7

System information (Optional)

No response

hahmad2008 (Author) commented

@aarnphm What is the difference between the previous two cases, such that the first case launches two processes, one for the Ray worker and one for the BentoML service (i.e., when using --gpu-memory-utilization 0.95 --workers-per-resource 0.5)?

jeremyadamsfisher commented

Same issue: #872

Labels: None yet
Projects: None yet
Development: No branches or pull requests
3 participants