Dear

I am trying to load the full model on a node with 8× A100 80 GB GPUs using the command below:

docker run --gpus all --shm-size 1g -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --max-input-length 8000 --max-total-tokens 8010

However, it is not using all of the GPUs. I also looked at num_shard, but I did not understand how to use it.

Can you help me use all of the GPUs and optimize the command above? The main concern is that we need to reduce inference time to a production-grade level.

Thanks
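
For reference, a minimal sketch of the same command with model sharding enabled via TGI's --num-shard launcher option, which splits the weights across GPUs using tensor parallelism. The shard count of 8 and the token limits below are assumptions based on the setup described above:

# Sketch: shard the model across all 8 GPUs (assumes 8 visible GPUs on the node)
docker run --gpus all --shm-size 1g \
  -e HUGGING_FACE_HUB_TOKEN=$token \
  -p 8080:80 \
  -v $volume:/data \
  ghcr.io/huggingface/text-generation-inference:2.0 \
  --model-id $model \
  --num-shard 8 \
  --max-input-length 8000 \
  --max-total-tokens 8010

With --num-shard 8, each GPU holds a slice of the model and participates in every forward pass; the --shm-size 1g setting already in the command is needed for the inter-GPU (NCCL) communication that sharding relies on.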