Add documentation for vLLM on multiple GPUs
rlouf committed Apr 15, 2024
1 parent c680744 commit f2af45a
Showing 2 changed files with 23 additions and 0 deletions.
15 changes: 15 additions & 0 deletions docs/reference/models/vllm.md
@@ -84,6 +84,21 @@ model = models.vllm("https://huggingface.co/squeeze-ai-lab/sq-llama-30b-w4-s5",

To use GPTQ models you need to install the AutoGPTQ and Optimum libraries: `pip install auto-gptq optimum`.
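
A minimal sketch of loading a GPTQ-quantized model. The repository name below is only illustrative, and the `quantization` keyword is an assumption that, like `tensor_parallel_size` in the example further down, extra keyword arguments are forwarded to vLLM:

```python
from outlines import models

# Illustrative repository name; any GPTQ-quantized checkpoint should work.
# vLLM can usually also auto-detect the quantization method from the
# model's config, in which case the keyword argument is optional.
model = models.vllm(
    "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ",
    quantization="gptq",
)
```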


### Multi-GPU usage

To run multi-GPU inference with vLLM you need to set the `tensor_parallel_size` argument to the number of available GPUs when initializing the model. For instance, to run inference on 2 GPUs:


```python
from outlines import models

model = models.vllm(
    "mistralai/Mistral-7B-v0.1",
    tensor_parallel_size=2,
)
```

### Load LoRA adapters

You can load LoRA adapters and alternate between them dynamically:
8 changes: 8 additions & 0 deletions docs/reference/serve/vllm.md
@@ -18,6 +18,14 @@ python -m outlines.serve.serve --model="mistralai/Mistral-7B-Instruct-v0.2"

By default this starts a server at `http://127.0.0.1:8000` (check the console output to confirm). If no `--model` argument is set, the OPT-125M model is used; the `--model` argument lets you specify any model of your choosing.

To run inference on multiple GPUs you must pass the `--tensor-parallel-size` argument when initializing the server. For instance, to run inference on 2 GPUs:


```bash
python -m outlines.serve.serve --model="mistralai/Mistral-7B-Instruct-v0.2" --tensor-parallel-size 2
```
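
Once the server is running you can send it generation requests. The sketch below assumes a vLLM-style `/generate` endpoint on the default address and a `regex` constraint field in the JSON payload; adjust both to whatever the server actually exposes:

```python
import requests

# A sketch: the address, the /generate endpoint, and the "regex" field
# are assumptions about the server's request schema.
response = requests.post(
    "http://127.0.0.1:8000/generate",
    json={
        "prompt": "What is the capital of France?",
        "regex": "[A-Z][a-z]+",
    },
)
print(response.json())
```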


### Alternative Method: Via Docker

You can install and run the server with Outlines' official Docker image using the command
