[Bug]: CUDA error when running mistral-7b + lora with tensor_para=8 #4756
Your current environment

🐛 Describe the bug

When running the code below, enabling LoRA for the mistral-7b model with tensor_parallel_size=8 throws a CUDA error. The full log is here.
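A minimal sketch of the failing invocation, assuming it matches the call @mgoin uses below with tensor_parallel_size=8 swapped in (a hypothetical repro, not the original snippet):

```python
from vllm import LLM

# Assumed repro: same setup as the working TP=2 run below,
# but with tensor parallelism of 8, which is the failing case.
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    enable_lora=True,
    tensor_parallel_size=8,
)
print(llm.generate("Hello"))
```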
Hi @sfc-gh-zhwang, FWIW I was able to run this with TP=2 on 2xA6000 using:

```python
from vllm import LLM

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    enable_lora=True,
    tensor_parallel_size=2,
)
print(llm.generate("Hello"))
```

Output:
@mgoin it's just that tp=8 doesn't work.
@FurtherAI in case you have some ideas 😃
Narrowed it down further to this line, where, for tp=8 (errors out), the tensor sizes are:

while for tp=4 (working), the tensor sizes are:

Still trying to figure out what the magic is around 1024 -> 512.
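One guess at the 1024 -> 512 pattern, not confirmed in the thread: Mistral-7B has hidden_size = 4096, and if the LoRA input dimension is sharded evenly across tensor-parallel ranks, the per-rank width shrinks as TP grows:

```python
# Sketch under the assumption that the LoRA input dim
# (hidden_size = 4096 for Mistral-7B) is split evenly across TP ranks.
hidden_size = 4096
for tp in (1, 2, 4, 8):
    print(f"tp={tp}: per-rank width = {hidden_size // tp}")
# tp=4 gives 1024 and tp=8 gives 512, matching the sizes above.
```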
Tracked it a little further. It seems to be due to the sequence length. Not sure why; from a brief glance, the kernel shouldn't care about the sequence length. I found 65536 to work, 32768 and 16384 not to work, and 8192 and 4096 to work; I didn't test more. So for now, @sfc-gh-zhwang, run with a different seq length. Here's some code to reproduce:

```python
import torch
import vllm._punica_C as punica_kernels

seq_length, rank = 32768, 16  # 32768 is one of the failing lengths
buffer = torch.randn((seq_length, rank), device='cuda', dtype=torch.float32)
x = torch.randn((seq_length, 512), device='cuda', dtype=torch.bfloat16)
wa_t_all = torch.randn((1, 1, rank, 512), device='cuda', dtype=torch.bfloat16)
indicies = torch.full((seq_length,), 1, device='cuda', dtype=torch.int64)
punica_kernels.dispatch_bgmv(buffer, x, wa_t_all, indicies, 0, 1.0)
torch.cuda.synchronize()  # forces the asynchronous CUDA error to surface
```
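To map out which lengths fail, a quick per-length probe like the hypothetical sweep below could help; note that a hard CUDA error usually corrupts the CUDA context, so each length is best probed in a fresh process:

```python
import sys
import torch
import vllm._punica_C as punica_kernels

def probe(seq_length, rank=16, width=512):
    # Same dispatch_bgmv call as the repro above, for one sequence length.
    buffer = torch.randn((seq_length, rank), device='cuda', dtype=torch.float32)
    x = torch.randn((seq_length, width), device='cuda', dtype=torch.bfloat16)
    wa_t_all = torch.randn((1, 1, rank, width), device='cuda', dtype=torch.bfloat16)
    indicies = torch.full((seq_length,), 1, device='cuda', dtype=torch.int64)
    punica_kernels.dispatch_bgmv(buffer, x, wa_t_all, indicies, 0, 1.0)
    torch.cuda.synchronize()

if __name__ == "__main__":
    # Run as e.g. `python probe.py 32768`; a crash here means that length fails.
    probe(int(sys.argv[1]))
    print("ok")
```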