
mamba2 training speed is very very very slow #389

Open
with45 opened this issue Jun 12, 2024 · 8 comments

@with45 commented Jun 12, 2024

I changed the model from Mamba to Mamba2, and training speed is now very slow. Why?

@catalpaaa

For my task (image classification), Mamba 1 takes 40 minutes per epoch on an RTX 6000 Ada, while Mamba 2 takes only 20.

Mamba 2 also uses less memory!

@vasqu commented Jun 12, 2024

See #355

That said, I've also run into issues similar to those described later in that issue (graph compilation errors). I'd also suggest torch==2.2.0 with triton==2.2.0 (no idea why, but it ran faster than 2.3.0 in my case).
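If you want to try that combination, pinning both packages should do it (assuming a pip-based setup; pick the torch wheel that matches your CUDA version):

pip install torch==2.2.0 triton==2.2.0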

@Gaodzlearn

I also encountered this problem. Running the demo takes around 30 seconds:

Code

from mamba_ssm import Mamba2
import torch
import time

# Create a random input tensor
x = torch.randn(1, 4, 256).to("cuda")
dim = 256

model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=64,  # SSM state expansion factor, typically 64 or 128
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")
t1 = time.time()
y = model(x)
assert y.shape == x.shape
print(f"Time taken: {time.time() - t1:.3f} s")

Output

Time taken: 32.440 s

Environment Information

GPU: NVIDIA A6000
CPU: AMD EPYC 7513
System: Ubuntu 20.04.6 LTS
Python: 3.9
CUDA: 11.8
Pytorch: 2.3.1
Triton: 2.3.1
Transformers: 4.41.2

Additional

I tried adding a decorator in ssd_combined.py as suggested by @Kiet0712 in this comment, but it resulted in a bug similar to what @arelkeselbri described in this comment.

Is this inference speed for the demo normal? Or is there something wrong with my code? I would appreciate any help or suggestions!

@tridao (Collaborator) commented Jun 16, 2024

Try warming up by running it once first. The first call invokes the Triton compiler and autotuner, so it will be slow.
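For reference, a minimal timing sketch along those lines (same toy Mamba2 setup as the demo above; the torch.cuda.synchronize() calls are added here because CUDA kernel launches are asynchronous, so timing without them can be misleading):

from mamba_ssm import Mamba2
import torch
import time

x = torch.randn(1, 4, 256).to("cuda")
model = Mamba2(d_model=256, d_state=64, d_conv=4, expand=2).to("cuda")

# First call: triggers Triton compilation and autotuning (slow, once per process)
t0 = time.time()
y = model(x)
torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
print(f"Warm-up (compile + autotune): {time.time() - t0:.3f} s")

# Second call: reuses the compiled kernels
t1 = time.time()
y = model(x)
torch.cuda.synchronize()
print(f"Steady state: {time.time() - t1:.3f} s")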

@Gaodzlearn

> Try warming up by running it once first. The first call invokes the Triton compiler and autotuner, so it will be slow.

Thank you so much! The second forward pass takes only 0.005 s.

@JHChen1 commented Jun 17, 2024

> I also encountered this problem. Running the demo takes around 30 seconds: […]

I tested with the same code and ran it multiple times, but the speed didn't improve:

  1. first: Time taken: 246.404 s
  2. second: Time taken: 245.945 s
  3. third: Time taken: 256.347 s

@Gaodzlearn

The compilation happens every time you launch python demo.py, since each new process recompiles and re-autotunes the kernels. Try running the forward pass twice in the same script, like:

from mamba_ssm import Mamba2
import torch
import time

# Create a random input tensor
x = torch.randn(1, 4, 256).to("cuda")
dim = 256

model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=64,  # SSM state expansion factor, typically 64 or 128
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")

# warm up
y = model(x)

t1 = time.time()
y = model(x)
assert y.shape == x.shape
print(f"Time taken: {time.time() - t1:.3f} s")

@JHChen1 commented Jun 17, 2024

@Gaodzlearn The problem is solved; thank you very much for your reply.
