
mamba2 training speed is very very very slow #389

Open
with45 opened this issue Jun 12, 2024 · 8 comments

@with45 commented Jun 12, 2024

I changed the model from Mamba to Mamba2, and training speed is now very slow. Why?

@catalpaaa

For my task (image classification), Mamba 1 takes 40 minutes per epoch on an RTX 6000 Ada, while Mamba 2 takes only 20.

Mamba 2 also uses less memory!

@vasqu commented Jun 12, 2024

See #355

That said, I've also run into issues similar to those described later in that issue (graph compilation errors). I'd also suggest torch==2.2.0 with triton==2.2.0 (no idea why, but it ran faster than 2.3.0 in my case).
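If you want to try that combination, pinning both packages should do it (assuming a pip-based setup; pick the torch wheel that matches your CUDA version):

pip install torch==2.2.0 triton==2.2.0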

@Gaodzlearn

I also encountered this problem. Running the demo takes around 30 seconds:

Code

from mamba_ssm import Mamba2
import torch
import time

# Create a random input tensor
x = torch.randn(1, 4, 256).to("cuda")
dim = 256

model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=64,  # SSM state expansion factor, typically 64 or 128
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")
t1 = time.time()
y = model(x)
assert y.shape == x.shape
print(f"Time taken: {time.time() - t1:.3f} s")

Output

Time taken: 32.440 s

Environment Information

GPU: NVIDIA A6000
CPU: AMD EPYC 7513
System: Ubuntu 20.04.6 LTS
Python: 3.9
CUDA: 11.8
Pytorch: 2.3.1
Triton: 2.3.1
Transformers: 4.41.2

Additional

I tried adding a decorator in ssd_combined.py as suggested by @Kiet0712 in this comment, but it resulted in a bug similar to what @arelkeselbri described in this comment.

Is this inference speed for the demo normal? Or is there something wrong with my code? I would appreciate any help or suggestions!

@tridao (Collaborator) commented Jun 16, 2024

Try warming up by running it once first. The first call invokes the Triton compiler and autotuner, so it will be slow.
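For reference, a minimal timing sketch along those lines (same toy Mamba2 setup as the demo above; the torch.cuda.synchronize() calls are added here because CUDA kernel launches are asynchronous, so timing without them can be misleading):

from mamba_ssm import Mamba2
import torch
import time

x = torch.randn(1, 4, 256).to("cuda")
model = Mamba2(d_model=256, d_state=64, d_conv=4, expand=2).to("cuda")

# First call: triggers Triton compilation and autotuning (slow, once per process)
t0 = time.time()
y = model(x)
torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
print(f"Warm-up (compile + autotune): {time.time() - t0:.3f} s")

# Second call: reuses the compiled kernels
t1 = time.time()
y = model(x)
torch.cuda.synchronize()
print(f"Steady state: {time.time() - t1:.3f} s")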

@Gaodzlearn

> Try warming up by running it once first. The first call invokes the Triton compiler and autotuner, so it will be slow.

Thank you so much! The second forward pass takes only 0.005 s.

@JHChen1 commented Jun 17, 2024

> I also encountered this problem. Running the demo takes around 30 seconds: […]

I tested with the same code and ran it multiple times, but the speed didn't improve:

  1. first: Time taken: 246.404 s
  2. second: Time taken: 245.945 s
  3. third: Time taken: 256.347 s

@Gaodzlearn

The compilation happens every time you launch python demo.py, since each new process recompiles and re-autotunes the kernels. Try running the forward pass twice in the same script, like:

from mamba_ssm import Mamba2
import torch
import time

# Create a random input tensor
x = torch.randn(1, 4, 256).to("cuda")
dim = 256

model = Mamba2(
    # This module uses roughly 3 * expand * d_model^2 parameters
    d_model=dim, # Model dimension d_model
    d_state=64,  # SSM state expansion factor, typically 64 or 128
    d_conv=4,    # Local convolution width
    expand=2,    # Block expansion factor
).to("cuda")

# warm up
y = model(x)

t1 = time.time()
y = model(x)
assert y.shape == x.shape
print(f"Time taken: {time.time() - t1:.3f} s")

@JHChen1 commented Jun 17, 2024

@Gaodzlearn The problem is solved; thank you very much for your reply.
