Hi, author, thank you for sharing your fantastic work.
When I was reading about mamba, I found that mamba-mini says GLA is a special case of mamba: when the dimension of w is reduced from [batch, seqlen, dstate, dim] to [batch, seqlen, dstate], the two are equivalent.
The author of VMamba also suggests this in the arXiv paper:
[image: excerpt from the VMamba arXiv paper]
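For concreteness, here is a minimal sketch of the elementwise gated linear recurrence that claim refers to (my own toy code with made-up names, not code from mamba-mini or the GLA repo). The only difference between the two settings is whether w carries a separate decay per (dstate, dim) entry (mamba-style) or one per dstate channel that is broadcast over dim (GLA-style):

```python
import torch

def gated_recurrence(w, x, B, C):
    # w: decay gate -- mamba-style: [batch, seqlen, dstate, dim]
    #                  GLA-style:   [batch, seqlen, dstate] (broadcast over dim)
    # x: input             [batch, seqlen, dim]
    # B: input projection  [batch, seqlen, dstate]
    # C: output projection [batch, seqlen, dstate]
    batch, seqlen, dim = x.shape
    dstate = B.shape[-1]
    h = x.new_zeros(batch, dstate, dim)   # recurrent state
    ys = []
    for t in range(seqlen):
        w_t = w[:, t]
        if w_t.dim() == 2:                # GLA case: one decay per state channel
            w_t = w_t.unsqueeze(-1)       # [batch, dstate, 1], broadcasts over dim
        # h_t = w_t * h_{t-1} + B_t x_t^T  (outer product over dstate x dim)
        h = w_t * h + B[:, t].unsqueeze(-1) * x[:, t].unsqueeze(1)
        ys.append(torch.einsum('bn,bnd->bd', C[:, t], h))  # y_t = C_t^T h_t
    return torch.stack(ys, dim=1)         # [batch, seqlen, dim]
```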
Then I found that in SSD, the core component of mamba2, the dimension of the matrix $A \odot dt$ is further reduced to [..., nheads], which may suggest that the matrix w has been reduced to [batch, seqlen] per head. So my question is: is mamba2 a special case of GLA?
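To illustrate that shape collapse (an assumed sketch of the SSD convention with hypothetical sizes, not code from mamba2): A is one scalar per head and dt is [batch, seqlen, nheads], so the decay $\exp(A \odot dt)$ is a single scalar per head per step, broadcast over the entire (dstate, headdim) state:

```python
import torch

batch, seqlen, nheads, dstate, headdim = 2, 16, 4, 8, 8

A  = -torch.rand(nheads)               # one scalar per head (kept negative for stability)
dt = torch.rand(batch, seqlen, nheads)
w  = torch.exp(A * dt)                 # [batch, seqlen, nheads]: one scalar decay per head per step

x = torch.randn(batch, seqlen, nheads, headdim)
B = torch.randn(batch, seqlen, nheads, dstate)
h = torch.zeros(batch, nheads, dstate, headdim)
for t in range(seqlen):
    # the same scalar w[b, t, head] multiplies every (dstate, headdim) entry of that head's state
    h = w[:, t, :, None, None] * h + B[:, t].unsqueeze(-1) * x[:, t].unsqueeze(-2)
```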
Moreover, I ran an experiment comparing mamba2 and GLA, and found that they produce almost the same result, with only numerical differences (on the order of 1e-5).
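As a rough illustration of why such a match is expected (my own toy check, not the actual experiment): feeding the generic `gated_recurrence` sketch above a GLA-style decay that is constant across dstate should agree with the fully broadcast mamba-style decay up to floating-point error:

```python
import torch

batch, seqlen, dstate, dim = 2, 16, 8, 4
x = torch.randn(batch, seqlen, dim)
B = torch.randn(batch, seqlen, dstate)
C = torch.randn(batch, seqlen, dstate)

w_scalar = torch.rand(batch, seqlen)                               # mamba2-style: one scalar per step
w_gla    = w_scalar[:, :, None].expand(-1, -1, dstate)             # GLA-style, constant over dstate
w_mamba  = w_scalar[:, :, None, None].expand(-1, -1, dstate, dim)  # mamba-style, fully broadcast

y_gla = gated_recurrence(w_gla, x, B, C)    # uses the sketch above
y_m2  = gated_recurrence(w_mamba, x, B, C)
torch.testing.assert_close(y_gla, y_m2)     # matches up to float error
```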
So, how should we understand the relationship between mamba1, mamba2, and GLA?
alpacaduby changed the title from "What is the difference between mamba2 and GLA?" to "How to understand the relationship between mamba2 and GLA?" on Jun 10, 2024.