According to the formula in the arXiv paper, the parameter count of the RWKV-4-World 0.1B model ($L=12$, $D=768$, $V=65536$) would be: $$2VD + 13D^2 L + D(11L+4)$$
which, with these values, yields 192,780,288 parameters. But the output is:
emb.weight f32 cpu 65536 768
blocks.0.ln1.weight f32 cpu 768
blocks.0.ln1.bias f32 cpu 768
blocks.0.ln2.weight f32 cpu 768
blocks.0.ln2.bias f32 cpu 768
blocks.0.att.time_decay f32 cpu 768
blocks.0.att.time_first f32 cpu 768
blocks.0.att.time_mix_k f32 cpu 768
blocks.0.att.time_mix_v f32 cpu 768
blocks.0.att.time_mix_r f32 cpu 768
blocks.0.att.key.weight f32 cpu 768 768
blocks.0.att.value.weight f32 cpu 768 768
blocks.0.att.receptance.weight f32 cpu 768 768
blocks.0.att.output.weight f32 cpu 768 768
blocks.0.ffn.time_mix_k f32 cpu 768
blocks.0.ffn.time_mix_r f32 cpu 768
blocks.0.ffn.key.weight f32 cpu 768 3072
blocks.0.ffn.receptance.weight f32 cpu 768 768
blocks.0.ffn.value.weight f32 cpu 3072 768
... (blocks 1 through 10 omitted) ...
blocks.11.ln1.weight f32 cpu 768
blocks.11.ln1.bias f32 cpu 768
blocks.11.ln2.weight f32 cpu 768
blocks.11.ln2.bias f32 cpu 768
blocks.11.att.time_decay f32 cpu 768
blocks.11.att.time_first f32 cpu 768
blocks.11.att.time_mix_k f32 cpu 768
blocks.11.att.time_mix_v f32 cpu 768
blocks.11.att.time_mix_r f32 cpu 768
blocks.11.att.key.weight f32 cpu 768 768
blocks.11.att.value.weight f32 cpu 768 768
blocks.11.att.receptance.weight f32 cpu 768 768
blocks.11.att.output.weight f32 cpu 768 768
blocks.11.ffn.time_mix_k f32 cpu 768
blocks.11.ffn.time_mix_r f32 cpu 768
blocks.11.ffn.key.weight f32 cpu 768 3072
blocks.11.ffn.receptance.weight f32 cpu 768 768
blocks.11.ffn.value.weight f32 cpu 3072 768
ln_out.weight f32 cpu 768
ln_out.bias f32 cpu 768
head.weight f32 cpu 768 65536
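Summing every shape in the dump and comparing with the formula (a quick Python tally, using only the shapes printed above) shows the dump accounts for $2D$ fewer parameters than the formula predicts:

```python
# Model hyperparameters from the post
L, D, V = 12, 768, 65536

# Per-block tensors, shapes as printed in the dump above
per_block = (
    4 * D            # ln1/ln2 weight + bias
    + 5 * D          # att time_decay, time_first, time_mix_{k,v,r}
    + 4 * D * D      # att key/value/receptance/output (each 768 x 768)
    + 2 * D          # ffn time_mix_{k,r}
    + D * 4 * D      # ffn key (768 x 3072)
    + D * D          # ffn receptance (768 x 768)
    + 4 * D * D      # ffn value (3072 x 768)
)

total = (
    V * D            # emb.weight
    + 2 * D          # ln_out weight + bias
    + D * V          # head.weight
    + L * per_block
)

formula = 2 * V * D + 13 * D * D * L + D * (11 * L + 4)
print(total, formula, formula - total)  # 192778752 192780288 1536 (= 2*D)
```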
So, $2VD$ comes from
emb.weight f32 cpu 65536 768
head.weight f32 cpu 768 65536
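These are the two $V \times D$ matrices, the embedding and the output head:

```python
V, D = 65536, 768
emb = V * D       # emb.weight: 65536 x 768
head = D * V      # head.weight: 768 x 65536
print(emb + head)  # 100663296, i.e. 2*V*D
```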
$13D^2L$ comes from
blocks.0.att.key.weight f32 cpu 768 768
blocks.0.att.value.weight f32 cpu 768 768
blocks.0.att.receptance.weight f32 cpu 768 768
blocks.0.att.output.weight f32 cpu 768 768
blocks.0.ffn.key.weight f32 cpu 768 3072
blocks.0.ffn.receptance.weight f32 cpu 768 768
blocks.0.ffn.value.weight f32 cpu 3072 768
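The seven weight matrices above do sum to $13D^2$ per block, since the two FFN projections are $D \times 4D$ and $4D \times D$ (a quick check):

```python
D = 768
per_block_matrices = (
    4 * D * D    # att key/value/receptance/output, each 768 x 768
    + D * 4 * D  # ffn key: 768 x 3072
    + D * D      # ffn receptance: 768 x 768
    + 4 * D * D  # ffn value: 3072 x 768
)
assert per_block_matrices == 13 * D * D  # 4 + 4 + 1 + 4 = 13
```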
$11DL$ comes from
blocks.0.ln1.weight f32 cpu 768
blocks.0.ln1.bias f32 cpu 768
blocks.0.ln2.weight f32 cpu 768
blocks.0.ln2.bias f32 cpu 768
blocks.0.att.time_decay f32 cpu 768
blocks.0.att.time_first f32 cpu 768
blocks.0.att.time_mix_k f32 cpu 768
blocks.0.att.time_mix_v f32 cpu 768
blocks.0.att.time_mix_r f32 cpu 768
blocks.0.ffn.time_mix_k f32 cpu 768
blocks.0.ffn.time_mix_r f32 cpu 768
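These are eleven per-block vectors (4 LayerNorm parameters, 5 attention time parameters, 2 FFN time-mix parameters), each of size $D$:

```python
D, L = 768, 12
# The eleven per-block vectors listed above
vectors = ["ln1.weight", "ln1.bias", "ln2.weight", "ln2.bias",
           "att.time_decay", "att.time_first",
           "att.time_mix_k", "att.time_mix_v", "att.time_mix_r",
           "ffn.time_mix_k", "ffn.time_mix_r"]
print(len(vectors) * D * L)  # 101376, i.e. 11*D*L
```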
However! Where is the $4D$? The dump only accounts for $2D$ of it:

ln_out.weight f32 cpu 768
ln_out.bias f32 cpu 768

Am I missing something?