Mamba-based LLMs aren't even close to novel though. IBM's been doing this since forever [1].
Also, you're off on Deepseek V3.2's param count: the full model is 685B with the MTP layer.
I don't think there's anything interesting here other than "I guess AMD put out a research paper", and it's hardly cutting-edge when Deepseek or even IBM is running laps around them.
[1] Here's a news article from April, although IBM has been doing it for a long time before that: https://research.ibm.com/blog/bamba-ssm-transformer-model