
So given that GGML can serve something like 100 tok/s on an M2 Max, and this thing advertises 6 tok/s distributed, is this basically for people with lower-end devices?


It's talking about 70B and 160B models. Can GGML run those that fast, even heavily quantized? (I'm guessing it possibly can.) So maybe this is for people who don't have a high-end computer? I have a decent Linux laptop a couple of years old, and there's no way I could run those models that fast. I get a few tokens per second on a quantized 7B model.
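A rough sanity check (a sketch only; the bandwidth and size figures below are ballpark assumptions, not measurements): single-stream decoding is memory-bandwidth bound, because every generated token has to stream all of the weights once, so tok/s is roughly bandwidth divided by model size.

    # Back-of-envelope: decode speed ~= memory bandwidth / model size,
    # since each generated token reads every weight once.
    # All figures are ballpark assumptions, not measurements.
    def est_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    # M2 Max (~400 GB/s) on a 7B Q4 model (~4 GB): ~100 tok/s
    print(est_tok_per_s(400, 4))   # -> 100.0
    # Same machine on a 70B Q4 model (~40 GB): ~10 tok/s at best,
    # and only if the whole model fits in unified memory.
    print(est_tok_per_s(400, 40))  # -> 10.0

By that estimate a 70B model is roughly 10x slower than a 7B one on the same hardware, which lines up with the single-digit tok/s people report for it.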


Yeah. My 3090 gets ~5 tokens/s on a 70B at Q3_K_L.

This is a good idea; splitting LLMs across machines is actually pretty efficient with pipelined requests.
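A minimal sketch of why that works (hypothetical names, not any particular project's API): shard the layers across nodes, connect the shards with queues, and keep several requests in flight so every shard is busy at once.

    import queue, threading

    NUM_STAGES = 4  # e.g. 4 machines, each holding 1/4 of the layers

    def make_stage(inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
        # Each stage applies its shard of layers to whatever request
        # arrives, then forwards the activations downstream. While it
        # waits on the next request, upstream stages are already busy
        # with later ones -- that overlap is the pipelining win.
        def run():
            while True:
                req = inbox.get()
                if req is None:      # shutdown signal, pass it along
                    outbox.put(None)
                    return
                req_id, acts = req
                acts = acts + 1      # stand-in for this shard's layers
                outbox.put((req_id, acts))
        return threading.Thread(target=run, daemon=True)

    # Wire the stages together with queues (stand-ins for network links).
    qs = [queue.Queue() for _ in range(NUM_STAGES + 1)]
    for i in range(NUM_STAGES):
        make_stage(qs[i], qs[i + 1]).start()

    # Keep several requests in flight so all four shards work at once.
    for req_id in range(8):
        qs[0].put((req_id, 0))
    qs[0].put(None)

    while (out := qs[-1].get()) is not None:
        print("request", out[0], "done with activations", out[1])

Per-request this is no faster than one big machine (each request still crosses every hop), but with requests pipelined back-to-back the shards stop idling, so aggregate throughput approaches that of the slowest stage.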


> ...lower-end devices

So, pretty much every other consumer PC available? Those losers.



