
So given that GGML can serve something like 100 tok/s on an M2 Max, and this thing advertises 6 tok/s distributed, is this basically for people with lower-end devices?


It's talking about 70B and 160B models. Can GGML run those that fast, even heavily quantized? (I'm guessing it possibly can.) So maybe this is for people who don't have a high-end computer? I have a decent Linux laptop a couple of years old, and there's no way I could run those models that fast. I get a few tokens per second on a quantized 7B model.
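A rough sanity check (a sketch only; the bandwidth and size figures below are ballpark assumptions, not measurements): single-stream decoding is memory-bandwidth bound, because every generated token has to stream all of the weights once, so tok/s is roughly bandwidth divided by model size.

    # Back-of-envelope: decode speed ~= memory bandwidth / model size,
    # since each generated token reads every weight once.
    # All figures are ballpark assumptions, not measurements.
    def est_tok_per_s(bandwidth_gb_s: float, model_gb: float) -> float:
        return bandwidth_gb_s / model_gb

    # M2 Max (~400 GB/s) on a 7B Q4 model (~4 GB): ~100 tok/s
    print(est_tok_per_s(400, 4))   # -> 100.0
    # Same machine on a 70B Q4 model (~40 GB): ~10 tok/s at best,
    # and only if the whole model fits in unified memory.
    print(est_tok_per_s(400, 40))  # -> 10.0

By that estimate a 70B model is roughly 10x slower than a 7B one on the same hardware, which lines up with the single-digit tok/s people report for it.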


Yeah. My 3090 gets ~5 tokens/s on a 70B at Q3_K_L.

This is a good idea; splitting LLMs across machines is actually pretty efficient with pipelined requests.
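A minimal sketch of why that works (hypothetical names, not any particular project's API): shard the layers across nodes, connect the shards with queues, and keep several requests in flight so every shard is busy at once.

    import queue, threading

    NUM_STAGES = 4  # e.g. 4 machines, each holding 1/4 of the layers

    def make_stage(inbox: queue.Queue, outbox: queue.Queue) -> threading.Thread:
        # Each stage applies its shard of layers to whatever request
        # arrives, then forwards the activations downstream. While it
        # waits on the next request, upstream stages are already busy
        # with later ones -- that overlap is the pipelining win.
        def run():
            while True:
                req = inbox.get()
                if req is None:      # shutdown signal, pass it along
                    outbox.put(None)
                    return
                req_id, acts = req
                acts = acts + 1      # stand-in for this shard's layers
                outbox.put((req_id, acts))
        return threading.Thread(target=run, daemon=True)

    # Wire the stages together with queues (stand-ins for network links).
    qs = [queue.Queue() for _ in range(NUM_STAGES + 1)]
    for i in range(NUM_STAGES):
        make_stage(qs[i], qs[i + 1]).start()

    # Keep several requests in flight so all four shards work at once.
    for req_id in range(8):
        qs[0].put((req_id, 0))
    qs[0].put(None)

    while (out := qs[-1].get()) is not None:
        print("request", out[0], "done with activations", out[1])

Per-request this is no faster than one big machine (each request still crosses every hop), but with requests pipelined back-to-back the shards stop idling, so aggregate throughput approaches that of the slowest stage.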


> ...lower-end devices

So, pretty much every other consumer PC available? Those losers.



