So given that GGML can serve something like 100 tok/s on an M2 Max, and this thing advertises 6 tok/s distributed, is this basically for people with lower-end devices?
It's talking about 70B and 160B models. Can GGML run those that fast even heavily quantized? (I'm guessing possibly.) So maybe this is for people who don't have a high-end computer? I have a decent Linux laptop a couple of years old, and there's no way I could run those models that fast. I get a few tokens per second on a quantized 7B model.
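A rough way to sanity-check those numbers (my own back-of-envelope assumptions, not anything measured from GGML or this project): single-token decoding is mostly memory-bandwidth bound, since every weight gets read once per token. So throughput is roughly usable bandwidth divided by model size on disk:

```python
# Back-of-envelope only: assumes decode speed ~= bandwidth / model size,
# ignoring KV cache, activations, and compute overhead.

def rough_tok_per_s(params_billions: float, bits_per_weight: float,
                    bandwidth_gb_s: float) -> float:
    model_gb = params_billions * bits_per_weight / 8  # weights only
    return bandwidth_gb_s / model_gb

# M2 Max advertises ~400 GB/s unified memory bandwidth.
print(rough_tok_per_s(7, 4, 400))    # 4-bit 7B  -> ~114 tok/s ceiling
print(rough_tok_per_s(70, 4, 400))   # 4-bit 70B -> ~11 tok/s ceiling
# A typical laptop with dual-channel DDR4 gets maybe ~50 GB/s.
print(rough_tok_per_s(7, 4, 50))     # 4-bit 7B  -> ~14 tok/s ceiling
```

These are ceilings, not real speeds, but they line up with the pattern: a 7B fits comfortably and flies on an M2 Max, a 70B is an order of magnitude slower even there, and an older laptop is bandwidth-starved to begin with, which would explain the "few tokens per second on a 7B" experience.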