I have an M4 Max with 128 GB of memory. Even on that machine I would not consider 70B+ models to be usable. Once you go below 20 tokens/s it becomes more like having a pen pal than an AI assistant.
MoE models can still be pretty fast. As are smaller models.
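To put rough numbers on that (back-of-envelope only: decode is approximately memory-bandwidth-bound, so tokens/s is capped by bandwidth divided by the bytes of weights read per token; the bandwidth figure and quantization sizes below are my assumptions, not benchmarks):

    # Back-of-envelope: at decode time every active weight is read once per
    # token, so tokens/s is capped by memory_bandwidth / active_weight_bytes.
    def est_tok_s(active_params_b, bytes_per_param, bw_gb_s):
        """Bandwidth-bound ceiling on decode tokens/s (very rough)."""
        return bw_gb_s / (active_params_b * bytes_per_param)

    M4_MAX_BW = 546  # GB/s -- M4 Max unified memory bandwidth

    # Dense 70B at 4-bit (~0.5 bytes/param): all 70B weights read per token.
    print(est_tok_s(70, 0.5, M4_MAX_BW))  # ~15.6 tok/s -- under the 20 tok/s line

    # MoE with ~13B active params at 4-bit: only the routed experts are read.
    print(est_tok_s(13, 0.5, M4_MAX_BW))  # ~84 tok/s ceiling -- why MoE feels fast

This is a ceiling, not a prediction (it ignores compute, KV-cache reads, and framework overhead), but it matches the intuition: a dense 70B is bandwidth-starved on this hardware while an MoE only pays for its active parameters.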
(This is mostly a warning for anyone enamored with the idea of running these things locally: test it before you spend a lot of money.)
Currently I'd probably say the Nvidia RTX Pro 6000 is the strongest contender if you want local models. It "only" has 96 GB of VRAM, but that memory is very fast (~1,800 GB/s of bandwidth). If you can fit the model on it and it's good enough for your use case, then it's probably worth it even at $10k.
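Same napkin math applied to the RTX Pro 6000, to see what plausibly fits in 96 GB and what the speed ceiling looks like (again a sketch under the bandwidth-bound assumption; the quantized sizes and the KV-cache/activation overhead are rough guesses):

    VRAM_GB = 96
    BW_GB_S = 1800  # advertised memory bandwidth

    def fits_and_ceiling(params_b, bytes_per_param, overhead_gb=8):
        """(fits?, bandwidth-bound tokens/s ceiling); overhead covers KV cache etc."""
        weights_gb = params_b * bytes_per_param
        return weights_gb + overhead_gb <= VRAM_GB, BW_GB_S / weights_gb

    print(fits_and_ceiling(70, 0.5))   # 70B @ 4-bit = 35 GB: fits, ~51 tok/s ceiling
    print(fits_and_ceiling(70, 2.0))   # 70B @ fp16 = 140 GB: does not fit
    print(fits_and_ceiling(123, 0.5))  # 123B @ 4-bit = ~62 GB: fits, ~29 tok/s ceiling

So a quantized 70B that crawls on unified memory has real headroom here, which is the whole argument for the card: less capacity than a big Mac, but roughly 3x the bandwidth.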