This is a good point. Even if ordinary people did have the resources to run 65B on their existing devices, the speed would limit its usefulness quite a bit. In practice, 30B is what most people are going to interact with (if that; I've seen a lot of projects use 13B).
My experience here is pretty similar. I'm heavily (emotionally, at least) invested in models running locally; I refuse to build something around a remote AI that I can only interact with through an API. But I'm not going to pretend that LLaMA has been amazing locally. I really couldn't figure out what to build with it that would be useful.
I'm vaguely hoping that compression actually gets better and that targeted reinforcement/alignment training might change that. GPT can handle a wide range of tasks, but a smaller model could get away with a much narrower scope, and at that point maybe 30B is actually good enough if it's been refined around a very specific problem domain.
For that to happen, though, training needs to get more accessible. Or communities need to get together, decide to build very targeted models, and distribute the weights as "plug-and-play" models you can swap out for different tasks.
And if there's a way to get 65B more accessible, that would be great too.
I'm pretty confident that the landscape is going to look very different by the end of the year, since there are so many people poking around now. I think significantly smaller models will definitely be good enough for specialized tasks, but an equivalently tuned larger model will always be better; the question is by how much. On Meta's benchmarks [1], there's only a tiny gap between 30B and 65B, for example.
For 65B, GPTQ 4-bit should fit LLaMA 65B into 40GiB of memory. Currently the cheapest way to run that at an acceptable speed would be 2 x RTX 3090/4090s (~$2500-3000) or maybe a Jetson Orin 64GB (~$2000). I've seen people trying to run it on an M1 Max and it's just a bit too slow to use comfortably (I get a similar speed when I try it on my 5950X: about 1-2 tokens/s), but it seems to be within a factor of two or so of being fast enough, so it's not out of the question that it gets there through software optimizations alone. I'd definitely upgrade to a 7950X/X3D or a Threadripper (w/ 96GB of DDR5-5200) if I could get 65B running at a comfortable speed all the time.
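As a sanity check on the 40GiB figure, here's a quick back-of-envelope sketch. The overhead and KV cache numbers are assumptions, not measurements from any particular GPTQ setup:

    # Rough VRAM estimate for LLaMA 65B at ~4 bits/weight.
    # overhead_frac covers quantization scales/zero-points and misc buffers,
    # kv_cache_gib assumes a ~2k context with an fp16 KV cache (both assumed).
    def estimate_vram_gib(n_params_b=65.2,      # LLaMA 65B: ~65.2B parameters
                          bits_per_weight=4.0,
                          overhead_frac=0.15,   # assumed
                          kv_cache_gib=2.0):    # assumed
        weights_gib = n_params_b * 1e9 * bits_per_weight / 8 / 2**30
        return weights_gib * (1 + overhead_frac) + kv_cache_gib

    print(f"~{estimate_vram_gib():.1f} GiB")    # ~37 GiB

That lands around 37 GiB, which is why it squeezes into 40GiB or a pair of 24GB cards.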
I think training is also advancing at a pretty good clip. LLaMA-Adapter [2] fine-tunes LLaMA 13B on a single 8xA100 system in 1h (so ~$12 on a spot instance) and was already over 3X faster than Alpaca's training.
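The cost math is simple; a small sketch, assuming the ~$12/hr node rate is roughly what an 8xA100 spot instance went for (that rate is an assumption; the 1h and "over 3X faster" figures are from above):

    node_rate_usd_per_hr = 12.0          # assumed spot rate, whole 8xA100 node
    adapter_hours = 1.0                  # LLaMA-Adapter fine-tune, per above
    alpaca_hours = adapter_hours * 3     # lower bound implied by "over 3X faster"

    print(f"LLaMA-Adapter: ~${adapter_hours * node_rate_usd_per_hr:.0f}")
    print(f"Alpaca:        >${alpaca_hours * node_rate_usd_per_hr:.0f}")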
To me, the biggest thing limiting easy plug-and-play distribution is actually LLaMA's licensing issues, so maybe someone will offer a better open foundational model soon and the community can standardize on that. It'd be nice to have a larger context window (Flash Attention?) as well.
FYI, many of us are indeed running 65B. I'm running 65B at 4-bit and getting about 7.5 tokens per second. Granted, I have a beefy machine with 2x 3090s and NVLink, but that's certainly well within the realm of any small lab.
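For anyone wondering how a number like 7.5 tokens/s is usually measured, a minimal sketch: time one generation call and divide the new tokens by the elapsed seconds. The model loading and generate calls below are hypothetical placeholders for whatever runtime you use (GPTQ kernels, llama.cpp bindings, etc.):

    import time

    def tokens_per_second(generate, prompt, n_new_tokens=128):
        # Time a single generation of n_new_tokens and report throughput.
        start = time.perf_counter()
        generate(prompt, max_new_tokens=n_new_tokens)   # hypothetical signature
        elapsed = time.perf_counter() - start
        return n_new_tokens / elapsed

    # Example (hypothetical runtime):
    # model = load_model("llama-65b-4bit")              # hypothetical loader
    # print(tokens_per_second(model.generate, "Hello"))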