This is a good point. Even if ordinary people did have the resources to run 65B on their existing devices, the speed would limit its usefulness quite a bit. In practice, 30B is what most people are going to interact with (if that; I've seen a lot of projects use 13B).
My experience here is pretty similar. I'm heavily (emotionally, at least) invested in models running locally; I refuse to build something around a remote AI that I can only interact with through an API. But I'm not going to pretend that LLaMA has been amazing locally. I really couldn't figure out what to build with it that would be useful.
I'm vaguely hoping that compression actually gets better and that targeted reinforcement/alignment training might change that. GPT can handle a wide range of tasks, but a smaller model could get away with a much narrower scope, and at that point maybe 30B is actually good enough if it's been refined around a very specific problem domain.
For that to happen, though, training needs to get more accessible. Or communities need to get together, decide to build very targeted models, and distribute the weights as "plug-and-play" models you can swap out for different tasks.
And if there's a way to get 65B more accessible, that would be great too.
I'm pretty confident that the landscape is going to look very different by the end of the year, since there are so many people poking around now. I think significantly smaller models will definitely be good enough for specialized tasks, but an equivalently tuned larger model will always be better; the question is by how much. On Meta's benchmarks [1], there's only a tiny gap between 30B and 65B, for example.
For 65B, GPTQ 4-bit should fit LLaMA 65B into 40GiB of memory. Currently the cheapest way to run that at an acceptable speed would be 2 x RTX 3090/4090s (~$2500-3000) or maybe a Jetson Orin 64GB (~$2000). I've seen people trying to run it on an M1 Max and it's just a bit too slow to use comfortably (I get a similar speed when I try it on my 5950X: about 1-2 tokens/s), but it seems to be within a factor of two or so of being fast enough, so it's not out of the question that it gets there through software optimizations alone. I'd definitely upgrade to a 7950X/X3D or a Threadripper (w/ 96GB of DDR5-5200) if I could get 65B running at a comfortable speed all the time.
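As a sanity check on the 40GiB figure, here's a quick back-of-envelope sketch. The overhead and KV cache numbers are assumptions, not measurements from any particular GPTQ setup:

    # Rough VRAM estimate for LLaMA 65B at ~4 bits/weight.
    # overhead_frac covers quantization scales/zero-points and misc buffers,
    # kv_cache_gib assumes a ~2k context with an fp16 KV cache (both assumed).
    def estimate_vram_gib(n_params_b=65.2,      # LLaMA 65B: ~65.2B parameters
                          bits_per_weight=4.0,
                          overhead_frac=0.15,   # assumed
                          kv_cache_gib=2.0):    # assumed
        weights_gib = n_params_b * 1e9 * bits_per_weight / 8 / 2**30
        return weights_gib * (1 + overhead_frac) + kv_cache_gib

    print(f"~{estimate_vram_gib():.1f} GiB")    # ~37 GiB

That lands around 37 GiB, which is why it squeezes into 40GiB or a pair of 24GB cards.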
I think training is also advancing at a pretty good clip. LLaMA-Adapter [2] fine-tunes LLaMA 13B on a single 8xA100 system in 1h (so ~$12 on a spot instance) and was already over 3X faster than Alpaca's training.
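The cost math is simple; a small sketch, assuming the ~$12/hr node rate is roughly what an 8xA100 spot instance went for (that rate is an assumption; the 1h and "over 3X faster" figures are from above):

    node_rate_usd_per_hr = 12.0          # assumed spot rate, whole 8xA100 node
    adapter_hours = 1.0                  # LLaMA-Adapter fine-tune, per above
    alpaca_hours = adapter_hours * 3     # lower bound implied by "over 3X faster"

    print(f"LLaMA-Adapter: ~${adapter_hours * node_rate_usd_per_hr:.0f}")
    print(f"Alpaca:        >${alpaca_hours * node_rate_usd_per_hr:.0f}")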
To me, the biggest thing limiting easy plug-and-play distribution is actually LLaMA's licensing issues, so maybe someone will offer a better open foundational model soon and the community can standardize on that. It'd be nice to have a larger context window (Flash Attention?) as well.
FYI, many of us are indeed running 65B. I'm running 65B at 4-bit and getting about 7.5 tokens per second. Granted, I have a beefy machine with 2x 3090s and NVLink, but that's certainly well within the realm of any small lab.
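For anyone wondering how a number like 7.5 tokens/s is usually measured, a minimal sketch: time one generation call and divide the new tokens by the elapsed seconds. The model loading and generate calls below are hypothetical placeholders for whatever runtime you use (GPTQ kernels, llama.cpp bindings, etc.):

    import time

    def tokens_per_second(generate, prompt, n_new_tokens=128):
        # Time a single generation of n_new_tokens and report throughput.
        start = time.perf_counter()
        generate(prompt, max_new_tokens=n_new_tokens)   # hypothetical signature
        elapsed = time.perf_counter() - start
        return n_new_tokens / elapsed

    # Example (hypothetical runtime):
    # model = load_model("llama-65b-4bit")              # hypothetical loader
    # print(tokens_per_second(model.generate, "Hello"))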