
I am interested to learn why models move so much data per second. Where could I learn more that is not a ChatGPT session?


Models are made of "parameters", which are really weights in a large neural network. For each token generated, every parameter has to be read from memory and take its turn in the computation on the CPU/GPU.

So if you have a 7B parameter model with 16-bit quantization, that means you'll have 14 GB/s of data coming in. If you only have 153 GB/sec of memory bandwidth, that means you'll cap out ~11 tokens/sec, regardless of how much processing power you have.

You can of course quantize to 8-bit or even 4-bit, or use a smaller model, but doing so makes your model dumber. There's a trade-off between performance and capability.


I think you mean GB/token


Err...yup. My bad. Can't edit it now.
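
To make the arithmetic concrete with the corrected units, here is a rough Python sketch of the bandwidth-bound estimate (the 7B / 16-bit / 153 GB/s figures are just the example numbers from the comment above, not measurements of any particular machine):

    # Rough ceiling on generation speed when memory bandwidth is the bottleneck.
    # Assumption: every weight byte has to be read once per generated token.

    def bandwidth_bound_tokens_per_sec(params_billion, bits_per_param, bandwidth_gb_s):
        gb_per_token = params_billion * (bits_per_param / 8)  # GB read per token
        return bandwidth_gb_s / gb_per_token

    # 7B params at 16-bit -> 14 GB per token; 153 GB/s -> ~11 tokens/sec
    print(bandwidth_bound_tokens_per_sec(7, 16, 153))  # ~10.9

    # The same model quantized to 4-bit only needs 3.5 GB per token
    print(bandwidth_bound_tokens_per_sec(7, 4, 153))   # ~43.7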


You might be interested in LLM Systems, a course that covers how LLMs work at the hardware level and what optimizations can be done to improve their efficiency: https://llmsystem.github.io/llmsystem2025spring/


The models (weights and activations and caches) can fill all the memory you have and more, and to a first (very rough) approximation every byte needs to be accessed for each token generated. You can see how that would add up.
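
As a rough sketch of how that adds up, here is a back-of-the-envelope footprint estimate in Python; the shapes (32 layers, 4096 hidden dim, fp16, 4096-token context) are illustrative assumptions for a 7B-class model, not exact figures for any specific checkpoint:

    # Back-of-the-envelope memory footprint: weights plus KV cache, in GB.
    # All shapes below are assumed, Llama-style numbers for a 7B-class model.

    BYTES_PER_ELEM = 2        # fp16
    N_PARAMS       = 7e9
    N_LAYERS       = 32
    HIDDEN_DIM     = 4096
    CONTEXT_LEN    = 4096     # tokens currently held in the context

    weights_gb = N_PARAMS * BYTES_PER_ELEM / 1e9

    # The cache stores a K and a V vector (hidden_dim wide) per layer per token
    kv_cache_gb = 2 * N_LAYERS * CONTEXT_LEN * HIDDEN_DIM * BYTES_PER_ELEM / 1e9

    print(f"weights:  {weights_gb:.1f} GB")   # ~14.0 GB
    print(f"KV cache: {kv_cache_gb:.1f} GB")  # ~2.1 GB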

I highly recommend Andrej Karpathy's videos if you want to learn details.


A very simplified version is: you need the whole matrix to compute a matrix x vector product, even if the vector is mostly zeroes. Edit: obviously my simplification isn't strictly right, but once you factor in compression etc. you get the idea.
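
A tiny numpy illustration of that point (the 4096x4096 size is arbitrary): a dense matrix-vector product reads every byte of the matrix no matter how sparse the vector is, so the memory traffic is dominated by the matrix.

    import numpy as np

    # Dense matvec: all of W is read even though x is almost entirely zero,
    # so the bytes moved are dominated by the matrix, not the vector.
    rows, cols = 4096, 4096
    W = np.random.randn(rows, cols).astype(np.float16)   # the "weights"
    x = np.zeros(cols, dtype=np.float16)
    x[:16] = 1.0                                          # mostly-zero input vector

    y = W @ x                                             # still touches every row of W

    print(f"matrix bytes read: {W.nbytes / 1e6:.1f} MB")  # ~33.6 MB
    print(f"vector bytes read: {x.nbytes / 1e3:.1f} KB")  # ~8.2 KB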


Would you mind specifying which video(s)? He has quite a lot of content to consume.



