By the “delta in the LLM weights”, I am assuming you mean the gradients. You are effectively describing large-batch training (data parallelism), which is one way to scale up, but there are quickly diminishing returns to large batch sizes.
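A minimal sketch of what that gradient averaging looks like, assuming PyTorch's torch.distributed API (the model, batch, and optimizer here are placeholders, not anything from the original discussion):

    import torch
    import torch.distributed as dist

    def data_parallel_step(model, batch, optimizer):
        # Each worker computes gradients on its own shard of the batch.
        loss = model(batch).mean()  # placeholder loss
        loss.backward()

        world_size = dist.get_world_size()
        for p in model.parameters():
            if p.grad is not None:
                # Sum the local gradients across all workers, then average.
                # This is mathematically equivalent to a single step on a
                # batch that is world_size times larger.
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
                p.grad /= world_size

        optimizer.step()
        optimizer.zero_grad()

Because the averaged gradient is the same as one computed on the combined batch, adding workers only grows the effective batch size; past a point, larger batches stop reducing gradient noise enough to let you raise the learning rate proportionally, which is where the diminishing returns come from.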

