
The scaling is brutal. If you have a 20k-word vocabulary and want to model 3-grams, you need a table with 20000^3 entries (8 trillion), most of which will be empty.
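
For a rough sense of that scale, a quick back-of-the-envelope sketch in Python (the 20k vocabulary is just the number above):

    vocab_size = 20_000
    dense_cells = vocab_size ** 3            # one counter per possible trigram
    print(f"{dense_cells:,}")                # 8,000,000,000,000 -> 8 trillion
    print(dense_cells * 4 / 1e12, "TB")      # ~32 TB even at 4 bytes per count,
                                             # nearly all of it zeros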

GPT and friends cheat by modeling each word not as a separate symbol but as a high-dimensional “embedding” (just a vector, if you also find new vocabulary silly). The embedding places similar words near each other in this space; the famous king − man + woman ≈ queen example. So even if your training set has never seen “The Queen ordered the traitor <blank>”, it might have previously seen “The King ordered the traitor beheaded”. The vector representation lets the model use words that represent similar concepts without needing concrete examples of each one.
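
As a toy illustration of that analogy arithmetic (hand-picked 3-d vectors purely for illustration; real embeddings are learned and have hundreds of dimensions):

    import numpy as np

    # Hand-made "embeddings": axes loosely stand for (royalty, gender, other).
    emb = {
        "king":    np.array([0.9,  0.8, 0.1]),
        "queen":   np.array([0.9, -0.8, 0.1]),
        "man":     np.array([0.1,  0.8, 0.9]),
        "woman":   np.array([0.1, -0.8, 0.9]),
        "traitor": np.array([-0.5, 0.1, 0.4]),
    }

    def closest(vec, exclude=()):
        # Highest cosine similarity to vec, skipping the words in `exclude`.
        cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
        return max((w for w in emb if w not in exclude),
                   key=lambda w: cos(emb[w], vec))

    # king - man + woman lands nearest to queen in this toy space.
    print(closest(emb["king"] - emb["man"] + emb["woman"],
                  exclude=("king", "man", "woman")))   # queen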



Importantly, though, LLMs do not take the embeddings as input during training; they take the tokens and learn the embeddings as part of the training.

Specifically, all Transformer-based models; older models used things like word2vec or ELMo, but all current LLMs train their embeddings from scratch.
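
A minimal sketch of that point, assuming PyTorch (the sizes are made up): the model's input is token ids, and the embedding table is just another learnable parameter, trained end to end with everything else rather than supplied up front.

    import torch.nn as nn

    vocab_size, d_model = 20_000, 512

    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)  # learned, starts random
            self.head = nn.Linear(d_model, vocab_size)      # next-token logits

        def forward(self, token_ids):          # token_ids: (batch, seq) of ints
            x = self.embed(token_ids)          # (batch, seq, d_model)
            return self.head(x)                # a real model puts a Transformer here

    model = TinyLM()
    print(model.embed.weight.requires_grad)    # True: embeddings get gradients too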


And tokens are now going down to the byte level:

https://ai.meta.com/research/publications/byte-latent-transf...


You shouldn't need to allocate every possible combination !_! if you dynamically add new pairs/distances as you find them. I'm talking simple for loops.
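
A sketch of that idea with a dict and plain for loops (toy corpus; only trigrams that actually occur get an entry, so storage grows with the corpus rather than with vocab^3):

    from collections import defaultdict

    corpus = "the king ordered the traitor beheaded".split()

    counts = defaultdict(int)
    for i in range(len(corpus) - 2):            # slide a window of 3 words
        counts[tuple(corpus[i:i + 3])] += 1

    print(len(counts))                          # 4 entries, not 8 trillion
    print(counts[("the", "king", "ordered")])   # 1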


You might enjoy this read, which is an up-to-date document from this year laying out what was the state of the art 20 years ago:

https://web.stanford.edu/~jurafsky/slp3/3.pdf

Essentially you just count every n-gram that's actually in the corpus, and "fill in the blanks" for all the 0s with some simple rules for smoothing out the probability.
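
For instance, the simplest of those rules is add-one (Laplace) smoothing; here's a sketch with a toy bigram model (corpus and numbers are made up, just to show the mechanics):

    from collections import defaultdict

    corpus = "the king ordered the traitor beheaded".split()
    vocab = set(corpus)

    bigram, unigram = defaultdict(int), defaultdict(int)
    for w1, w2 in zip(corpus, corpus[1:]):
        bigram[(w1, w2)] += 1
        unigram[w1] += 1

    def p_add_one(w2, w1):
        # Pretend every continuation was seen once more than it really was,
        # so unseen bigrams get a small nonzero probability.
        return (bigram[(w1, w2)] + 1) / (unigram[w1] + len(vocab))

    print(round(p_add_one("traitor", "the"), 3))    # seen once:  (1+1)/(2+5) -> 0.286
    print(round(p_add_one("beheaded", "the"), 3))   # never seen: (0+1)/(2+5) -> 0.143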



