The scaling is brutal. If you have a 20k word vocabulary and want to do 3 grams,...

andrewla · 2025-03-17T20:50:43 1742244643

Importantly, though, LLMs do not take the embeddings as input during training; they take the tokens and learn the embeddings as part of the training.

Specifically all Transformer-based models; older models used things like word2vec or elmo, but all current LLMs train their embeddings from scratch.

naasking · 2025-03-17T22:54:26 1742252066

And tokens are now going down to the byte level:

https://ai.meta.com/research/publications/byte-latent-transf...

YesBox · 2025-03-18T00:16:25 1742256985

You shouldn't need to allocate every possible combination !_! if you dynamically add new pairs/distance as you find them. Im talkin simple for loops.

currymj · 2025-03-18T02:34:11 1742265251

you might enjoy this read, which is an up-to-date document from this year laying out what was the state of the art 20 years ago:

https://web.stanford.edu/~jurafsky/slp3/3.pdf

Essentially you just count every n-gram that's actually in the corpus, and "fill in the blanks" for all the 0s with some simple rules for smoothing out the probability.