Each word is mapped to a distribution over words -- this is where the illusion of "context" largely comes from. E.g., "cat" is replaced by a weighted set (cat, kitten, pet, mammal, ...) obtained from frequencies in a historical dataset.
So technically the LLM is not doing P(next word | previous word) -- but rather P(associated_words(next word) | associated_words(previous word), associated_words(previous word - 1), ...).
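A toy sketch of the single-step version of that idea, in Python: condition on associated_words(previous word) instead of on the literal previous word. The ASSOCIATED_WORDS table, the tiny corpus, and every weight below are invented for illustration -- this is not how any real model or dataset is built, just the shape of the claim.

```python
from collections import Counter

# Hypothetical association table: each word maps to a weighted set of
# related words, as if estimated from frequencies in a historical corpus.
ASSOCIATED_WORDS = {
    "cat": {"cat": 0.5, "kitten": 0.2, "pet": 0.2, "mammal": 0.1},
    "sat": {"sat": 0.6, "rested": 0.25, "perched": 0.15},
}

# Tiny stand-in for the "historical dataset": (context word, next word) pairs.
TOY_CORPUS = [
    ("cat", "sat"), ("kitten", "slept"), ("pet", "sat"),
    ("mammal", "ran"), ("cat", "purred"), ("pet", "purred"),
]

def expand(word):
    """Replace a literal word with its weighted set of associated words."""
    return ASSOCIATED_WORDS.get(word, {word: 1.0})

def next_word_scores(previous_word):
    """Score next-word candidates by conditioning on associated_words(previous)
    rather than on the literal previous word alone."""
    scores = Counter()
    for assoc, weight in expand(previous_word).items():
        for context, nxt in TOY_CORPUS:
            if context == assoc:
                scores[nxt] += weight
    total = sum(scores.values())
    return {w: s / total for w, s in scores.items()} if total else {}

print(next_word_scores("cat"))
# "slept" and "ran" show up only because "kitten" and "mammal" sit in
# cat's associated-word set -- the expansion is what widens the search space.
```

The same expansion applied to every previous word (not just the last one) gives the multi-term conditional written above.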
This means its search space at each conditional step is still extremely large within the historical corpus, and there's more flexibility to reach "across and between contexts" -- but it isn't sensitive to context; we just arranged the data that way.
Soon enough, people with enough money will build diagnostic (XAI) models of LLMs powerful enough to show this process at work over their training data.
To visualize it roughly, imagine you're in a library and you're asked a question. The first word selects a very large number of pages across many books (and some whole books); the second word selects other books as well as pages within the books you already have. Keep going: each additional word you're asked is converted into a set of associated words, which pulls in more pages and books while also narrowing the paragraph samples from the ones you have. Finally, with the total set of pages and paragraphs in hand at the end of the question, you find the word most probable to follow the others.
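Roughly in code, the analogy looks like the sketch below: each question word (expanded to its associates) scores the "pages", and the final answer is read off the top-scoring ones. The passages and association sets are made up; it's only the mechanics of the analogy, not a real retrieval system.

```python
from collections import Counter

# Invented "library": a few passages standing in for pages, plus an
# association table expanding each query word into related words.
PASSAGES = [
    "the cat sat on the mat",
    "a kitten slept near the fire",
    "dogs and other pets ran in the park",
    "the mammal rested under a tree",
    "the cat purred on the sofa",
]
ASSOCIATED = {
    "cat": {"cat", "kitten", "pet", "pets", "mammal"},
    "sat": {"sat", "rested", "slept"},
}

def expand(word):
    return ASSOCIATED.get(word, {word})

def library_search(question_words):
    """Score every passage by how many associated words of the question it
    contains; each extra question word sharpens the ranking (the 'narrowing')."""
    scores = Counter()
    for word in question_words:
        assoc = expand(word)
        for passage in PASSAGES:
            scores[passage] += len(assoc & set(passage.split()))
    return scores.most_common()

def most_probable_next(question_words, after_word):
    """From the retained passages, count which word most often follows
    `after_word` (or its associates) -- the final step in the analogy."""
    follow = Counter()
    kept = [p for p, score in library_search(question_words) if score > 0]
    targets = expand(after_word)
    for passage in kept:
        tokens = passage.split()
        for i, tok in enumerate(tokens[:-1]):
            if tok in targets:
                follow[tokens[i + 1]] += 1
    return follow.most_common(1)

print(library_search(["cat", "sat"]))
print(most_probable_next(["cat", "sat"], "cat"))  # e.g. [('sat', 1)]
```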
This process will eventually be visualised properly with a real-world LLM, but it will take a significant investment to build this sort of explanatory model, since you need to reverse from the weights back to the training data across the entire inference process.