But then you can’t just give the previous frame; with the LLM analogy, you would have to give the last few thousand frames (that’s the context window, right?). If you only give the previous frame, that’s like having an LLM that only gets the single previous token and has to predict the next one.
Indeed. Although more recently they figured out a way to feed the hidden state back in as the new input, which basically allows the model to "continue thinking" in vectors without round-tripping through words (or pixels).
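The recurrence being described looks roughly like this. A toy NumPy sketch, where all the dimensions and random weights are made up for illustration (a real world model would have trained weights and a far richer architecture): the hidden state is fed back in each step, and after the first observed frame the model rolls forward purely in latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- assumptions for the sketch, not from any real model.
LATENT = 16   # size of the hidden "thought" vector
FRAME = 8     # size of an encoded observation/frame

# Random weights standing in for a trained world model.
W_h = rng.standard_normal((LATENT, LATENT)) * 0.1
W_x = rng.standard_normal((LATENT, FRAME)) * 0.1
W_out = rng.standard_normal((FRAME, LATENT)) * 0.1

def step(h, x):
    """One update: mix the previous hidden state with the new frame encoding."""
    h_new = np.tanh(W_h @ h + W_x @ x)
    frame_pred = W_out @ h_new   # decode a predicted next frame
    return h_new, frame_pred

# "Continue thinking in vectors": after one real frame, the hidden state is
# carried forward and the model's own prediction becomes the next input,
# with no detour through pixels in between.
h = np.zeros(LATENT)
x = rng.standard_normal(FRAME)   # one observed frame
for _ in range(5):
    h, pred = step(h, x)
    x = pred                     # latent rollout: prediction feeds back in

print(pred.shape)  # (8,)
```

The key point is that `h` never gets decoded and re-encoded between steps, so whatever the model "knows" stays in vector form across the rollout.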
Presumably if you were to take that and build a large enough NN to accommodate all the necessary state it needs to carry and all the rules it needs to be able to execute, then after training it on enough game input you'd have a proper world simulation. Of course, as the article rightly notes, then you have just successfully reimplemented Minecraft in a way that is orders of magnitude more computationally expensive...
Perhaps the trick used by text-based LLMs could be applied here: when the context window starts filling up, the LLM is asked to summarize the existing data in the context, lossily compressing it into a smaller space.
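In pseudocode terms, the idea is something like the following. This is only a toy sketch: the `summarize` function here is a trivial stand-in (keeping the first word of each entry) for what would really be a call back into the model itself, and the threshold and compression policy are invented for illustration.

```python
MAX_ITEMS = 6  # arbitrary toy context limit

def summarize(entries):
    # Stand-in for an LLM summarization call: deliberately lossy.
    return "SUMMARY: " + " / ".join(e.split()[0] for e in entries)

def add_to_context(context, entry, max_items=MAX_ITEMS):
    """Append an entry; if the context overflows, compress the older half."""
    context.append(entry)
    if len(context) > max_items:
        half = len(context) // 2
        # Replace the oldest entries with a single summary entry.
        context[:half] = [summarize(context[:half])]
    return context

ctx = []
for i in range(10):
    add_to_context(ctx, f"message {i} with details")

print(len(ctx))        # stays bounded near MAX_ITEMS
print(ctx[0][:8])      # oldest material survives only as a summary
```

Note that summaries themselves eventually get summarized again, so the oldest information degrades gradually rather than being dropped outright, which is exactly the lossy-compression trade-off the comment describes.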