But then you can’t just give the previous frame; with the LLM analogy, you would have to give the last few thousand frames (that’s the context window, right?). If you only give the previous frame, that’s like having an LLM that only gets the single previous token and has to predict the next one.
Indeed. Although more recently they figured out a way to feed the hidden state back in as the new input, which basically allows the model to "continue thinking" in vectors without round-tripping through words (or pixels).
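The recurrence being described looks roughly like this. A toy NumPy sketch, where all the dimensions and random weights are made up for illustration (a real world model would have trained weights and a far richer architecture): the hidden state is fed back in each step, and after the first observed frame the model rolls forward purely in latent space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions -- assumptions for the sketch, not from any real model.
LATENT = 16   # size of the hidden "thought" vector
FRAME = 8     # size of an encoded observation/frame

# Random weights standing in for a trained world model.
W_h = rng.standard_normal((LATENT, LATENT)) * 0.1
W_x = rng.standard_normal((LATENT, FRAME)) * 0.1
W_out = rng.standard_normal((FRAME, LATENT)) * 0.1

def step(h, x):
    """One update: mix the previous hidden state with the new frame encoding."""
    h_new = np.tanh(W_h @ h + W_x @ x)
    frame_pred = W_out @ h_new   # decode a predicted next frame
    return h_new, frame_pred

# "Continue thinking in vectors": after one real frame, the hidden state is
# carried forward and the model's own prediction becomes the next input,
# with no detour through pixels in between.
h = np.zeros(LATENT)
x = rng.standard_normal(FRAME)   # one observed frame
for _ in range(5):
    h, pred = step(h, x)
    x = pred                     # latent rollout: prediction feeds back in

print(pred.shape)  # (8,)
```

The key point is that `h` never gets decoded and re-encoded between steps, so whatever the model "knows" stays in vector form across the rollout.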
Presumably if you were to take that and build a large enough NN to accommodate all the necessary state it needs to carry and all the rules it needs to be able to execute, then after training it on enough game input you'd have a proper world simulation. Of course, as the article rightly notes, then you have just successfully reimplemented Minecraft in a way that is orders of magnitude more computationally expensive...
Perhaps the trick used by text-based LLMs could be applied here: when the context window starts filling up, the LLM is asked to summarize the existing data in the context, lossily compressing it into a smaller space.
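In pseudocode terms, the idea is something like the following. This is only a toy sketch: the `summarize` function here is a trivial stand-in (keeping the first word of each entry) for what would really be a call back into the model itself, and the threshold and compression policy are invented for illustration.

```python
MAX_ITEMS = 6  # arbitrary toy context limit

def summarize(entries):
    # Stand-in for an LLM summarization call: deliberately lossy.
    return "SUMMARY: " + " / ".join(e.split()[0] for e in entries)

def add_to_context(context, entry, max_items=MAX_ITEMS):
    """Append an entry; if the context overflows, compress the older half."""
    context.append(entry)
    if len(context) > max_items:
        half = len(context) // 2
        # Replace the oldest entries with a single summary entry.
        context[:half] = [summarize(context[:half])]
    return context

ctx = []
for i in range(10):
    add_to_context(ctx, f"message {i} with details")

print(len(ctx))        # stays bounded near MAX_ITEMS
print(ctx[0][:8])      # oldest material survives only as a summary
```

Note that summaries themselves eventually get summarized again, so the oldest information degrades gradually rather than being dropped outright, which is exactly the lossy-compression trade-off the comment describes.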