A diffusion model cannot be a game engine because a game engine can be used to create new games and modify the rules of existing games in real time -- even rules which are not visible on-screen.
These tools are fascinating but, as with all AI hype, they need a disclaimer: The tool didn't create the game. It simply generated frames and the appearance of play mechanics from a game it sampled (which humans created).
If someone told you 10 years ago that they were going to create something where you could play a whole new level of Doom, without them writing a single line of game logic/rendering code, would you say that that is simpler than creating a demo by writing the game themselves?
There are two things at play here: the complexity of the underlying mechanism, and the complexity of detailed creation. This is obviously a complicated mechanism, but in another sense it's a trivial result compared to actually reproducing the game itself in its original intended state.
They only trained it on one game and only embedded the control inputs. You could train it on many games and embed a lot more information about each of them, which might let you write a prompt that describes a game and then play it.
One thing I'd like to see is to take a game rendered with low poly assets (or segmented in some way) and use a diffusion model to add realistic or stylized art details. This would fix the consistency problem while still providing tangible benefits.
All video games are, by definition, interactive videos.
I imagine what you're asking about is this. A typical game like Doom is effectively a function:
f(internal state, player input) -> (new frame, new internal state)
where internal state is the shape and look of the loaded map, plus the positions, behaviors and stats of enemies, the player, items, etc.
A typical AI that plays Doom, which is not what's happening here, is (at runtime):
f(last frame) -> new player input
and is attached in a loop to the previous case in the obvious way.
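To make those two signatures concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the one-row "map", the `State` fields, and the names `game_step`, `render`, `agent` and `run` are assumptions, not anything from Doom or from a real Doom-playing agent. It only shows how a conventional game keeps explicit state, and how an agent that sees nothing but the last frame closes the loop.

```python
from dataclasses import dataclass

WIDTH = 5  # toy one-row "map"; purely illustrative


@dataclass(frozen=True)
class State:
    """Explicit internal state, invisible to anything that only sees frames."""
    player_x: int
    ammo: int


def render(state: State) -> str:
    """Draw the state as a text 'frame'."""
    row = ["."] * WIDTH
    row[state.player_x] = "@"
    return "".join(row) + f" ammo={state.ammo}"


def game_step(state: State, player_input: str) -> tuple[str, State]:
    """f(internal state, player input) -> (new frame, new internal state)."""
    x, ammo = state.player_x, state.ammo
    if player_input == "left":
        x = max(0, x - 1)
    elif player_input == "right":
        x = min(WIDTH - 1, x + 1)
    elif player_input == "fire" and ammo > 0:
        ammo -= 1
    new_state = State(x, ammo)
    return render(new_state), new_state


def agent(last_frame: str) -> str:
    """f(last frame) -> new player input. It reads only the frame."""
    return "right" if last_frame.index("@") < WIDTH - 1 else "fire"


def run(steps: int) -> list[str]:
    """Attach the agent to the game in a loop."""
    state = State(player_x=0, ammo=2)
    frame = render(state)
    frames = [frame]
    for _ in range(steps):
        frame, state = game_step(state, agent(frame))
        frames.append(frame)
    return frames
```

Calling `run(6)` walks the player to the right edge and then fires until the ammo runs out. The point is that `game_step` owns the state, and the agent never touches it.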
What we have here, however, is a game you can play, but implemented in a diffusion model, and it works like this:
f(player input, N last frames) -> new frame
Of note here is the lack of game state - the state is implicit in the contents of the N previous frames, and is otherwise not represented or mutated explicitly. The diffusion model has seen so much Doom that it, in a way, internalized most of the state and its evolution, so it can look at what's going on and guess what's about to happen. Which is what it does: it renders the next frame by predicting it, based on current user input and last N frames. And then that frame becomes the input for the next prediction, and so on, and so on.
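The shape of that loop can be sketched in a few lines of Python. This is a toy under loud assumptions: `predict_frame` is a hand-written, deterministic stand-in for the model (a real diffusion model would sample a frame from a learned distribution), and the names `play`, `N` and `WIDTH` and the text "frames" are invented for the example. What it does show faithfully is the structure: no state object anywhere, just a sliding window of the last N frames, with each output fed back in as input.

```python
from collections import deque

N = 4      # context length: how many past frames the predictor sees (assumed)
WIDTH = 5  # toy one-row "screen"; purely illustrative


def predict_frame(player_input: str, last_frames: list[str]) -> str:
    """f(player input, N last frames) -> new frame.

    Stand-in for the model. There is no state argument: the player's
    position is recovered from what is visible in the most recent frame.
    """
    x = last_frames[-1].index("@")
    if player_input == "left":
        x = max(0, x - 1)
    elif player_input == "right":
        x = min(WIDTH - 1, x + 1)
    row = ["."] * WIDTH
    row[x] = "@"
    return "".join(row)


def play(inputs: list[str], first_frame: str = "@....") -> list[str]:
    """Autoregressive loop: each predicted frame becomes part of the next input."""
    context = deque([first_frame], maxlen=N)  # older frames fall out of the window
    for player_input in inputs:
        frame = predict_frame(player_input, list(context))
        context.append(frame)
    return list(context)
```

In this toy, anything that is not readable from the frames in the window simply does not exist for the predictor, which is the same limitation the "implicit state" description points at.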
So yes, it's totally an interactive video and a game and a third thing - a probabilistic emulation of Doom on a generative ML model.
Making an interactive video of it. It is not playing the game; a human does that.
With that said, I wholly disagree that this is not an engine. This is absolutely a game engine and while this particular demo uses the engine to recreate DOOM, an existing game, you could certainly use this engine to produce new games in addition to extrapolating existing games in novel ways.