
After some discussion in this thread, I think it's worth pointing out that this paper is NOT describing a system that receives real-time user input and adjusts its output accordingly, even though, to me, the wording of the abstract heavily implied it was.

It's trained on a large set of data in which agents played DOOM, and video samples are shown to users for evaluation, but users are not feeding inputs into the simulation in real time in a way that amounts to "playing DOOM" at ~20 FPS.

There are some key phrases within the paper that hint at this, such as "Key questions remain, such as ... how games would be effectively created in the first place, including how to best leverage human inputs" and "Our end goal is to have human players interact with our simulation.", but mostly it's the omission of any section describing real-time user gameplay.



We can't assess the quality of gameplay ourselves of course (since the model wasn't released), but one author said "It's playable, the videos on our project page are actual game play." (https://x.com/shlomifruchter/status/1828850796840268009) and the video on top of https://gamengen.github.io/ starts out with "these are real-time recordings of people playing the game". Based on those claims, it seems likely that they did get a playable system in front of humans by the end of the project (though perhaps not by the time the draft was uploaded to arXiv).


I also thought this, but refer back to the paper, not the abstract:

> A is the set of key presses and mouse movements…

> …to condition on actions, we simply learn an embedding A_emb for each action

So it’s clear that in this model the diffusion process is conditioned on the embedding A_emb derived from user actions rather than on text.

Then a noise-augmented start frame is encoded into latents and concatenated onto the noise latents as a second conditioning signal.

So we have a diffusion model that is trained solely on DOOM footage, and that is conditioned on recent DOOM frames and user actions to produce each subsequent frame.
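
Roughly, the wiring I’m picturing looks like this (a toy sketch of my reading, not their code; the shapes are made up, and the real model presumably injects the action embedding through the cross-attention slots that normally take text):

    import torch
    import torch.nn as nn

    class ToyActionConditionedDenoiser(nn.Module):
        # Illustrative stand-in for the SD U-Net, with made-up shapes.
        def __init__(self, num_actions, latent_ch=4, context_frames=8, emb_dim=64):
            super().__init__()
            # One learned vector per discrete action -- the A_emb the paper mentions.
            self.action_emb = nn.Embedding(num_actions, emb_dim)
            self.action_to_bias = nn.Linear(emb_dim, latent_ch)
            # Past-frame latents are concatenated channel-wise with the noise latents.
            self.denoise = nn.Conv2d(latent_ch * (context_frames + 1), latent_ch, 3, padding=1)

        def forward(self, noise_latents, past_frame_latents, actions):
            # noise_latents:      (B, 4, H, W)
            # past_frame_latents: (B, 4 * context_frames, H, W), from the VAE encoder
            # actions:            (B, context_frames) integer action ids
            a = self.action_emb(actions).mean(dim=1)            # (B, emb_dim)
            x = torch.cat([noise_latents, past_frame_latents], dim=1)
            # The real model would feed `a` in via cross-attention; a per-channel
            # bias is the simplest stand-in that keeps this runnable.
            return self.denoise(x) + self.action_to_bias(a)[:, :, None, None]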

So yes, the users are playing it.

However, it should be unsurprising that this is possible. This is effectively just a neural recording of the game. But it’s a cool tech demo.


The agent never interacts with the simulator during training or evaluation. There is no user; there is only an agent which was trained to play the real game and which produced the sequences of game frames and actions that were used to train the simulator and to provide ground-truth sequences of game experience for evaluation. Their evaluation metrics are all based on running short simulations in the diffusion model, initiated with some number of conditioning frames taken from the real game engine. Statements in the paper like "GameNGen shows that an architecture and model weights exist such that a neural model can effectively run a complex game (DOOM) interactively on existing hardware." are wildly misleading.


I wonder if they could somehow feed a trained Gaussian splatting model into this to get better images?

Since the splats are specifically designed for rendering, it seems like that would be an efficient way for the image model to learn the geometry without having to encode it in the image model itself.


I’m not sure how that would help vs just training the model with the conditionings described in the paper.

I’m not very familiar with Gaussian splats models, but aren’t they just a way of constructing images using multiple superimposed parameterized Gaussian distributions, sort of like the Fourier series does with waveforms using sine and cosine waves?
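
(For concreteness, the toy 2D version of the picture in my head is something like the sketch below: just summing weighted, parameterized Gaussians onto a canvas, ignoring the projection, depth sorting and alpha blending that make real 3D splatting work.)

    import numpy as np

    def render_gaussians(means, covs_inv, colors, weights, h=64, w=64):
        # Sum weighted 2D Gaussians onto an RGB canvas -- the "superimposed
        # parameterized Gaussians" idea, minus everything 3D splatting adds.
        ys, xs = np.mgrid[0:h, 0:w]
        pts = np.stack([xs, ys], axis=-1).astype(np.float32)    # (h, w, 2)
        img = np.zeros((h, w, 3), dtype=np.float32)
        for mu, cinv, col, wgt in zip(means, covs_inv, colors, weights):
            d = pts - mu                                         # offsets from center
            mahal = np.einsum('hwi,ij,hwj->hw', d, cinv, d)      # quadratic form
            img += wgt * np.exp(-0.5 * mahal)[..., None] * col
        return np.clip(img, 0.0, 1.0)

    # Two blobs: a wide red one and a tight blue one.
    img = render_gaussians(
        means=[np.array([20.0, 30.0]), np.array([45.0, 40.0])],
        covs_inv=[np.eye(2) / 50.0, np.eye(2) / 8.0],
        colors=[np.array([1.0, 0.2, 0.2]), np.array([0.2, 0.2, 1.0])],
        weights=[1.0, 1.0],
    )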

I’m not seeing how that would apply here but I’d be interested in hearing how you would do it.


I'm not certain where it would fit in, but my thinking is this.

There's been a bunch of work on making splats efficient and good at representing geometry. Reading more, perhaps NeRFs would be a better fit, since they're an actual neural network.

My thinking is that if you trained a NeRF ahead of time to represent the geometry and layout of the levels, and plugged that into the diffusion model (as part of computing the latents, and then also on the other side so it can be used to improve the rendering), then the diffusion model could focus on learning how actions manipulate the world without having to learn the geometry representation.
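
Something like this, purely as hypothetical wiring (the NeRF render call and the VAE encoder here are stand-ins I made up, not anything from the paper):

    # Hypothetical wiring only: `pretrained_nerf` and `vae_encode` are stand-ins.
    import torch

    def conditioning_latents(pretrained_nerf, vae_encode, camera_pose, past_latents):
        # Render the level geometry the NeRF already knows for this viewpoint...
        geometry_frame = pretrained_nerf.render(camera_pose)        # (3, H, W), assumed API
        # ...encode it with the same VAE that encodes real game frames...
        geometry_latents = vae_encode(geometry_frame.unsqueeze(0))  # (1, 4, h, w)
        # ...and stack it next to the past-frame latents, so the diffusion model
        # only has to learn dynamics and appearance, not the level geometry.
        return torch.cat([past_latents, geometry_latents], dim=1)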


I don’t know if that would really help; I have a hard time imagining exactly what that model would be doing in practice.

To be honest, none of the stuff in the paper is very practical; you almost certainly do not want a diffusion model trying to be an entire game under any circumstances.

What you might want to do is use a diffusion model to transform a low-poly, low-fidelity game world into something photorealistic. The geometry, player movement, physics and so on would all make sense, and the model would paint something that looks like reality over it, based on primitive texture cues in the low-fidelity render.
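
A single-frame version of that is basically off-the-shelf image-to-image diffusion today; a minimal sketch with the diffusers library (the model id, prompt and strength are placeholders, and running it per frame gives you no temporal consistency):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",   # placeholder model id
        torch_dtype=torch.float16,
    ).to("cuda")

    # One low-poly frame rendered by the game engine (placeholder path).
    lowfi = Image.open("lowpoly_frame.png").convert("RGB").resize((512, 512))

    # `strength` controls how much gets repainted: low values keep the layout
    # and geometry cues, high values hallucinate more freely.
    photoreal = pipe(
        prompt="photorealistic corridor, volumetric lighting",  # placeholder prompt
        image=lowfi,
        strength=0.4,
        guidance_scale=7.5,
    ).images[0]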

I’d bet money that something like that will happen and it is the future of games and video.


Yeah, I realize this will never be useful for much in practice (although maybe as some kind of client-side prediction for cloud gaming? But if you could run this in real time, you could likely just run the actual game in real time as well, unless there's some massive world running on the server that's too large to stream the geometry for effectively). I was mostly just trying to think of a way to avoid the issues someone mentioned with fake-looking frames, or the model forgetting what the level looks like when you turn around.

Not exactly that, but Nvidia already does something like this; they call it DLSS. It uses previous frames and motion vectors to render the next frame using machine learning.
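
Very roughly, the temporal part amounts to warping the previous frame along the motion vectors and letting a network clean up the result; a toy version of just the warp step (not NVIDIA's actual pipeline):

    import torch
    import torch.nn.functional as F

    def reproject_previous_frame(prev_frame, motion_vectors):
        # prev_frame:     (B, 3, H, W)
        # motion_vectors: (B, 2, H, W) per-pixel (dx, dy) offsets, in pixels,
        #                 pointing from each pixel back to where it was last frame.
        b, _, h, w = prev_frame.shape
        ys, xs = torch.meshgrid(
            torch.arange(h, dtype=torch.float32),
            torch.arange(w, dtype=torch.float32),
            indexing="ij",
        )
        base = torch.stack([xs, ys], dim=-1).expand(b, h, w, 2)
        src = base + motion_vectors.permute(0, 2, 3, 1)
        # grid_sample expects sampling coordinates normalized to [-1, 1].
        src = 2.0 * src / torch.tensor([w - 1, h - 1]) - 1.0
        return F.grid_sample(prev_frame, src, align_corners=True)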


The paper should definitely be clearer on this point, but there's a sentence in section 5.2.3 that makes me think this was playable and played: "When playing with the model manually, we observe that some areas are very easy for both, some areas are very hard for both, and in some the agent performs much better." It may be a failure of imagination, but I can't think of another reasonable way of interpreting "playing with the model manually".


What you're describing reminded me of this cool project:

https://www.youtube.com/watch?v=udPY5rQVoW0 "Playing a Neural Network's version of GTA V: GAN Theft Auto"


You are incorrect, this is an interactive simulation that is playable by humans.

> Figure 1: a human player is playing DOOM on GameNGen at 20 FPS.

The abstract is ambiguously worded, which has caused a lot of confusion here, but the paper is unmistakably clear on this point.

Kind of disappointing to see this misinformation upvoted so highly on a forum full of tech experts.


If the generative model/simulator can run at 20FPS, then obviously in principle a human could play the game in simulation at 20 FPS. However, they do no evaluation of human play in the paper. My guess is that they limited human evals to watching short clips of play in the real engine vs the simulator (which conditions on some number of initial frames from the engine when starting each clip...) since the actual "playability" is not great.


Yeah. If it isn't doing this ("real-time user input and adjusts its output accordingly"), then what could it be doing that is worth a paper?


There is a hint in the paper itself:

It notes, somewhat quietly, that it is based on: "Ha & Schmidhuber (2018) who train a Variational Auto-Encoder (Kingma & Welling, 2014) to encode game frames into a latent vector"

So they most likely took https://worldmodels.github.io/ (which is actually open source) or something similar and swapped the frame generation out for Stable Diffusion, which was released in 2022.
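
For reference, the frame-to-latent-vector step they cite from Ha & Schmidhuber is conceptually just a small convolutional VAE encoder; a bare-bones sketch (not the world models code, and not the Stable Diffusion VAE):

    import torch
    import torch.nn as nn

    class FrameEncoder(nn.Module):
        # Toy VAE encoder: squeeze a 64x64 RGB game frame into a small latent vector.
        def __init__(self, latent_dim=32):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(),
            )
            self.to_mu = nn.Linear(64 * 16 * 16, latent_dim)
            self.to_logvar = nn.Linear(64 * 16 * 16, latent_dim)

        def forward(self, frame):                       # frame: (B, 3, 64, 64)
            h = self.conv(frame)
            mu, logvar = self.to_mu(h), self.to_logvar(h)
            # Reparameterization trick: sample a latent the decoder can be trained from.
            return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)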


>I found it worth pointing out that this paper is NOT describing a system which receives real-time user input and adjusts its output accordingly

Well, you're wrong, as shown in the first video and stated by the authors themselves. Maybe next time check more carefully instead of writing comments in such an authoritative tone about things you don't actually know.


I think someone is playing it, but it has a reduced set of inputs and they're playing it in a very specific way (slowly, avoiding looking back at places they've been) so as not to expose the flaws in the system.

The people surveyed in this study are not playing the game, they are watching extremely short video clips of the game being played and comparing them to equally short videos of the original Doom being played, to see if they can spot the difference.

I may be wrong about how it works, but I think this is just hallucinating in real time. It has no internal state per se; it knows what was on screen in the previous few frames and what inputs the user is pressing, and from that it generates the next frame. Like with video compression, it probably doesn't need to generate a full frame every time, just "differences".
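
In other words, I'd expect the rollout to be a loop something like this (names are made up; `model` stands for the diffusion simulator and `read_input` for whatever polls the keyboard):

    from collections import deque

    def play(model, initial_frames, initial_actions, read_input, steps=600):
        # A sliding window of recent frames and actions is the only "state" there is;
        # each step the model hallucinates one more frame conditioned on that window.
        frames = deque(initial_frames, maxlen=len(initial_frames))
        actions = deque(initial_actions, maxlen=len(initial_actions))
        for _ in range(steps):
            actions.append(read_input())          # current key presses / mouse
            frame = model.predict_next_frame(list(frames), list(actions))  # hypothetical API
            frames.append(frame)                  # nothing else carries over
            yield frame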

As with all the previous AI game research, these are not games in any real sense. They fall apart when played beyond any meaningful length of time (seconds). Crucially, they are not playable by anyone other than the developers in very controlled settings. A defining attribute of any game is that it can be played.


The movement of the player seems a bit jittery, so I inferred something similar on that basis.


Were the agents playing at a real 20 FPS, or did this happen offline, like rendering a Pixar movie?


Ehhh okay, I'm not as convinced as I was earlier. Sorry for misleading. There's been a lot of back-and-forth.

I would've really liked to see a section of the paper explicitly call out that they had humans play in real time. There are a lot of sentences that led me to believe otherwise. It's clear that they used a bunch of agents to simulate gameplay, where those agents submitted inputs that affected the gameplay, and those inputs were captured for training the model. This made it murky as to whether humans ever actually got involved.

This statement, "Our end goal is to have human players interact with our simulation. To that end, the policy π as in Section 2 is that of human gameplay. Since we cannot sample from that directly at scale, we start by approximating it via teaching an automatic agent to play"

led me to believe that, while they had an ultimate goal of real user input (why wouldn't they?), they settled for approximating human input.

I was looking to refute that assumption later in the paper by hopefully reading some words on the human gameplay experience, but instead, under Results, I found:

"Human Evaluation. As another measurement of simulation quality, we provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively)."

and it's like... okay... if you have a section in the results on human evaluation, and your goal is to have humans play, then why are you only talking about humans reviewing video rather than giving some kind of feedback on the human gameplay experience, even if it isn't especially positive?

Still, the Discussion section mentions, "The second important limitation are the remaining differences between the agent’s behavior and those of human players. For example, our agent, even at the end of training, still does not explore all of the game’s locations and interactions, leading to erroneous behavior in those cases." which makes it clearer that humans gave input that went outside the bounds of the automatic agents. That wouldn't happen if it were only agents simulating more input.

Ultimately, I think the paper itself could have been clearer in this regard, but the project website tries to be very explicit by saying upfront, "Real-time recordings of people playing the game DOOM", and it's pretty hard to argue against that.

Anyway. I repent! It was a learning experience going back and forth on my belief here. Very cool tech overall.


It's funny how academic writing works. Authors rarely produce many unclear or ambiguous statements where the most likely interpretation undersells their work...


I knew it was too good to be true, but it seems like real-time video generation can become good enough that it feels like a truly interactive video/game.

Imagine if text2game were possible. There would be some sort of network generating each frame from an image generated by text, with some underlying 3D physics simulation to keep all the multiplayer screens synced.

This paper doesn't seem to demonstrate that possibility; rather, it's worded cleverly enough to make you think people were playing a real-time video. We can't even generate more than 5-10 seconds of video without it hallucinating. Something that persistent would require an extreme amount of gameplay video for training. It can be done, but the video shown by this paper is not true to its words.



