
1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short-term.

If it were able to invent actions and maps and let the user play "infinite Doom", then it would be very different (and impressive!).



Like many people do with LLMs, you're just demonstrating unawareness of - or disbelief in - the fact that the model doesn't record training data verbatim, but smears it out in high-dimensional space. The model doesn't recall past inputs (which are effectively under extreme lossy compression); it samples from that high-dimensional space to produce output. The high-dimensional representation by necessity captures semantic understanding of the training data.

Generating "infinite Doom" is exactly what this model is doing, as it does not capture the larger map layout well enough to stay consistent with it.


Whether or not a judge understands this will probably form the basis of any precedent set about the legality of image models and copyright.


I like "conditioned brute force" better as a term.


> 1 billion frames in memory... With such a dataset, you have seen practically all realistic possibilities in the short-term.

I mean... no? Not even close? Multiplying the number of game states by the number of inputs at any given frame gives you a number vastly bigger than 1 billion, not even comparable. Even with 20 days of play time to train on, it's entirely likely that at no point did someone stop at a certain location and look to the left from that angle. They might have done so from similar angles, but the model then has to reconstruct some sense of the geometry of the level to synthesize the frame. They might also not have arrived there from the same direction, which again the model needs some smarts to understand.

I get your point: it's very overtrained on these particular levels of Doom, which means you might as well just play Doom. But this is not a hash table lookup we're talking about; it's pretty impressive work.


This was the basis for the reasoning:

Map 1 has 2'518 walkable map units. There are 65'536 angles.

2'518*65'536=165'019'648

If you capture 165M frames, you already cover all the possibilities in terms of camera / player view, but probably the diffusion models don't even need to have all the frames (the same way that LLMs don't).
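
As a quick sanity check of that arithmetic, here is a minimal Python sketch. The names are made up for illustration, and it assumes exactly one player position per walkable map unit, which is the simplification this estimate relies on. The only inputs are the two figures above and the 1 billion frame count quoted upthread.

```python
# Figures quoted in this thread; the variable names are illustrative only.
WALKABLE_MAP_UNITS = 2_518          # walkable map units on map 1
VIEW_ANGLES = 65_536                # distinct view angles (2**16)
TRAINING_FRAMES = 1_000_000_000     # "1 billion frames" from upthread


def camera_views(map_units: int, angles: int) -> int:
    """Count (position, angle) pairs, assuming one position per map unit."""
    return map_units * angles


total_views = camera_views(WALKABLE_MAP_UNITS, VIEW_ANGLES)
print(f"{total_views:,}")  # 165,019,648

# Average number of training frames available per (position, angle) pair.
frames_per_view = TRAINING_FRAMES / total_views
print(f"{frames_per_view:.1f} frames per view on average")
```

Under that one-position-per-unit assumption, 1 billion frames would be only a handful of samples per view, so the estimate depends heavily on whether positions really are that coarse.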


There's also enemy motion, enemy attacks, shooting, and UI considerations, which make the combinatorics explode.

And Doom movement isn't tile based. The map may be, but you can be in many many places on a tile.
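
To put a rough number on that, here is a small Python sketch. Everything except the first two factors is a hypothetical placeholder I picked for illustration, not something measured from Doom; the point is only how quickly the product grows once you allow more than one position per map unit and add other state.

```python
# Rough sketch of how the state count grows once movement is not tile based.
# Only the first two factors come from this thread; the rest are made-up
# placeholders, not facts about Doom.
from math import prod

factors = {
    "walkable_map_units": 2_518,         # figure quoted upthread
    "view_angles": 65_536,               # figure quoted upthread
    "sub_positions_per_unit": 16 * 16,   # hypothetical: 16x16 grid inside each unit
    "enemy_configurations": 100,         # hypothetical placeholder
    "weapon_and_ui_states": 10,          # hypothetical placeholder
}


def state_count(fs: dict[str, int]) -> int:
    """Multiply all factors together, treating them as independent."""
    return prod(fs.values())


total_states = state_count(factors)
print(f"{total_states:,} states vs 1,000,000,000 training frames")
```

With these placeholder values the product lands in the tens of trillions, far above 1 billion, so even modest extra factors rule out having seen every frame.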


Do you have to be exactly on a tile in Doom? I thought the guy walked smoothly around the map.


> I thought the guy walked smoothly around the map.

Correct. You are certainly not moving between the tiles as discrete units in Doom.


I think enemies and effects are probably in there.



