It's totally wild that your first response is shitting on flaws rather than having your jaw drop at machines producing coherent videos from text.
This is _the worst that machines will ever be at this task_, and most of the improvements that need to be made are a matter of engineering ingenuity, which can be translated to research dollars.
It certainly wasn't my intent to trash the whole thing, so I'm sorry it came across that way.
They've done well. They combined a whole bunch of techniques in a new way, or at least in a better way than we've seen before.
I don't think you should be surprised to see these results today.
> This is _the worst that machines will ever be at this task_
This is wrong. We've seen worse, and we've seen far, far worse -- what I mean is that we've already seen plenty of iterative development in video generation,
even if you only consider machine-learning-based video generation from text prompts.
Then consider other generative systems, as well as other video research and technology like motion interpolation, depth map generation, etc.
It's an extremely active field.