Do you mean that it wouldn't help with ingesting footage and then determining how to act?
I can imagine a robotics architecture with two models: one generates footage (predicting the next frames for what the robot is currently seeing), and a second, simpler model takes that generated footage and only knows how to produce the motor/servo control outputs needed to drive whatever robot platform it is integrated with.
I think that kind of architectural decoupling would be nice. It lets the model with all the world and task-specific knowledge stay agnostic to the underlying robot platform.
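A minimal sketch of what that interface boundary could look like, assuming the two models are exposed behind simple interfaces. All class, method, and parameter names here are hypothetical illustrations of the decoupling, not any existing framework:

```python
# Hypothetical sketch: a platform-agnostic world model that predicts footage,
# and a platform-specific controller that turns predicted footage into motor outputs.
from dataclasses import dataclass
from typing import Protocol, Sequence

Frame = bytes  # placeholder for an encoded camera frame


@dataclass
class MotorCommand:
    servo_id: int
    position: float  # normalized target position, e.g. in [-1.0, 1.0]


class WorldModel(Protocol):
    """Holds world/task knowledge; knows nothing about the robot hardware."""

    def predict_next_frames(self, recent_frames: Sequence[Frame],
                            task: str, horizon: int) -> list[Frame]:
        ...


class PlatformController(Protocol):
    """Smaller, platform-specific model; only maps footage to motor outputs."""

    def frames_to_commands(self, predicted: Sequence[Frame]) -> list[MotorCommand]:
        ...


def control_step(world_model: WorldModel,
                 controller: PlatformController,
                 recent_frames: Sequence[Frame],
                 task: str) -> list[MotorCommand]:
    # The world model "imagines" what it expects to see next for the task...
    predicted = world_model.predict_next_frames(recent_frames, task, horizon=8)
    # ...and the controller works out how to make the robot produce that footage.
    return controller.frames_to_commands(predicted)
```

Swapping robot platforms would then only mean swapping the `PlatformController` implementation, while the world model stays unchanged.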