If you dig ml/vision papers from old, you will see that formulation-wise they actually did, but they lacked the data, compute, and the mechanistic machinery provided by the transformer architecture. The wheels of progress are slow and requires many rotations to finally reach somewhere.