
No? He’s talking about rendered text


From the post, he's referring to text input as well:

> Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

Italicized emphasis mine.

So he's suggesting, or at least wondering, whether the vision pathway should be the only input to the LLM, with the model reading the text from pixels. That would mean a rasterization step on any text input to turn it into an image.

Thus, you don't need to draw a picture of anything; you just render the text itself to a raster image and feed that to the vision model.
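For concreteness, here's a minimal sketch of that rasterization step in Python, assuming Pillow is installed (the function name, canvas width, and wrap width are illustrative choices, not anything from the post):

    from PIL import Image, ImageDraw, ImageFont
    import textwrap

    def rasterize_text(text: str, width: int = 1024, line_height: int = 18) -> Image.Image:
        # Render plain text onto a white canvas; the image, not the
        # string, is what gets fed to the vision encoder.
        font = ImageFont.load_default()  # a real pipeline would load a TTF via ImageFont.truetype
        lines = textwrap.wrap(text, width=100) or [""]  # rough character-count wrapping
        img = Image.new("RGB", (width, line_height * len(lines) + 8), "white")
        draw = ImageDraw.Draw(img)
        for i, line in enumerate(lines):
            draw.text((4, 4 + i * line_height), line, fill="black", font=font)
        return img

    img = rasterize_text("Even pure text input gets rendered before the model sees it.")
    img.save("prompt.png")  # this PNG, not the original string, becomes the model input

Under that scheme the text tokenizer drops out entirely: the only thing the model ever ingests is pixels.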



