Training a model to produce a high-quality image is different from training it to produce a high-quality image of the specific thing you want. The former has substantially more flexibility. It is ridiculous to compare the two.

As for humans, I will repeat my constant reminder. There are 3 parts to language: 1) what you intend to convey (what's in your head), 2) what you actually say/write (encoding mental to physical), 3) what the other person understands (decoding physical to mental). These 3 things can carry 3 different meanings. Do your best in the first two, but the third requires the other person to be acting in good faith.