Maybe two factors helped achieve the impressive result with the quotation marks:

- auditory cues

- the sentence would be grammatically incorrect and make no sense without them

Just guessing out of the blue.

But I think it's likely that LLMs (and other speech recognition systems) need to exploit sentence context to recognize individual words and punctuation, and this is an example where it went well.
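
To make that guess a bit more concrete, here is a toy sketch of how context could break a tie between two words that sound the same. It is not how any real recognizer works: the `ACOUSTIC` and `CONTEXT` tables, their numbers, and the `pick_word` function are all made up for illustration.

```python
import math

# Hypothetical acoustic scores: how well each candidate matches the audio alone.
# The two words sound alike, so the audio cannot decide between them.
ACOUSTIC = {"where": 0.5, "were": 0.5}

# Hypothetical context scores: how plausible each candidate is after the previous word.
CONTEXT = {
    ("example", "where"): 0.9,
    ("example", "were"): 0.1,
    ("we", "were"): 0.9,
    ("we", "where"): 0.1,
}


def pick_word(previous_word, candidates):
    """Return the candidate with the best combined acoustic + context score."""

    def score(word):
        acoustic = ACOUSTIC[word]
        # Unseen word pairs get a small fallback probability (an assumption).
        context = CONTEXT.get((previous_word, word), 0.01)
        # Adding log-probabilities is the same as multiplying the probabilities.
        return math.log(acoustic) + math.log(context)

    return max(candidates, key=score)


print(pick_word("example", ["where", "were"]))
print(pick_word("we", ["where", "were"]))
```

With these made-up numbers the audio is a coin flip, so the previous word alone decides which spelling wins. Punctuation like quotation marks could in principle be treated the same way, as one more candidate token scored against its context.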

Human listening is similar in a way: we can recognize words even when they are mumbled or spoken very fast, as long as we have context.

So we always hear phrases rather than words.