When I was working in a linguistics lab as an undergraduate long ago, we looked at spectrograms to identify sounds (specifically places of articulation) as much as we listened to recordings.
So it makes some sense to build a model on them rather than on some other representation of the sound.
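For concreteness, here's a minimal sketch of what that representation looks like in practice: turning a waveform into a log-magnitude spectrogram with scipy.signal.spectrogram. The sample rate, window/hop sizes, and the synthetic sine "speech" are all made up for illustration; a real pipeline would use recorded audio and often a mel-scaled version.

```python
import numpy as np
from scipy import signal

# Stand-in for one second of recorded speech at 16 kHz
# (a pure 440 Hz tone, just so the example is self-contained).
sample_rate = 16000
t = np.linspace(0, 1, sample_rate, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)

# Short-time Fourier analysis: 25 ms windows, 10 ms hop.
freqs, times, spec = signal.spectrogram(
    waveform,
    fs=sample_rate,
    nperseg=400,   # 25 ms window
    noverlap=240,  # 400 - 160 => 10 ms hop
)

# Log scale is roughly what you'd plot or feed to a model.
log_spec = np.log(spec + 1e-10)
print(log_spec.shape)  # (frequency bins, time frames)
```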
I had friends in MIT's computational linguistics group back in the 1980s who held casual get-togethers where they'd take turns handing out spectrograms of human speech for the rest of the group to try to interpret. Apparently this started with some noted researcher asserting that you couldn't read a spectrogram faster than the speech was originally spoken, and one of them decided to learn spectrograms well enough to disprove it by example. That turned out to be easy, but it inspired them to keep going by making more challenging ones - culminating in getting John Moschitta Jr. to record for them :-)