Open source transcription models are already good enough to do this, and with good context engineering, the base models might be good enough, too.
It wouldn't be trivial to implement, but I think it's possible already.
Open source transcription models are already good enough to do this, and with good context engineering, the base models might be good enough, too.
It wouldn't be trivial to implement, but I think it's possible already.