Just the other day there was news about how many mistakes AI-assisted journaling has caused – wrong names, wrong diagnoses, typos (Swedish buksmärta, "abdominal pain", became kuksmärta, "dick pain", which is hilarious but serious).
Some of it is surely teething problems, but unless there are robust checks at implementation it might just add another layer of inefficient new-public-management make-work to the system.
It feels to me like an autopilot problem in the making. "This thing means that you don't have to keep your eyes on the road - but please ensure you keep your eyes on the road, in case of errors"
The issue is, if you have any kind of rare condition, it already is this way. Much like the white side of a semi stretched across the road was a rare condition for Autopilot, a huge number of 'rare' diseases already present problems for humans, leading doctors and their staff to make errors by assuming the most likely condition. The saying goes "when you hear hoofbeats, think horses, not zebras", but in hospitals zebras and even unicorns do show up, especially in cases with recurring problems.
I found it interesting that your mind went to Tesla's Autopilot. Mine went to flying airplanes. Most newer small planes have some form of GPS, but you're technically not supposed to use instrument navigation until you're certified to do so. I haven't met a single pilot who didn't do so anyway, though.
Anyway, it creates the very problem you mentioned but just replace "road" with "outside the cockpit".
I don’t think we’re yet in a position where we can make claims about how inevitable certain outcomes are.
It is important to remind people that technology of any sort can be error prone and that human oversight should be relied on for any automated process, LLM based or not!
I work in the legal industry and every lawyer is aware of the guy who used ChatGPT to spit out non-existent case law!
> But that's what will inevitably happen at some point, when they get to the point of only rarely making big dangerous mistakes.
So the same as doctors making occasional big dangerous mistakes that cost lives. Seems like it would be a win then, as it takes some mental load off doctors so they can focus where they should: on the patients, not on note-taking.
I would assume you, the person behind the wheel of the car. Much the same as the doctor or staff member hitting the submit button attesting to the validity of the records.
There's not much that a pilot can do when a plane is not working correctly. They can recognize the issue but they might not be in a position to do anything about it.
If an auto-form filler is not working correctly the doctor can also recognize the issue and also be in a position to do something about it, namely, fix the error before they submit the form.
That is to say, there's a world of difference between a pilot flying a plane and a doctor filling out a form.
> …when they get to the point of only rarely making big dangerous mistakes.
Are you intentionally spreading FUD or am I misreading you? I don’t think we need to wait to rely on technologies until they’ve achieved perfection – just until their mistakes are less frequent, less dangerous, or more predictable than human mistakes for the same task.
At that point, you've at least consulted with a medically trained professional who's licensed (which they have to regularly renew), has to complete annual CME, can be disciplined by a medical board, carries medical malpractice insurance, etc.
There should be requirements for any AI tool provider in the medical space to go through something like an IRB (https://en.wikipedia.org/wiki/Institutional_review_board) given they're fundamentally conducting medical experimentation on patients, and patients should have to consent to its use.
Exactly. If you have any kind of illness that is displaying atypical symptoms or is otherwise rare, your life is in your own hands. Even something somewhat common like EDS can get you killed by doctors missing the signs. Keep a printout of your own symptoms as they evolve over time, and immediately bring up anything that conflicts with what the doctor says.
Thinking of reading and editing as less work than entering it yourself is exactly what will cause critical errors to be made. The tool may not suggest that, but just as there are Tesla drivers who are supposed to watch the road and don't, there will be users who don't check. And in a medical record that can be deadly.
At the beginning they will be worse than humans and cause deaths that humans would have prevented while at the same time probably saving lives where a human would make a mistake.
But not far down the road they'll become much better than humans. Even if they occasionally make a mistake and cause a death that a human wouldn't have, they'll save far more lives by not making the mistakes that humans do.
I think it's a near certainty that computers are better at transcribing notes than humans already, and they will only get better. I trust spell check and grammar check more than myself.
That's what we're talking about here. IMHO computers are already better at that, what possibly makes you think this won't happen?
Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages with a large regional mode to the source dataset.
Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.
There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
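The N-best rescoring idea can be sketched with a toy stand-in for the LLM. Everything here – the hypothesis list, the unigram "LM", and the 0.5 interpolation weight – is made up for illustration; a real setup would use Whisper's beam hypotheses and an LLM's token log-likelihoods:

```python
import math

# Toy unigram "LM": log-probabilities for a tiny vocabulary.
# A real system would query an LLM for sentence log-likelihoods instead.
LM = {"i": -1.0, "have": -1.5, "a": -1.2, "stomach": -3.0, "ache": -3.2,
      "moustache": -12.0, "stone": -4.0}

def lm_logprob(sentence, oov=-10.0):
    """Sum of per-word log-probs; unknown words get a fixed penalty."""
    return sum(LM.get(w, oov) for w in sentence.lower().split())

def rescore(nbest, lam=0.5):
    """Pick the hypothesis maximizing ASR log-prob + lam * LM log-prob."""
    return max(nbest, key=lambda h: h[1] + lam * lm_logprob(h[0]))[0]

# Hypothetical 3-best list from the ASR decoder: (text, acoustic log-prob).
# The acoustically best guess is implausible; the LM pulls toward sense.
nbest = [("i have a moustache", -4.0),
         ("i have a stomach ache", -4.5),
         ("i have a stone ache", -5.0)]
print(rescore(nbest))  # → i have a stomach ache
```

The interpolation weight controls how much you trust the LM over the acoustics; set it too high and you get exactly the hallucination problem mentioned above, where the output is plausible text rather than a transcript.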
Language detection in the presence of strong accents is, in my opinion, one of the most under-discussed biases in AI.
Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.
The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel Frequency Cepstral Coefficients), a representation inspired by human auditory perception.
MFCC is a part of the "psychoacoustic" field, focusing on how we perceive sound. It emphasizes lower frequencies and uses techniques like normalized Fourier decomposition to convert audio into a frequency spectrum.
However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
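For the curious, the pipeline described above – a frequency spectrum per frame, mel-spaced filters emphasizing lower frequencies, then a decorrelating transform – can be computed from scratch in a few lines. This is a minimal sketch with typical-but-arbitrary parameter choices, not any production implementation:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def dct2_ortho(x):
    # Orthonormal DCT-II along the last axis: decorrelates log-mel energies
    n = x.shape[-1]
    m = np.arange(n)
    basis = np.cos(np.pi * (m[None, :] + 0.5) * m[:, None] / n)  # [k, m]
    out = x @ basis.T * np.sqrt(2.0 / n)
    out[..., 0] /= np.sqrt(2.0)
    return out

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_ceps=13):
    # Frame the signal and apply a Hamming window
    frames = np.array([signal[s:s + n_fft] * np.hamming(n_fft)
                       for s in range(0, len(signal) - n_fft + 1, hop)])
    # Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # Triangular filters spaced evenly on the mel scale, so resolution
    # is concentrated at lower frequencies, like human hearing
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    # Log mel energies, then DCT; keep only the first few coefficients
    mel_energy = np.log(power @ fbank.T + 1e-10)
    return dct2_ortho(mel_energy)[:, :n_ceps]
```

Note that everything after windowing throws information away – which is exactly why a purely MFCC-based detector can key on prosody and miss content.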
With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.
In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.
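A toy version of such a hybrid detector – my own illustration of the general idea, not Gladia's actual system – might combine an acoustic language posterior with a content cue from a draft transcript:

```python
import numpy as np

LANGS = ["en", "fr", "sv"]
# Content cue: per-language function words, a crude stand-in for a
# model's language posterior over the transcribed text.
STOPWORDS = {"en": {"the", "and", "is"},
             "fr": {"le", "et", "est"},
             "sv": {"och", "är", "det"}}

def content_scores(transcript):
    """Smoothed distribution over languages from function-word counts."""
    words = transcript.lower().split()
    counts = np.array([sum(w in STOPWORDS[l] for w in words)
                       for l in LANGS], dtype=float)
    return (counts + 1.0) / (counts + 1.0).sum()  # add-one smoothing

def hybrid_language(acoustic_probs, transcript, w=0.5):
    """Geometric mixture of acoustic and content-based posteriors."""
    p = np.asarray(acoustic_probs) ** w * content_scores(transcript) ** (1 - w)
    return LANGS[int(np.argmax(p / p.sum()))]

# Heavy accent: the acoustics lean French, but the words are English.
print(hybrid_language([0.30, 0.55, 0.15], "the weather is nice and warm"))
# → en
```

The same mixture, recomputed per segment, is also what makes code-switching tractable: each stretch of speech gets its own combined decision instead of one global acoustic guess.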
In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.
If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.
Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions so then struggle with accents that have different splits and mergers?
My understanding of Whisper is that it uses a model trained on different accents, specifically from LibriVox. The quality would depend on the specific model selected.
The MFCC or other acoustic analysis is to detect the specific phonemes of speech. This is well understood (e.g. the first 3 formants corresponding to the vowels and their relative positions between speakers), and the inverse is used for a lot of the modern TTS engines where the MFCC is predicted and the waveform reconstructed from that (see e.g. https://pytorch.org/audio/stable/transforms.html).
Some words can change depending on adjacency to other words, or other speech phenomena (like H dropping) can alter the pronunciation of words. Then you have various homophones in different accents. All of these make it hard to go from the audio/phonetic representation to transcriptions.
This is in part why a relatively recent approach is to train the models on the actual spoken text and not the phonetics, so it can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as the result of a lack of training data or similar issues.
I'm wondering if it makes sense to train the models on the audio, phonetic transcriptions, and the text, and score them on both phonetic and text accuracy. The idea is that the model can learn what the different phonemes sound like and how they vary between speakers, to help stabilise the transcriptions and the TTS output. The model could then refer to both the audio and the phonemes when making transcriptions; for TTS it could predict the phonemes first and then use them as an additional input alongside the text to generate the audio – i.e. it can use the text to infer things like prosody.
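That dual-objective idea could be sketched as a weighted sum of losses over two output heads sharing one encoder. The 50/50 weighting and the tiny shapes here are arbitrary toy choices, not from any real model:

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean cross-entropy of integer labels under a row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def joint_loss(phoneme_logits, phoneme_labels, text_logits, text_labels,
               alpha=0.5):
    """Score the shared encoder on BOTH the phoneme head and the text
    head, so gradients push it toward representations that support
    stable phonetics as well as correct spellings."""
    return (alpha * softmax_xent(phoneme_logits, phoneme_labels)
            + (1 - alpha) * softmax_xent(text_logits, text_labels))
```

A real system would use sequence losses (e.g. CTC or attention decoding) rather than per-step cross-entropy, but the multi-task structure is the same.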
I've found that WhisperX with the medium model has been amazing at subtitling shows containing English dialects (British, Scottish, Australian, New Zealand-ish). It not only nails all the normal speech, but even gets the names and completely made up slang words. Interestingly you can tell it was trained from source material with dialects because it subtitles their particular spelling; so someone American will say color, and someone British will say colour.
I can't speak to how it performs outside of production quality audio, but in the hundreds of hours of subtitles that I've generated I don't think I've seen a single error.
IIUC, it's trained on LibriVox audio mainly, along with a few other sources. I'm not sure how it is handling spelling as the spelling will depend on the source content being read, unless the source text has been processed/edited to align with the dialect.
> There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
This is a situation where running within the context of the EMR and having access to the existing chart data is likely to make a lot of difference. My bigger concern is that there are a lot of different things being lumped together under "AI", and this is going to hit a bunch of different areas of machine learning.
The one time I used AI meeting notes, some important details were wrong. And beyond that, the notes were just terrible. A literal transcription can be useful. A summary of the substance of the meeting can be useful. This was neither. A human would know that a tangent talking about the weather is not important, but AI notes are just as likely to fill the document with "Chris mentioned it had rained yesterday but he was hoping to cook hamburgers when the sun comes out. Alice and Bob expressed opinions about side dishes, with the consensus being that fries are more appropriate than potato salad." as it was to miss a nuanced point that a human would have recorded because they understood the purpose of the meeting. And then it'd give me an action item to buy corn.
https://sverigesradio.se/artikel/ai-journaler-i-sjukvarden-k...