
Just the other day there was news about how many mistakes AI-assisted journaling has caused – wrong names, wrong diagnoses, typos (buksmärta -> kuksmärta, which is hilarious but serious).

Some of it is surely teething problems, but unless there is a robust check upon implementation it might just add another layer of inefficient new public management make-work to the system.

https://sverigesradio.se/artikel/ai-journaler-i-sjukvarden-k...



It feels to me like an autopilot problem in the making. "This thing means that you don't have to keep your eyes on the road - but please ensure you keep your eyes on the road, in case of errors"


The issue is, if you have any kind of rare condition, it already is this way. Much like the entire white side of a semi presented to you across the road is a rare condition for Autopilot, a huge number of 'rare' diseases already present problems for humans, leading doctors and their staff to make errors by assuming the most likely condition. There's a saying, "When you hear hoofbeats, think horses, not zebras," but in hospitals zebras and even unicorns do show up, especially in cases with recurring problems.


I found it interesting that your mind went to Tesla's Autopilot. My mind went to operating airplanes. Most newer small planes have some form of GPS, but you're technically not supposed to use instrument navigation until you're certified to do so. I haven't met a single pilot who didn't do so, though.

Anyway, it creates the very problem you mentioned but just replace "road" with "outside the cockpit".


> you're technically not supposed to use instrument navigation until you're certified to do so

You can use them all you like. You just can't fly in conditions where you have to use them.


> but you're technically not supposed to use instrument navigation until you're certified to do so

What do you mean by this? Not having an IFR rating does not mean you're not allowed to use the navigation aids or the plane's autopilot.


Reading and doing minor edits is much less of a cognitive load than writing.

The article does not suggest that doctors should blindly trust the SOAP note created by the tool in question.


> The article does not suggest that doctors should blindly trust the SOAP note created by the tool in question.

But that's what will inevitably happen at some point, when they get to the point of only rarely making big dangerous mistakes.


I don’t think we’re yet in a position where we can make claims about how inevitable certain outcomes are.

It is important to remind people that technology of any sort can be error-prone and that human oversight is needed for any automated process, LLM-based or not!

I work in the legal industry and every lawyer is aware of the guy who used ChatGPT to spit out non-existing case law!


Apparently not enough to prevent it from being repeated... One of Trump's former lawyers did it months after the first case made national news.

https://text.npr.org/2023/12/30/1222273745/michael-cohen-ai-...


> But that's what will inevitably happen at some point, when they get to the point of only rarely making big dangerous mistakes.

So the same as doctors making occasional big, dangerous mistakes that cost lives. Seems like it would be a win then, as it takes some mental load off of doctors so they can focus where they should: on the patients, not on note-taking.


> So the same as doctors making occasional big, dangerous mistakes that cost lives.

Will it be? There are already unanswered questions on who's liable if Tesla's FSD runs you into someone.


I would assume you, the person behind the wheel of the car. Much the same as the doctor/staff hitting the submit button attesting to the validity of the records.


So Boeing should not have any liability for the MCAS crashes, because the pilots were in control?


There's not much that a pilot can do when a plane is not working correctly. They can recognize the issue but they might not be in a position to do anything about it.

If an auto-form filler is not working correctly the doctor can also recognize the issue and also be in a position to do something about it, namely, fix the error before they submit the form.

That is to say that there's a world of difference between a pilot flying a plane and a doctor filling out a form.


Isn't the vast majority of the world using computers without ECC memory already blindly trusting there are no bit flips causing silent corruption?


> …when they get to the point of only rarely making big dangerous mistakes.

Are you intentionally rubbing FUD on that or am I mis-reading you? I don’t think we need to wait to rely on technologies until they’ve achieved perfection - just when their mistakes are less frequent, less dangerous, or more predictable than human mistakes for the same task.


I'm saying getting close to perfection is truly dangerous territory, because everyone gets very complacent at that point.

As a concrete example: https://www.cbsnews.com/news/pilots-fall-asleep-mid-flight-1...


We are already there with humans. Most people take a doctor at their word and don't bother to get a second opinion.


At that point, you've at least consulted with a medically trained professional who's licensed (which they have to regularly renew), has to complete annual CME, can be disciplined by a medical board, carries medical malpractice insurance, etc.

There should be requirements for any AI tool provider in the medical space to go through something like an IRB (https://en.wikipedia.org/wiki/Institutional_review_board) given they're fundamentally conducting medical experimentation on patients, and patients should have to consent to its use.


In the context described, it's acting as a tool for a doctor. AI scribes are not conducting experiments.


The use of the AI to treat patients is a medical experiment.


Any change to the practice is an experiment.


Exactly. If you have any kind of illness that displays atypical symptoms or is otherwise rare, your life is in your own hands. Even something that is somewhat common like EDS can get you killed by doctors missing the signs. Keep a printout of all your own symptoms as they evolve over time, and immediately bring up anything that conflicts with what the doctor says.


Thinking of reading and doing edits as less work than entering it yourself is exactly what will cause critical errors to be made. The article may not suggest blind trust, but just as there are Tesla drivers who are supposed to watch the road, there will be users who don't check. And in a medical record that can be deadly.


> typos (buksmärta -> kuksmärta which is hilarious but serious)

To save people looking it up, that one-char difference changes "abdominal pain" into "cock pain".

Wow.


Well, they are close enough to each other, aren't they? /s

That is the main problem with the AI: it is close enough, but never there.


This seems very similar to self driving cars.

At the beginning they will be worse than humans and cause deaths that humans would have prevented while at the same time probably saving lives where a human would make a mistake.

But not far down the road they'll become much better than humans. Even if they do occasionally make a mistake and cause a death that a human wouldn't have, they'll save far more lives by not making the mistakes that humans do.


> But not far down the road they'll become much better than humans

While I think you're correct, there is no proof this will ever be achieved.

It very well may not be possible along our current path. It may take 100 years, or 1,000, to get there.

And yes it could take only 20 more. But to state this as a certainty?

No.


I think it's basically certain that computers are better at transcribing notes than humans right now, and they will continue to get better. I mean, we're basically there; I trust spell check and grammar check more than myself.

That's what we're talking about here. IMHO computers are already better at that, so what possibly makes you think this won't happen?


My reply was to the parent post, which started:

> This seems very similar to self driving cars.

And continued discussing self driving cars.

Regardless, what I said stands.


Ah, then I misinterpreted. My fault, appreciate you letting me know.


I wouldn't be surprised if Region Blekinge were using something much worse and much more expensive than Whisper for their transcription.

I've been transcribing A LOT of SR (Swedish Radio) shows as part of https://nyheter.sh/, and Whisper (self-hosted) has been very accurate.


Tangent here: really? I've found base Whisper has concerning error rates for non-US English accents; I imagine the same is true for other languages with a large regional mode to the source dataset.

Whisper + an LLM can recover some of the gaps by filling in contextually plausible bits, but then it's not a transcript and may contain hallucinations.

There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.
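For intuition, the N-best idea can be caricatured as a word-level majority vote over sampled hypotheses, a crude ROVER-style merge. This is purely an illustrative sketch, not the LLM-distillation approaches mentioned above (those learn the merge, and real systems align hypotheses with edit distance rather than matching words by position). The example transcripts are made up:

```python
# Toy ROVER-style merge: given N hypothesis transcripts (e.g. sampled from
# Whisper at different temperatures), pick the majority word per position.
from collections import Counter
from itertools import zip_longest

def merge_hypotheses(hypotheses):
    """Merge N transcripts by majority vote at each word position."""
    tokenized = [h.split() for h in hypotheses]
    merged = []
    # zip_longest pads shorter hypotheses; a real merger would align them.
    for column in zip_longest(*tokenized, fillvalue=None):
        words = [w for w in column if w is not None]
        merged.append(Counter(words).most_common(1)[0][0])
    return " ".join(merged)

hyps = [
    "the patient reports abdominal pain",
    "the patient reports abdominal pain",
    "a patient reports abnormal pain",
]
print(merge_hypotheses(hyps))  # -> "the patient reports abdominal pain"
```

The voting stabilizes isolated errors (each hypothesis gets something wrong, but rarely the same thing), which is the same intuition behind distilling N-best lists with an LLM.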


Language detection in the presence of strong accents is, in my opinion, one of the most under-discussed biases in AI.

Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language. Whisper is also affected by this issue, as you noted.

The root of this problem lies in how language detection typically works. It relies on analyzing audio via MFCCs (Mel-Frequency Cepstral Coefficients), a method inspired by human auditory perception.

MFCC is a part of the "psychoacoustic" field, focusing on how we perceive sound. It emphasizes lower frequencies and uses techniques like normalized Fourier decomposition to convert audio into a frequency spectrum.

However, this approach has a limitation: it's based purely on acoustics. So, if you speak English with a strong accent, the system may not understand the content but instead judge based on your prosody (rhythm, stress, intonation).
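To make that pipeline concrete, here is a rough pure-NumPy sketch of the classic MFCC steps (framing, windowing, FFT power spectrum, mel filterbank, log, DCT). It's a simplified illustration of the idea, not production code; libraries like librosa or torchaudio add pre-emphasis, liftering, and other refinements:

```python
# Simplified MFCC pipeline in pure NumPy.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=26, n_ceps=13):
    # 1. Slice the signal into overlapping frames and apply a Hann window.
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hanning(n_fft))
    frames = np.array(frames)

    # 2. Power spectrum via FFT (non-negative frequencies only).
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

    # 3. Mel filterbank: triangular filters spaced evenly on the mel scale,
    #    which emphasizes lower frequencies, as noted above.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):
            fbank[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i, k] = (r - k) / max(r - c, 1)

    # 4. Log energy in each mel band, then a DCT to decorrelate.
    mel_energy = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return mel_energy @ dct.T  # shape: (n_frames, n_ceps)
```

Each frame yields a small vector of cepstral coefficients; acoustic language-ID classifies sequences of these vectors, which is exactly why it captures timbre and prosody but nothing about the words being said.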

With the team at Gladia, we've developed a hybrid approach that combines psycho-acoustic features with content understanding for dynamic language detection.

In simple terms, our system doesn't just listen to how you speak but also understands what you're saying. This dual approach allows for efficient code-switching and doesn't let strong accents fall through the cracks. The system is based on optimized Whisper, among other models.

In the end, we managed to solve 99% of edge cases involving strong accents, despite the initial Whisper bias there. We've also worked a lot on hallucinations as a separate problem, which resulted in our proprietary model called Whisper-Zero.

If you want to give it a try, there's a free tier available. I'm happy to bounce around ideas on this topic any time; it's super fascinating to me.


Isn't the issue more that traditional ASR systems use US (General American) phonetic transcriptions so then struggle with accents that have different splits and mergers?

My understanding on whisper is that it is using a model trained on different accents, specifically from LibriVox. The quality would depend on the specific model selected.

The MFCC or other acoustic analysis is to detect the specific phonemes of speech. This is well understood (e.g. the first 3 formants corresponding to the vowels and their relative positions between speakers), and the inverse is used for a lot of the modern TTS engines where the MFCC is predicted and the waveform reconstructed from that (see e.g. https://pytorch.org/audio/stable/transforms.html).

Some words can change depending on adjacency to other words, or other speech phenomena (like H dropping) can alter the pronunciation of words. Then you have various homophones in different accents. All of these make it hard to go from the audio/phonetic representation to transcriptions.

This is in part why a relatively recent approach is to train the models on the actual spoken text and not the phonetics, so it can learn to disambiguate these issues. Note that this is not perfect, as TTS models like coqui-ai will often mispronounce words in different contexts as the result of a lack of training data or similar issues.

I'm wondering if it makes sense to train the models with the audio, phonetic transcriptions, and the text and score it on both phonetic and text accuracy. The idea being that it can learn what the different phonemes sound like and how they vary between speakers to try and stabilise the transcriptions and TTS output. The model would then be able to refer to both the audio and the phonemes when making the transcriptions, or for TTS to predict the phonemes then the phonemes as an additional input with the text to generate the audio -- i.e. it can use the text to infer things like prosody.
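The dual-scoring idea in the last paragraph might look something like this sketch: evaluate a hypothesis on both word error rate (text) and phoneme error rate (a phonetic transcription), and optimize a weighted combination. All names, the `alpha` weight, and the ARPABET-style phoneme strings are hypothetical, just to show the shape of the metric:

```python
# Combined text + phonetic scoring sketch.

def edit_distance(ref, hyp):
    """Classic Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def error_rate(ref_tokens, hyp_tokens):
    # WER when tokens are words, PER when tokens are phonemes.
    return edit_distance(ref_tokens, hyp_tokens) / max(len(ref_tokens), 1)

def combined_score(ref_text, hyp_text, ref_phones, hyp_phones, alpha=0.5):
    wer = error_rate(ref_text.split(), hyp_text.split())
    per = error_rate(ref_phones.split(), hyp_phones.split())
    return alpha * wer + (1 - alpha) * per

score = combined_score(
    "abdominal pain", "abnormal pain",               # text: 1 of 2 words wrong
    "AE B D AA M AH N AH L", "AE B N AO R M AH L",   # hypothetical phonemes
)
```

A model trained against both terms would be penalized for outputs that read plausibly but sound wrong, and vice versa, which is the stabilizing effect being proposed.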


>Traditional ASR systems struggle when English (or any language) is spoken with a heavy accent, often confusing it with another language.

Humans also have difficulty with heavy accents, no?


True, but we can notice that that's the case, and then try to listen more carefully or ask for clarification


I've found that WhisperX with the medium model has been amazing at subtitling shows containing English dialects (British, Scottish, Australian, New Zealand-ish). It not only nails all the normal speech, but even gets the names and completely made up slang words. Interestingly you can tell it was trained from source material with dialects because it subtitles their particular spelling; so someone American will say color, and someone British will say colour.

I can't speak to how it performs outside of production quality audio, but in the hundreds of hours of subtitles that I've generated I don't think I've seen a single error.


IIUC, it's trained on LibriVox audio mainly, along with a few other sources. I'm not sure how it is handling spelling as the spelling will depend on the source content being read, unless the source text has been processed/edited to align with the dialect.


> There are alternatives that share Whisper internal states with an LLM to improve ASR, as well as approaches that sample N-best hypotheses from Whisper and fine-tune an LLM to distill the hypotheses into a single output. Haven't looked too much into these yet given how expensive each component is to run independently.

Any recommended ones you've looked at?


This is a situation where running within the context of the EMR and having access to the existing chart data is likely to make a lot of difference. My bigger concern is that there are a lot of different things being lumped together under "AI", and this is going to hit a bunch of different areas of machine learning.


This would be a major concern for me.

The one time I used AI meeting notes, some important details were wrong. And beyond that, the notes were just terrible. A literal transcription can be useful. A summary of the substance of the meeting can be useful. This was neither. A human would know that a tangent talking about the weather is not important, but AI notes are just as likely to fill the document with "Chris mentioned it had rained yesterday but he was hoping to cook hamburgers when the sun comes out. Alice and Bob expressed opinions about side dishes, with the consensus being that fries are more appropriate than potato salad." as it was to miss a nuanced point that a human would have recorded because they understood the purpose of the meeting. And then it'd give me an action item to buy corn.


> Some of it is surely teething problems...

Just gonna adjust the temperature of your baby here, ok, now he should grow up just fine.



