A couple of months ago, my doctor showed off an AI transcription tool he used to record and summarize patient meetings. In my case, the summary was fine, but researchers cited in this report by The Associated Press have found that's not always the case for transcriptions created by OpenAI's Whisper, which powers a tool many hospitals use: sometimes it just makes things up entirely.
Whisper is used by a company called Nabla for a tool that it estimates has transcribed 7 million medical conversations, according to the AP. More than 30,000 clinicians and 40 health systems use it, the outlet writes. The report says that Nabla officials "are aware that Whisper can hallucinate and are addressing the problem." In a blog post published Monday, executives wrote that their model includes improvements to account for the "well-documented limitations of Whisper."
According to the researchers, "While many of Whisper's transcriptions were highly accurate, we find that roughly one percent of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio… 38 percent of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority."
The researchers noted that "hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations," which they said is more common for those with a language disorder called aphasia. Many of the recordings they used were gathered from TalkBank's AphasiaBank.
One of the researchers, Allison Koenecke of Cornell University, posted a thread about the study showing several examples like the one included above.
The researchers found that the AI-added phrases could include invented medical conditions or phrases you might expect from a YouTube video, such as "Thanks for watching!" (OpenAI reportedly used Whisper to transcribe over a million hours of YouTube videos to train GPT-4.)
OpenAI spokesperson Taya Christianson emailed a statement to The Verge:
We take this issue seriously and are continually working to improve, including reducing hallucinations. For Whisper use on our API platform, our usage policies prohibit use in certain high-stakes decision-making contexts, and our model card for open-source use includes recommendations against use in high-risk domains. We thank researchers for sharing their findings.
On Monday, Nabla CTO Martin Raison and machine learning engineer Sam Humeau published a blog post titled "How Nabla uses Whisper." Raison and Humeau say Nabla's transcriptions are "not directly included in the patient record": a second layer of checking queries a large language model (LLM) against the transcript and the patient's context, and "Only facts for which we find definitive evidence are considered valid."
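To make that description concrete, here is a minimal sketch of what a second-pass "definitive evidence" filter could look like. This is not Nabla's actual implementation; the function names and the yes/no prompting scheme are hypothetical, and query_llm stands in for whatever model the product really calls.

```python
# Hypothetical sketch of a second-pass check: candidate facts extracted from a
# transcription are kept only if an LLM finds direct support for them in the
# transcript or patient context. Not Nabla's real code; names are illustrative.
from dataclasses import dataclass


@dataclass
class Fact:
    text: str
    supported: bool


def query_llm(prompt: str) -> str:
    """Placeholder for a real LLM call; here it just returns a canned answer."""
    return "yes" if "metformin" in prompt.lower() else "no"


def verify_facts(candidate_facts: list[str], transcript: str, patient_context: str) -> list[Fact]:
    """Return only the facts the LLM judges to be evidenced in the source material."""
    checked = []
    for fact in candidate_facts:
        prompt = (
            "Answer yes or no: is the following statement directly supported by "
            f"the transcript or patient context?\n\nStatement: {fact}\n\n"
            f"Transcript: {transcript}\n\nPatient context: {patient_context}"
        )
        supported = query_llm(prompt).strip().lower().startswith("yes")
        checked.append(Fact(fact, supported))
    # Unsupported facts are dropped rather than written into the draft note.
    return [f for f in checked if f.supported]
```

Under a scheme like this, a hallucinated phrase that never appears in the audio would fail the evidence check and never reach the patient record, which is the safeguard the blog post describes.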
They also say that it has processed "9 million medical encounters" and that "while some transcription errors were sometimes reported, hallucination has never been reported as a major issue."
Update, October 28th: Added blog post from Nabla.
Update, October 29th: Clarified that the Cornell University et al. study was peer-reviewed.
Correction, October 29th: A previous version of this story cited ABC News. The story cited was published by The Associated Press, not ABC News.