ChatGPT recently passed the U.S. Medical Licensing Exam, but using it for a real-world medical diagnosis would quickly turn deadly.
Update: Josh’s replies to key reader comments have been added at the end of this post.
With news that ChatGPT successfully “passed” the U.S. Medical Licensing Exam, I was curious how it would perform in a real-world medical situation. As an advocate of leveraging artificial intelligence to improve the quality and efficiency of healthcare, I wanted to see how the current version of ChatGPT might serve as a tool in my own practice.
So after my regular clinical shifts in the emergency department the other week, I anonymized my History of Present Illness notes for 35 to 40 patients — basically, my detailed medical narrative of each person’s medical history and the symptoms that brought them to the emergency department — and fed them into ChatGPT.
The exact prompt I used was, “What are the differential diagnoses for this patient presenting to the emergency department [insert patient HPI notes here]?”
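For readers who want to see what a scripted version of this kind of experiment could look like, here is a minimal sketch using the OpenAI Python client. The model name, the file of anonymized notes, and the helper function are illustrative assumptions, not a description of how I actually ran my notes.

```python
# Illustrative sketch only: assumes the OpenAI Python client (>= 1.0), an API
# key in the OPENAI_API_KEY environment variable, and a plain-text file of
# anonymized HPI notes separated by blank lines (all assumptions).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "What are the differential diagnoses for this patient presenting "
    "to the emergency department {hpi}?"
)

def differentials_for(hpi_text: str) -> str:
    """Ask the model for a differential diagnosis for one anonymized HPI."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder; any chat model could be used
        messages=[{"role": "user", "content": PROMPT.format(hpi=hpi_text)}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    with open("anonymized_hpi_notes.txt") as f:
        notes = [n.strip() for n in f.read().split("\n\n") if n.strip()]
    for i, note in enumerate(notes, start=1):
        print(f"--- Patient {i} ---")
        print(differentials_for(note))
```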
The results were fascinating, but also fairly disturbing.
OpenAI’s chatbot did a decent job of bringing up common diagnoses I wouldn’t want to miss — as long as everything I told it was precise and highly detailed. Accurately diagnosing a patient as having nursemaid’s elbow, for instance, required about 200 words; identifying another patient’s orbital wall blowout fracture took the entire 600 words of my HPI on them.
For roughly half of my patients, ChatGPT suggested six possible diagnoses, and the “right” diagnosis — or at least the diagnosis that I believed to be right after complete evaluation and testing — was among the six that ChatGPT suggested.
Not bad. Then again, a 50% success rate in the context of an emergency room is also not good.
ChatGPT’s worst performance happened with a 21-year-old female patient who came into the ER with right lower quadrant abdominal pain. I fed her HPI into ChatGPT, which immediately came back with a differential diagnosis of appendicitis or an ovarian cyst, among other possibilities.
But ChatGPT missed a rather important diagnosis with this woman.
She had an ectopic pregnancy, in which a malformed fetus develops in a woman’s fallopian tube, and not her uterus. Diagnosed too late, it can be fatal — resulting in death caused by internal bleeding. Fortunately for my patient, we were able to rush her into the operating room for immediate treatment.
Notably, when she saw me in the emergency room, this patient didn’t even know she was pregnant. This is not an atypical scenario, and it sometimes only emerges after some gentle inquiry:
“Any likelihood you’re pregnant?”
Sometimes a patient will reply with something like “I can’t be.”
“But how do you know?”
If the response to that follow-up doesn’t refer to an IUD or a specific medical condition, it’s more likely the patient is actually saying they don’t want to be pregnant for any number of reasons. (Infidelity, trouble with the family, or other external factors.) Again, this is not an unusual scenario; about 8% of pregnancies discovered in the ER are in women who report that they are not sexually active.
But looking through ChatGPT’s diagnosis, I noticed that not a single thing in its response suggested my patient was pregnant. It didn’t even know to ask.
My fear is that countless people are already using ChatGPT to medically diagnose themselves rather than see a physician. If my patient in this case had done that, ChatGPT’s response could have killed her.
ChatGPT also misdiagnosed several other patients who had life-threatening conditions. It correctly suggested one of them had a brain tumor — but missed two others who also had tumors. It diagnosed another patient with torso pain as having a kidney stone — but missed that the patient actually had an aortic rupture. (And subsequently died on our operating table.)
In short, ChatGPT worked pretty well as a diagnostic tool when I fed it perfect information and the patient had a classic presentation.
This is likely why ChatGPT “passed” the case vignettes in the Medical Licensing Exam. Not because it’s “smart,” but because the classic cases in the exam have a deterministic answer that already exists in its database. ChatGPT rapidly presents answers in a natural-language format (that’s the genuinely impressive part), but underneath that is a knowledge retrieval process similar to Google Search. And most actual patient cases are not classic.
My experiment illustrated how the vast majority of any medical encounter is figuring out the correct patient narrative. If someone comes into my ER saying their wrist hurts, but not due to any recent accident, it could be a psychosomatic reaction after the patient’s grandson fell down, or it could be due to a sexually transmitted disease, or something else entirely. The art of medicine is extracting all the necessary information required to create the right narrative.
Might ChatGPT still work as a physician’s assistant, automatically reading my patient notes during treatment and suggesting differentials? Possibly. But my fear is that this could introduce even worse outcomes.
If my patient notes don’t include a question I haven’t yet asked, ChatGPT’s output will encourage me to keep missing that question. Like with my young female patient who didn’t know she was pregnant. If a possible ectopic pregnancy had not immediately occurred to me, ChatGPT would have kept enforcing that omission, only reflecting back to me the things I thought were obvious — enthusiastically validating my bias like the world’s most dangerous yes-man.
None of this means AI has no potentially useful place in medicine, because it does.
As a human physician, I’m limited by how many patients I can personally treat. I expect to see roughly 10,000 patients in my lifetime, each of them with a unique body mass, blood pressure, family history, and so on — an enormous variety of features that I track in my mental model. Each human has countless variables relevant to their health, but as a human doctor working with a limited session window, I focus on the few factors that historically tend to be the most important.
So for instance, if I review a patient’s blood test and see high levels of hemoglobin A1C, then I diagnose them as likely having the early stages of diabetes. But what if I could keep track of the countless variables about the person’s health and compare them with other people who were similar across all of those millions of variables, not just based on their hemoglobin A1C? Perhaps then I could recognize that the other 100,000 patients who looked most like this patient in front of me across that wide range of factors had a good outcome when they started to eat more broccoli.
This is the space where AI can thrive, tirelessly processing these countless features of every patient I’ve ever treated, and every other patient treated by every other physician, giving us deep, vast insights. AI can help do this eventually, but it will first need to ingest millions of patient data sets that include those many features, the things the patients did (like take a specific medication), and the outcome.
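To make that “find the most similar patients and look at their outcomes” idea a little more concrete, here is a toy sketch of a nearest-neighbor match over synthetic patient features. The feature set, the numbers, and the use of scikit-learn are all assumptions for illustration, not a real clinical system.

```python
# Toy illustration only: a nearest-neighbor search over made-up patient
# feature vectors, standing in for the "compare across many variables" idea.
# Feature names, values, and library choice (scikit-learn) are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Each row is one historical patient: [age, BMI, systolic BP, hemoglobin A1C]
historical_patients = np.array([
    [34, 24.1, 118, 5.2],
    [61, 31.7, 142, 7.9],
    [58, 29.3, 138, 7.4],
    [45, 27.0, 125, 6.1],
])
# Synthetic outcome for each historical patient (1 = good outcome after the
# intervention, 0 = not).
outcomes = np.array([1, 0, 1, 1])

# Normalize features so no single variable dominates the distance metric.
means, stds = historical_patients.mean(axis=0), historical_patients.std(axis=0)
normalized = (historical_patients - means) / stds

# Index the historical cohort and query with the patient in front of me.
index = NearestNeighbors(n_neighbors=3).fit(normalized)
new_patient = np.array([[59, 30.1, 140, 7.6]])
_, neighbor_ids = index.kneighbors((new_patient - means) / stds)

# What fraction of the most similar patients did well with the intervention?
print("Good-outcome rate among similar patients:",
      outcomes[neighbor_ids[0]].mean())
```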
In the meantime, we urgently need a much more realistic view from Silicon Valley and the public at large of what AI can do now — and its many, often dangerous, limitations. We have to be very careful to avoid inflated expectations with programs like ChatGPT, because in the context of human health, they can literally be life-threatening.
Originally published in FastCompany
Dr. Josh Tamayo-Sarver works clinically in the emergency department of his local community and is a vice president of innovation at Inflect Health, an innovation incubator for health tech.
Thanks for the thoughtful comments.
First, I used ChatGPT 3.5, but I feel that there’s a more fundamental issue around the mechanism by which a large language model functions, and that was the bigger discovery for me rather than the degree of training or specialization.
I’m not sure the large language model approach will be the answer to the problem-solving function in a medical encounter. Large language model AI is essentially noting the association between words and has no underlying conceptual model. As such, some amazing behaviors have emerged, and I use it daily for nonclinical tasks.
Thinking about the training around word associations, it seems that an LLM is designed around knowledge retrieval and presentation of that knowledge. In the medical diagnosis use case, the first step is creating a well-articulated problem statement of what is going on with the patient, which requires a lot of problem solving, which requires a conceptual model, which an LLM doesn’t have.
However, I have seen knowledge graph-based AI systems do this incredibly well. Once that problem statement of what’s going on with the patient is well articulated, it becomes a knowledge retrieval problem, which I’d expect ChatGPT and other large language model systems to excel at.
I can imagine a future where you have different AI models, built through different techniques, working synergistically on different tasks to solve what currently appear to be very complex problems.
Just my 2 cents — although plenty of behaviors have emerged from LLMs that I would not have anticipated.