A large language model (LLM) deployed to make treatment recommendations can be tripped up by nonclinical information in patient messages, like typos, extra white space, missing gender markers, or the use of uncertain, dramatic, and informal language, according to a study by MIT researchers.
They found that making stylistic or grammatical changes to messages increases the likelihood an LLM will recommend that a patient self-manage their reported health condition rather than come in for an appointment, even when that patient should seek medical care.
Their analysis also revealed that these nonclinical variations in text, which mimic how people really communicate, are more likely to change a model’s treatment recommendations for female patients, resulting in a higher percentage of women who were erroneously advised not to seek medical care, according to human doctors.
This work “is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” says Marzyeh Ghassemi, an associate professor in the MIT Department of Electrical Engineering and Computer Science (EECS), a member of the Institute of Medical Engineering Sciences and the Laboratory for Information and Decision Systems, and senior author of the study.
These findings indicate that LLMs take nonclinical information into account for clinical decision-making in previously unknown ways. This underscores the need for more rigorous studies of LLMs before they are deployed for high-stakes applications like making treatment recommendations, the researchers say.
“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still a lot about LLMs that we don’t know,” adds Abinitha Gourabathina, an EECS graduate student and lead author of the study.
They are joined on the paper, which will be presented at the ACM Conference on Fairness, Accountability, and Transparency, by graduate student Eileen Pan and postdoc Walter Gerych.
Mixed messages
Large language models like OpenAI’s GPT-4 are being used to draft clinical notes and triage patient messages in health care facilities around the globe, in an effort to streamline some tasks and help overburdened clinicians.
A growing body of work has explored the clinical reasoning capabilities of LLMs, especially from a fairness standpoint, but few studies have evaluated how nonclinical information affects a model’s judgment.
Interested in how gender impacts LLM reasoning, Gourabathina ran experiments where she swapped the gender cues in patient notes. She was surprised that formatting errors in the prompts, like extra white space, caused meaningful changes in the LLM responses.
To explore this problem, the researchers designed a study in which they altered the model’s input data by swapping or removing gender markers, adding colorful or uncertain language, or inserting extra spaces and typos into patient messages.
Each perturbation was designed to mimic text that may be written by someone in a vulnerable patient population, based on psychosocial research into how people communicate with clinicians.
For instance, extra spaces and typos simulate the writing of patients with limited English proficiency or those with less technological aptitude, and the addition of uncertain language represents patients with health anxiety.
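As a rough illustration, perturbations like these can be approximated with a few simple text rewrites. The sketch below is a hypothetical Python approximation, not the researchers’ pipeline; the function names, hedging phrases, and pronoun swaps are assumptions for illustration only.

```python
# Hypothetical sketches of nonclinical perturbations like those described above.
# These rules are illustrative assumptions, not the researchers' actual code.
import random
import re

def add_extra_whitespace(text: str, rate: float = 0.1) -> str:
    """Randomly double the space after some words, mimicking formatting noise."""
    words = text.split()
    return " ".join(w + " " if random.random() < rate else w for w in words).strip()

def add_uncertain_language(text: str) -> str:
    """Prepend a hedging phrase, mimicking a patient writing with health anxiety."""
    hedges = ["I'm not sure, but ", "Maybe it's nothing, but ", "I could be wrong, but "]
    return random.choice(hedges) + text[0].lower() + text[1:]

def neutralize_gender_markers(text: str) -> str:
    """Swap gendered pronouns for gender-neutral ones (a crude approximation)."""
    swaps = {r"\bshe\b": "they", r"\bhe\b": "they", r"\bher\b": "their", r"\bhis\b": "their"}
    for pattern, replacement in swaps.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

# Example: a perturbed version of a short, made-up patient message.
message = "She has had sharp chest pain since yesterday and feels dizzy."
print(add_uncertain_language(add_extra_whitespace(neutralize_gender_markers(message))))
```

The clinical content of the message stays the same; only the surface form changes, which is the point of the perturbation design.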
“The medical datasets these models are trained on are usually cleaned and structured, and not a very realistic reflection of the patient population. We wanted to see how these very realistic changes in text could impact downstream use cases,” Gourabathina says.
They used an LLM to create perturbed copies of thousands of patient notes while ensuring the text changes were minimal and preserved all clinical data, such as medications and previous diagnoses. Then they evaluated four LLMs, including the large, commercial model GPT-4 and a smaller LLM built specifically for medical settings.
They prompted each LLM with three questions based on the patient note: Should the patient manage at home, should the patient come in for a clinic visit, and should a medical resource, like a lab test, be allocated to the patient?
The researchers compared the LLM recommendations to real clinical responses.
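A simplified version of that evaluation loop might look like the sketch below. It assumes `model` is any callable that maps a prompt string to a text reply, and the question wording paraphrases the three triage questions above; this is illustrative, not the study’s actual code.

```python
# Minimal evaluation sketch under stated assumptions; not the study's pipeline.
from typing import Callable

TRIAGE_QUESTIONS = [
    "Should the patient manage this condition at home?",
    "Should the patient come in for a clinic visit?",
    "Should a medical resource, such as a lab test, be allocated?",
]

def triage(model: Callable[[str], str], note: str) -> list[bool]:
    """Ask the three yes/no triage questions about one patient note."""
    answers = []
    for question in TRIAGE_QUESTIONS:
        prompt = f"Patient note:\n{note}\n\n{question} Answer yes or no."
        answers.append(model(prompt).strip().lower().startswith("yes"))
    return answers

def self_management_rate(model: Callable[[str], str], notes: list[str]) -> float:
    """Fraction of notes where the model recommends managing at home."""
    return sum(triage(model, note)[0] for note in notes) / len(notes)

# Comparing the rate on perturbed vs. original copies of the same notes shows how
# often a purely nonclinical change pushes the model toward self-management:
# shift = self_management_rate(model, perturbed_notes) - self_management_rate(model, original_notes)
```

Measuring the direction of the shift, rather than only overall accuracy, is what lets this kind of audit surface the errors the researchers describe below.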
Inconsistent recommendations
They saw inconsistencies in treatment recommendations and significant disagreement among the LLMs when they were fed perturbed data. Across the board, the LLMs exhibited a 7 to 9 percent increase in self-management suggestions for all nine types of altered patient messages.
This means LLMs were more likely to recommend that patients not seek medical care when messages contained typos or gender-neutral pronouns, for instance. The use of colorful language, like slang or dramatic expressions, had the largest impact.
They also found that models made about 7 percent more errors for female patients and were more likely to recommend that female patients self-manage at home, even when the researchers removed all gender cues from the clinical context.
Many of the worst results, like patients told to self-manage when they have a serious medical condition, likely wouldn’t be captured by tests that focus on the models’ overall clinical accuracy.
“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation. We need to look at the direction in which these errors are occurring — not recommending visitation when you should is much more harmful than doing the opposite,” Gourabathina says.
The inconsistencies caused by nonclinical language become even more pronounced in conversational settings where an LLM interacts with a patient, which is a common use case for patient-facing chatbots.
But in follow-up work, the researchers found that these same changes in patient messages don’t affect the accuracy of human clinicians.
“In our follow-up work under review, we further find that large language models are fragile to changes that human clinicians are not,” Ghassemi says. “This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care. LLMs are flexible and performant enough on average that we might think this is a good use case. But we don’t want to optimize a health care system that only works well for patients in specific groups.”
The researchers want to expand on this work by designing natural language perturbations that capture other vulnerable populations and better mimic real messages. They also want to explore how LLMs infer gender from clinical text.