Large language models help decipher clinical notes


Electronic health records (EHRs) need a new public relations manager. Ten years ago, the U.S. government passed a law that strongly encouraged the adoption of electronic health records with the intent of improving and streamlining care. The enormous amount of information in these now-digital records could be used to answer very specific questions beyond the scope of clinical trials: What’s the right dose of this medication for patients with this height and weight? What about patients with a specific genomic profile?

Unfortunately, most of the data that could answer these questions is trapped in doctors’ notes, full of jargon and abbreviations. These notes are hard for computers to understand using current techniques, and extracting information requires training multiple machine learning models. Models trained for one hospital also don’t work well at others, and training each model requires domain experts to label lots of data, a time-consuming and expensive process.

An ideal system would use a single model that can extract many types of information, work well at multiple hospitals, and learn from a small amount of labeled data. But how? Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), led by Monica Agrawal, a PhD candidate in electrical engineering and computer science, believed that to disentangle the data, they needed to call on something bigger: large language models. To pull out that important medical information, they used a very large, GPT-3 style model to do tasks like expanding overloaded jargon and acronyms and extracting medication regimens.

For example, the system takes an input, which in this case is a clinical note, and “prompts” the model with a question about the note, such as “expand this abbreviation, C-T-A.” The system returns an output such as “clear to auscultation,” as opposed to, say, a CT angiography. The objective of extracting this clean data, the team says, is to eventually enable more personalized clinical recommendations.
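The prompting step described above can be sketched in a few lines of Python. This is a minimal illustration, not the team’s actual prompt: the function name `build_expansion_prompt`, the template wording, and the sample note are all assumptions made for the example, and no model is called.

```python
def build_expansion_prompt(note: str, abbreviation: str) -> str:
    """Return a zero-shot prompt asking a model to expand one abbreviation.

    The template wording is a hypothetical example, not the paper's prompt.
    """
    # Refuse to ask about an abbreviation that is not actually in the note.
    if abbreviation.lower() not in note.lower():
        raise ValueError(f"{abbreviation!r} does not appear in the note")
    return (
        f"Clinical note: {note.strip()}\n"
        f'Expand the abbreviation "{abbreviation}" as it is used in this note.\n'
        "Expansion:"
    )


# Invented sample note; "CTA" here should mean "clear to auscultation".
note = "Lungs: CTA bilaterally, no wheezes."
prompt = build_expansion_prompt(note, "CTA")
print(prompt)
```

Including the note in the prompt is what lets a model choose between competing expansions, since the surrounding words (“Lungs,” “no wheezes”) carry the context.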

Medical data is, understandably, a pretty tricky resource to navigate freely. There’s plenty of red tape around using public resources for testing the performance of large models because of data use restrictions, so the team decided to scrape together their own. Using a set of short, publicly available clinical snippets, they cobbled together a small dataset to enable evaluation of the extraction performance of large language models.

“It’s challenging to develop a single general-purpose clinical natural language processing system that will solve everyone’s needs and be robust to the huge variation seen across health datasets. As a result, until today, most clinical notes are not used in downstream analyses or for live decision support in electronic health records. These large language model approaches could potentially transform clinical natural language processing,” says David Sontag, MIT professor of electrical engineering and computer science, principal investigator in CSAIL and the Institute for Medical Engineering and Science, and supervising author on a paper about the work, which will be presented at the Conference on Empirical Methods in Natural Language Processing. “The research team’s advances in zero-shot clinical information extraction make scaling possible. Even if you have hundreds of different use cases, no problem: you can build each model with a few minutes of work, versus having to label a ton of data for that particular task.”

For example, without any labels at all, the researchers found these models could achieve 86 percent accuracy at expanding overloaded acronyms, and the team developed additional methods to boost this further to 90 percent accuracy, still with no labels required.

Imprisoned in an EHR 

Experts have been steadily building up large language models (LLMs) for quite some time, but they burst into the mainstream with GPT-3’s widely covered ability to complete sentences. These LLMs are trained on a huge amount of text from the internet to finish sentences and predict the next most likely word.

While previous, smaller models like earlier GPT iterations or BERT have performed well at extracting medical data, they still require substantial manual data-labeling effort.

For example, a note, “pt will dc vanco due to n/v” means that this patient (pt) was taking the antibiotic vancomycin (vanco) but experienced nausea and vomiting (n/v) severe enough for the care team to discontinue (dc) the medication. The team’s research avoids the status quo of training separate machine learning models for each task (extracting medications, extracting side effects from the record, disambiguating common abbreviations, etc.). In addition to expanding abbreviations, they investigated four other tasks, including whether the models could parse clinical trials and extract detail-rich medication regimens.

“Prior work has shown that these models are sensitive to the prompt’s precise phrasing. Part of our technical contribution is a way to format the prompt so that the model gives you outputs in the correct format,” says Hunter Lang, CSAIL PhD student and author on the paper. “For these extraction problems, there are structured output spaces. The output space is not just a string. It can be a list. It can be a quote from the original input. So there’s more structure than just free text. Part of our research contribution is encouraging the model to give you an output with the correct structure. That significantly cuts down on post-processing time.”
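To see why a predictable output structure reduces post-processing, consider a minimal Python sketch. It assumes, purely for illustration, that the model has been prompted to answer as a dash-bulleted list; the function name `parse_list_output`, the bullet convention, and the sample output are hypothetical and are not taken from the paper.

```python
def parse_list_output(raw: str) -> list[str]:
    """Turn a dash-bulleted model response into a de-duplicated list.

    Lines that do not follow the assumed "- item" format are ignored,
    so stray free text never ends up in the structured result.
    """
    items: list[str] = []
    for line in raw.splitlines():
        stripped = line.strip()
        if not stripped.startswith("- "):
            continue  # off-format line: skip rather than guess
        item = stripped[2:].strip()
        if item and item not in items:
            items.append(item)
    return items


# Invented model response, including a repeat and an off-format line.
raw_output = "- vancomycin\n- metoprolol 25 mg\nNote: see chart\n- vancomycin\n"
medications = parse_list_output(raw_output)
print(medications)  # ['vancomycin', 'metoprolol 25 mg']
```

When the response reliably follows one format, the parser stays this small; with unconstrained free text, each task would need its own cleanup logic.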

The approach can’t be applied to out-of-the-box health data at a hospital: that requires sending private patient information across the open internet to an LLM provider like OpenAI. The authors showed that it’s possible to work around this by distilling the model into a smaller one that could be used on-site.

The model, sometimes just like humans, is not always beholden to the truth. Here’s what a potential problem might look like: Let’s say you’re asking the reason why someone took a medication. Without proper guardrails and checks, the model might just output the most common reason for that medication, if nothing is explicitly mentioned in the note. This led to the team’s efforts to force the model to extract more quotes from the data and less free text.
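One simple guardrail in this spirit is to accept an extracted answer only if it appears verbatim in the note. The Python sketch below illustrates that idea under stated assumptions: the helper names `normalize` and `is_grounded` and the sample note are invented for the example, and this is not presented as the team’s implementation.

```python
def normalize(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def is_grounded(quote: str, note: str) -> bool:
    """Return True only if the quote occurs verbatim in the note.

    Matching ignores case and whitespace differences; an empty quote
    is never considered grounded.
    """
    normalized_quote = normalize(quote)
    return bool(normalized_quote) and normalized_quote in normalize(note)


# Invented sample note for illustration.
note = "Pt started on metoprolol for afib."

print(is_grounded("for afib", note))      # True: quoted from the note
print(is_grounded("hypertension", note))  # False: plausible, but not in the note
```

A check like this would reject an answer such as “hypertension” that is a common reason for the drug but is never stated in the note.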

Future work for the team includes extending to languages other than English, creating additional methods for quantifying uncertainty in the model, and pulling off similar results with open-sourced models.

“Clinical information buried in unstructured clinical notes has unique challenges compared to general domain text, mostly due to large use of acronyms and inconsistent textual patterns used across different health care facilities,” says Sadid Hasan, AI lead at Microsoft and former executive director of AI at CVS Health, who was not involved in the research. “To this end, this work sets forth an interesting paradigm of leveraging the power of general domain large language models for several important zero-/few-shot clinical NLP tasks. Specifically, the proposed guided prompt design of LLMs to generate more structured outputs could lead to further developing smaller deployable models by iteratively utilizing the model-generated pseudo-labels.”

“AI has accelerated in the last five years to the point at which these large models can predict contextualized recommendations with benefits rippling out across a variety of domains, such as suggesting novel drug formulations, understanding unstructured text, recommending code, or creating works of art inspired by any number of human artists or styles,” says Parminder Bhatia, who was formerly head of machine learning at AWS Health AI and is currently head of machine learning for low-code applications leveraging large language models at AWS AI Labs.

As part of the MIT Abdul Latif Jameel Clinic for Machine Learning in Health, Agrawal, Sontag, and Lang wrote the paper alongside Yoon Kim, MIT assistant professor and CSAIL principal investigator, and Stefan Hegselmann, a visiting PhD student from the University of Muenster. First-author Agrawal’s research was supported by a Takeda Fellowship, the MIT Deshpande Center for Technological Innovation, and the MLA@CSAIL Initiatives.

