Can LangExtract Turn Messy Clinical Notes into Structured Data?


LangExtract is a library from developers at Google that makes it easy to turn messy, unstructured text into clean, structured data by leveraging LLMs. You provide a few few-shot examples together with a custom schema, and the results follow that format. It works with both proprietary and local LLMs (via Ollama).

A large amount of information in healthcare is unstructured, which makes it a great area for a tool like this. Clinical notes are long and full of abbreviations and inconsistencies. Important details such as drug names, dosages, and especially adverse drug reactions (ADRs) get buried in the text. So, for this article, I wanted to see whether LangExtract could handle adverse drug reaction (ADR) detection in clinical notes. More importantly, is it effective? Let’s find out. Note that while LangExtract is an open-source project from developers at Google, it is not an officially supported Google product.

▶️ Here is a detailed Kaggle notebook to follow along.

Why ADR Extraction Matters

An Adverse Drug Reaction (ADR) is a harmful, unintended result caused by taking a medication. These can range from mild side effects like nausea or dizziness to severe outcomes that may require medical attention.

Patient takes medicine for headache but develops stomach pain; a typical Adverse Drug Reaction (ADR) | Image created by author using ChatGPT

Detecting them quickly is critical for patient safety and pharmacovigilance. The challenge is that in clinical notes, ADRs are buried alongside past conditions, lab results, and other context, which makes them hard to detect. Using LLMs to detect ADRs is an ongoing area of research. Some recent work has shown that LLMs are good at raising red flags but are not yet reliable. That makes ADR extraction a good stress test for LangExtract: the goal here is to see whether the library can spot adverse reactions among other entities in clinical notes, such as medications, dosages, and severity.

How LangExtract Works

Before we jump into usage, let’s break down LangExtract’s workflow. It’s a simple three-step process:

  1. Define your extraction task by writing a clear prompt that specifies exactly what you want to extract.
  2. Provide a few high-quality examples to guide the model towards the format and level of detail you expect.
  3. Submit your input text, choose the model, and let LangExtract process it. You can then review the results, visualize them, or pass them directly into a downstream pipeline.

Installation

First, we need to install the LangExtract library. It’s always a good idea to do this inside a virtual environment to keep your project dependencies isolated.

pip install langextract

Identifying Adverse Drug Reactions in Clinical Notes with LangExtract & Gemini

Now let’s get to our use case. For this walkthrough, I’ll use Google’s Gemini 2.5 Flash model. You can also use Gemini Pro for more complex reasoning tasks. You’ll first need to set your API key:

export LANGEXTRACT_API_KEY="your-api-key-here"
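
Before calling the model from Python, it can help to fail early if the key is missing. The sketch below shows one way to do that. `require_api_key` is a hypothetical helper for this walkthrough, not part of LangExtract, and it assumes only the `LANGEXTRACT_API_KEY` variable name used above.

```python
import os


def require_api_key(env=None, name="LANGEXTRACT_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for this walkthrough; not part of LangExtract.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in your shell before running the code."
        )
    return key


# Example with an explicit mapping, so the snippet runs without a real key.
print(require_api_key({"LANGEXTRACT_API_KEY": "your-api-key-here"}))
```

In a real session you would call `require_api_key()` with no arguments so that it reads from `os.environ`.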

Step 1: Define the Extraction Task

Let’s create our prompt for extracting medications, dosages, adverse reactions, and actions taken. We can also ask for severity where mentioned.

import textwrap

prompt = textwrap.dedent("""
Extract medication, dosage, adverse reaction, and action taken from the text.
For each adverse reaction, include its severity as an attribute if mentioned.
Use exact text spans from the original text. Do not paraphrase.
Return entities in the order they appear.""")

The note highlights ibuprofen (400 mg), the adverse reaction (mild stomach pain), and the action taken (stopping the medication). This is what ADR extraction looks like in practice. | Image by Author

Next, let’s provide an example to guide the model towards the correct format:

import textwrap

import langextract as lx

# 1) Define the prompt (this version also extracts the condition)
prompt = textwrap.dedent("""
Extract condition, medication, dosage, adverse reaction, and action taken from the text.
For each adverse reaction, include its severity as an attribute if mentioned.
Use exact text spans from the original text. Do not paraphrase.
Return entities in the order they appear.""")

# 2) Example 
examples = [
    lx.data.ExampleData(
        text=(
            "After taking ibuprofen 400 mg for a headache, "
            "the patient developed mild stomach pain. "
            "They stopped taking the medicine."
        ),
        extractions=[
            
            lx.data.Extraction(
                extraction_class="condition",
                extraction_text="headache"
            ),
        
            lx.data.Extraction(
                extraction_class="medication",
                extraction_text="ibuprofen"
            ),
            lx.data.Extraction(
                extraction_class="dosage",
                extraction_text="400 mg"
            ),
            lx.data.Extraction(
                extraction_class="adverse_reaction",
                extraction_text="mild stomach pain",
                attributes={"severity": "mild"}
            ),
            lx.data.Extraction(
                extraction_class="action_taken",
                extraction_text="They stopped taking the medicine"
            )
        ]
    )
]
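
Since the prompt asks for exact text spans, it is worth checking that every span in your few-shot example really occurs verbatim in the example text. The sketch below does this with plain strings copied from the example above. `find_unaligned_spans` is a hypothetical helper, not a LangExtract function.

```python
def find_unaligned_spans(text, spans):
    """Return the spans that do not occur verbatim in text."""
    return [span for span in spans if span not in text]


# Text and spans copied from the few-shot example above.
example_text = (
    "After taking ibuprofen 400 mg for a headache, "
    "the patient developed mild stomach pain. "
    "They stopped taking the medicine."
)
example_spans = [
    "headache",
    "ibuprofen",
    "400 mg",
    "mild stomach pain",
    "They stopped taking the medicine",
]

# An empty list means every span can be located in the text.
print(find_unaligned_spans(example_text, example_spans))  # prints []
```

A paraphrased span such as "stomach ache" would show up in the returned list, which is a quick signal that the example needs fixing before it is sent to the model.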

Step 2: Provide the Input and Run the Extraction

For the input, I’m using a real clinical sentence from the ADE Corpus v2 dataset on Hugging Face.

input_text = (
    "A 27-year-old man who had a history of bronchial asthma, "
    "eosinophilic enteritis, and eosinophilic pneumonia presented with "
    "fever, skin eruptions, cervical lymphadenopathy, hepatosplenomegaly, "
    "atypical lymphocytosis, and eosinophilia two weeks after receiving "
    "trimethoprim (TMP)-sulfamethoxazole (SMX) treatment."
)

Next, let’s run LangExtract with the Gemini-2.5-Flash model.

import os

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    # Read the key exported earlier instead of using an undefined variable
    api_key=os.environ["LANGEXTRACT_API_KEY"],
)

Step 3: View the Results

You can display the extracted entities along with their positions:

print(f"Input: {input_text}\n")
print("Extracted entities:")
for entity in result.extractions:
    position_info = ""
    if entity.char_interval:
        start, end = entity.char_interval.start_pos, entity.char_interval.end_pos
        position_info = f" (pos: {start}-{end})"
    print(f"• {entity.extraction_class.capitalize()}: {entity.extraction_text}{position_info}")

LangExtract correctly identifies the adverse drug reaction without confusing it with the patient’s pre-existing conditions, which is a key challenge in this kind of task.
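
If you want to hand the entities to a downstream pipeline rather than print them, one option is to flatten them into plain dictionaries. The sketch below relies only on the attributes already used in the loop above (`extraction_class`, `extraction_text`, `char_interval`) plus `attributes`. `to_rows` is a hypothetical helper, and the stand-in objects reuse the few-shot example text, so they are illustrative rather than real model output.

```python
from types import SimpleNamespace


def to_rows(extractions):
    """Flatten extraction objects into a list of plain dictionaries."""
    rows = []
    for entity in extractions:
        interval = getattr(entity, "char_interval", None)
        rows.append({
            "class": entity.extraction_class,
            "text": entity.extraction_text,
            "start": interval.start_pos if interval else None,
            "end": interval.end_pos if interval else None,
            "attributes": dict(getattr(entity, "attributes", None) or {}),
        })
    return rows


# Stand-ins that mimic the attributes used above (not real model output).
example_text = (
    "After taking ibuprofen 400 mg for a headache, "
    "the patient developed mild stomach pain."
)


def fake_entity(cls, span, attributes=None):
    start = example_text.find(span)
    interval = SimpleNamespace(start_pos=start, end_pos=start + len(span))
    return SimpleNamespace(
        extraction_class=cls,
        extraction_text=span,
        char_interval=interval,
        attributes=attributes,
    )


demo = [
    fake_entity("medication", "ibuprofen"),
    fake_entity("adverse_reaction", "mild stomach pain", {"severity": "mild"}),
]

for row in to_rows(demo):
    print(row)
```

With a real run you would call `to_rows(result.extractions)` and pass the rows to whatever storage or dataframe library your pipeline uses.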

To visualize the results, first save them to a .jsonl file. You can then pass that .jsonl file to the visualization function, which generates the HTML for you.

lx.io.save_annotated_documents(
    [result],
    output_name="adr_extraction.jsonl",
    output_dir="."
)

html_content = lx.visualize("adr_extraction.jsonl")

# Display the HTML content directly (works inside a Jupyter or Kaggle notebook)
display(html_content)
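
Outside a notebook there is no `display`, so you may prefer to write the visualization to an HTML file. The sketch below assumes that `lx.visualize` returns either a plain string or a notebook HTML object exposing the markup in a `.data` attribute, so please verify this against your installed version. `save_html` and the output file name are hypothetical.

```python
import tempfile
from pathlib import Path


def save_html(html_content, path="adr_visualization.html"):
    """Write visualization output to disk.

    Accepts a plain string or an object that carries the markup in `.data`
    (an assumption about what lx.visualize returns in notebooks).
    """
    html = html_content.data if hasattr(html_content, "data") else html_content
    target = Path(path)
    target.write_text(html, encoding="utf-8")
    return target


# Demo with a placeholder string, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    saved = save_html("<p>placeholder</p>", Path(tmp_dir) / "demo.html")
    print(saved.name, saved.read_text(encoding="utf-8"))
```

In the walkthrough you would call `save_html(html_content)` and open the resulting file in a browser.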

Working with longer clinical notes

Real clinical notes are often much longer than the example shown above. For example, here is a real note from the dataset released under the MIT License. You can access it on Hugging Face or Zenodo.

Excerpt from a clinical note from the dataset released under the MIT License | Image by the author

To process longer texts with LangExtract, you keep the same workflow but add three parameters:

extraction_passes runs multiple passes over the text to catch more details and improve recall.

max_workers controls parallel processing so that larger documents can be handled faster.

max_char_buffer splits the text into smaller chunks, which helps the model stay accurate even when the input is very long.

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-flash",
    extraction_passes=3,    
    max_workers=20,         
    max_char_buffer=1000   
)
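
To build some intuition for what a character budget like `max_char_buffer` means, here is a simplified illustration that greedily packs whole sentences into chunks. This is not LangExtract’s actual chunking logic, which may differ. `split_into_chunks` is a hypothetical function, and the short note is made up for the demo.

```python
import re


def split_into_chunks(text, max_char_buffer=1000):
    """Greedily pack whole sentences into chunks of at most max_char_buffer chars.

    Illustration only: LangExtract's own chunking may work differently.
    A single sentence longer than the limit is kept whole in its own chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_char_buffer:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Made-up note, used only to show the effect of the character budget.
note = "Patient started drug A. Rash appeared on day 3. Drug A was stopped."
for limit in (30, 50):
    print(limit, split_into_chunks(note, max_char_buffer=limit))
```

A smaller budget produces more, shorter chunks, and therefore more model calls. That is the trade-off to keep in mind when tuning this value together with `extraction_passes` and `max_workers`.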

Here is the output. For brevity, I’m only showing a portion of it.


If you want, you can also pass a document’s URL directly to the text_or_documents parameter.


Using LangExtract with Local models via Ollama

LangExtract isn’t limited to proprietary APIs. You can also run it with local models through Ollama. This is especially useful when working with privacy-sensitive clinical data that can’t leave your secure environment. You can set up Ollama locally, pull your chosen model, and point LangExtract to it. Full instructions are available in the official docs.
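
As a rough sketch, a local run might look like the following. The parameter names (`model_url`, `fence_output`, `use_schema_constraints`) and the `gemma2:2b` model tag are assumptions based on the project README and may change between versions, so check the official docs before relying on them. `ollama_extract_kwargs` and `extract_locally` are hypothetical helper names.

```python
OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def ollama_extract_kwargs(model_id="gemma2:2b", model_url=OLLAMA_URL):
    """Build the extra keyword arguments for a local Ollama run.

    Parameter names are assumptions taken from the LangExtract README.
    """
    return {
        "model_id": model_id,
        "model_url": model_url,
        "fence_output": False,
        "use_schema_constraints": False,
    }


def extract_locally(input_text, prompt, examples, **overrides):
    """Run lx.extract against a local Ollama server (needs Ollama running)."""
    import langextract as lx  # imported lazily so the helpers load without it

    kwargs = {**ollama_extract_kwargs(), **overrides}
    return lx.extract(
        text_or_documents=input_text,
        prompt_description=prompt,
        examples=examples,
        **kwargs,
    )


print(ollama_extract_kwargs())
```

With Ollama running and the model pulled, `extract_locally(input_text, prompt, examples)` would reuse the same prompt and examples as the Gemini run, and the clinical text would stay on your machine.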

Conclusion

If you’re building an information retrieval system or any application involving metadata extraction, LangExtract can save you a significant amount of preprocessing effort. In my ADR experiments, LangExtract performed well, correctly identifying medications, dosages, and reactions. What I noticed is that the output depends directly on the quality of the few-shot examples provided by the user, which means that while LLMs do the heavy lifting, humans remain an important part of the loop. The results were encouraging, but since clinical data is high-risk, broader and more rigorous testing across diverse datasets is still needed before moving toward production use.
