Deploying a PICO Extractor in Five Steps


The rise of large language models has made many Natural Language Processing (NLP) tasks appear effortless. Tools like ChatGPT sometimes generate strikingly good responses, leading even seasoned professionals to wonder if some jobs might be handed over to algorithms sooner rather than later. Yet, as impressive as these models are, they still struggle with tasks requiring precise, domain-specific extraction.

Motivation: Why Build a PICO Extractor?

The idea arose during a conversation with a student graduating in International Healthcare Management, who set out to research future trends in Parkinson’s treatment and to estimate the potential costs awaiting insurers if current trials turn into a successful product. The first step was classic and laborious: isolate PICO elements—Population, Intervention, Comparator, and Outcome descriptions—from running trial descriptions published on clinicaltrials.gov. The PICO framework is commonly used in evidence-based medicine to structure clinical trial data. Since she was neither a coder nor an NLP specialist, she did this entirely by hand, working with spreadsheets. It became clear to me that, even in the LLM era, there is real demand for straightforward, reliable tools for biomedical information extraction.

Step 1: Understanding the Data and Setting Goals

As in every data project, the first order of business is setting clear goals and identifying who will use the results. Here, the goal was to extract PICO elements for downstream predictive analyses or meta-research. The audience: anyone interested in systematically analyzing clinical trial data, be it researchers, clinicians, or data scientists. With this scope in mind, I began with exports from clinicaltrials.gov in JSON format. Initial field extraction and data cleansing provided some structured information (Table 1)—especially for interventions—but other key fields were still unmanageably verbose for downstream automated analyses. This is where NLP shines: it lets us distill crucial details from unstructured text such as eligibility criteria or tested drugs. Named Entity Recognition (NER) enables automated detection and classification of key entities—for instance, identifying the population group described in an eligibility section, or pinpointing outcome measures within a study summary. Thus, the project naturally transitioned from basic preprocessing to the implementation of domain-adapted NER models.
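For readers who want to follow along, here is a minimal sketch of that first extraction step. It assumes a JSON export following the current clinicaltrials.gov API schema; the file name and exact keys may differ depending on how the data was downloaded.

```python
import json

# Load a clinicaltrials.gov JSON export (a list of study records).
# Key names below follow the v2 API schema; adjust them if your
# download is structured differently.
with open("ctg-studies.json", encoding="utf-8") as f:
    studies = json.load(f)

records = []
for study in studies:
    protocol = study.get("protocolSection", {})
    ident = protocol.get("identificationModule", {})
    records.append({
        "nct_id": ident.get("nctId"),
        "title": ident.get("briefTitle"),
        # Free-text fields like eligibility are what the NER step targets.
        "eligibility": protocol.get("eligibilityModule", {}).get("eligibilityCriteria"),
        "interventions": [
            i.get("name")
            for i in protocol.get("armsInterventionsModule", {}).get("interventions", [])
        ],
        "outcomes": [
            o.get("measure")
            for o in protocol.get("outcomesModule", {}).get("primaryOutcomes", [])
        ],
    })
```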

Table 1: Key elements from clinicaltrials.gov records on two Alzheimer’s studies, extracted from data downloaded from the site. (image by author)

Step 2: Benchmarking Existing Models

My next step was a survey of off-the-shelf NER models, especially those trained on biomedical literature and available via Hugging Face, the central repository for transformer models. Out of 19 candidates, only BioELECTRA-PICO (110 million parameters) [1] worked directly for extracting PICO elements; the others are trained on NER tasks, but not specifically on PICO recognition. Testing BioELECTRA on my own “gold-standard” set of 20 manually annotated trials showed acceptable but far from ideal performance, with particular weakness on the “Comparator” element. This was likely because comparators are rarely described in the trial summaries, forcing a return to a practical rule-based approach: searching the intervention text directly for common comparator keywords such as “placebo” or “usual care.”
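Benchmarking an off-the-shelf model boils down to a few lines with the Hugging Face pipeline API. A minimal sketch, where the checkpoint ID is a placeholder for whichever BioELECTRA-PICO checkpoint you find on the Hub:

```python
from transformers import pipeline

# Token-classification pipeline; the model ID below is illustrative --
# substitute the actual BioELECTRA-PICO checkpoint from the Hub.
ner = pipeline(
    "token-classification",
    model="kamalkraj/BioELECTRA-PICO",  # assumed checkpoint name
    aggregation_strategy="simple",      # merge sub-word tokens into entity spans
)

text = (
    "A randomized trial of donepezil versus placebo in adults "
    "with mild-to-moderate Alzheimer's disease."
)
for entity in ner(text):
    print(entity["entity_group"], "|", entity["word"], "|", round(entity["score"], 2))
```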

Step 3: Fine-Tuning with Domain-Specific Data

To further improve performance, I moved to fine-tuning, made possible thanks to annotated PICO datasets from BIDS-Xu-Lab, including Alzheimer’s-specific samples [2]. To balance the need for high accuracy against efficiency and scalability, I selected three models for experimentation. BioBERT-v1.1, with 110 million parameters [3], served as the primary model due to its strong track record in biomedical NLP tasks. I also included two smaller, derived models to optimize for speed and memory usage: CompactBioBERT, at 65 million parameters, is a distilled version of BioBERT-v1.1; BioMobileBERT, at just 25 million parameters, is a further compressed variant that underwent an additional round of continual learning after compression [4]. I fine-tuned all three models using Google Colab GPUs, which allowed for efficient training—each model was ready for testing in under two hours.
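A condensed sketch of the fine-tuning setup is below. The label set, data files, and hyperparameters are illustrative assumptions, not the exact configuration used; the pattern is the standard Hugging Face token-classification recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

# Assumed PICO tag set in BIO format (hypothetical label names).
labels = ["O", "B-POP", "I-POP", "B-INT", "I-INT", "B-OUT", "I-OUT"]

# Swap in "nlpie/compact-biobert" or "nlpie/bio-mobilebert" for the smaller variants.
model_name = "dmis-lab/biobert-v1.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Assumed: annotations converted to JSON files with "tokens" (word lists)
# and "ner_tags" (integer label indices) columns, e.g. from [2].
dataset = load_dataset("json", data_files={"train": "pico_train.json",
                                           "validation": "pico_dev.json"})

def tokenize_and_align(example):
    """Tokenize pre-split words and align word-level tags to sub-word tokens."""
    enc = tokenizer(example["tokens"], truncation=True, is_split_into_words=True)
    # -100 marks special/sub-word tokens so the loss ignores them.
    enc["labels"] = [-100 if w is None else example["ner_tags"][w]
                     for w in enc.word_ids()]
    return enc

tokenized = dataset.map(tokenize_and_align)

trainer = Trainer(
    model=model,
    args=TrainingArguments("pico-ner", num_train_epochs=3, learning_rate=2e-5),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```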

Step 4: Evaluation and Insights

The results, summarized in Table 2, reveal clear trends. All variants performed strongly on extracting Population, with BioMobileBERT leading at F1 = 0.91. Outcome extraction was near ceiling across all models. However, extracting Interventions proved more difficult. Although recall was quite high (0.83–0.87), precision lagged (0.54–0.61), with models frequently tagging extra medication mentions present in the free text—often because trial descriptions refer to drugs or “intervention-like” keywords describing the background rather than the planned main intervention.
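Entity-level scores like these are typically computed with a library such as seqeval, which evaluates complete spans rather than individual tokens. A toy example with an assumed tag set:

```python
from seqeval.metrics import classification_report

# Gold vs. predicted BIO tags per document (toy data; in practice these
# come from the annotated test set and the fine-tuned model).
y_true = [["B-INT", "I-INT", "O", "B-POP", "I-POP", "I-POP"]]
y_pred = [["B-INT", "I-INT", "O", "B-POP", "I-POP", "O"]]

# seqeval scores whole entity spans, so the clipped Population span
# counts as an error even though most of its tokens were tagged correctly.
print(classification_report(y_true, y_pred))
```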

On closer inspection, this highlights the complexity of biomedical NER. Interventions occasionally appeared as short, fragmented strings like “use of whole,” “week,” “top,” or “tissues with,” which are of little value to a researcher attempting to make sense of a compiled list of studies. Similarly, examining the population yielded rather sobering examples such as “percent of” or “states with,” pointing to the need for additional cleanup and pipeline optimization. At the same time, the models could extract impressively detailed population descriptors, like “qualifying adults with a diagnosis of cognitively unimpaired, or probable Alzheimer’s disease, frontotemporal dementia, or dementia with Lewy bodies”. While such long strings may be correct, they tend to be too verbose for practical summarization because each trial’s participant description is so specific, often requiring some form of abstraction or standardization.

This underscores a classic challenge in biomedical NLP: context matters, and domain-specific text often resists purely generic extraction methods. For Comparator elements, a rule-based approach (matching explicit comparator keywords) worked best, reminding us that combining statistical learning with pragmatic heuristics is often the most viable strategy in real-world applications.
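A comparator heuristic of this kind can be as simple as the following sketch; the keyword list is illustrative and would be extended with domain knowledge.

```python
import re

# Illustrative comparator vocabulary; extend as needed.
COMPARATOR_TERMS = ["placebo", "usual care", "standard of care", "sham", "no treatment"]

def find_comparators(intervention_text: str) -> list[str]:
    """Return comparator keywords mentioned in an intervention description."""
    text = intervention_text.lower()
    return [term for term in COMPARATOR_TERMS
            if re.search(r"\b" + re.escape(term) + r"\b", text)]

print(find_comparators("Drug: Lecanemab; Other: Placebo"))  # -> ['placebo']
```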

One major source of these “mischief” extractions stems from how trials are described in broader context sections. Moving forward, possible improvements include adding a post-processing filter to discard short or ambiguous snippets, incorporating a domain-specific controlled vocabulary (so only recognized intervention terms are kept), or applying concept linking to known ontologies. These steps could help ensure that the pipeline produces cleaner, more standardized outputs.
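As a first pass, a post-processing filter along these lines (thresholds and stop words are assumptions) would already catch fragments like “percent of”; a controlled vocabulary would be needed to catch the rest.

```python
# Drop extracted snippets that are too short, purely numeric, or that
# end mid-phrase. Thresholds and stop words are illustrative choices.
STOP_ENDINGS = ("of", "with", "the", "and", "or")

def keep_entity(span: str, min_chars: int = 4) -> bool:
    span = span.strip()
    if len(span) < min_chars or span.isnumeric():
        return False
    if span.split()[-1].lower() in STOP_ENDINGS:  # fragments like "percent of"
        return False
    return True

entities = ["use of whole", "percent of", "donepezil 10 mg", "placebo"]
print([e for e in entities if keep_entity(e)])
# -> ['use of whole', 'donepezil 10 mg', 'placebo']  (heuristics are imperfect)
```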

Table 2: F1 for extraction of PICO elements, % of documents with all PICO elements partially correct, and process duration. (image by author)

A word on performance: for any end-user tool, speed matters as much as accuracy. BioMobileBERT’s compact size translated to faster inference, making it my preferred model, especially since it performed best on the Population, Comparator, and Outcome elements.

Step 5: Making the Tool Usable—Deployment

Technical solutions are only as valuable as they are accessible. I wrapped the final pipeline in a Streamlit app, allowing users to upload clinicaltrials.gov datasets, switch between models, extract PICO elements, and download results. Quick summary plots provide an at-a-glance view of top interventions and outcomes (see Figure 1). I deliberately left in the underperforming BioELECTRA model so users can compare processing times and appreciate the efficiency gains from a smaller architecture. Although the tool came too late to spare my student hours of manual data extraction, I hope it will benefit others facing similar tasks.
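The app shell itself is only a few lines of Streamlit. The sketch below is illustrative, with `extract_pico` standing in as a hypothetical entry point to the pipeline (assumed to return a pandas DataFrame); the actual app lives in the repository [5].

```python
import json

import streamlit as st

st.title("PICO Extractor")

uploaded = st.file_uploader("Upload a clinicaltrials.gov JSON export", type="json")
model_choice = st.selectbox(
    "Model", ["BioMobileBERT (fast)", "BioBERT-v1.1", "BioELECTRA (baseline)"]
)

if uploaded is not None:
    studies = json.load(uploaded)
    # Hypothetical pipeline entry point returning a DataFrame of PICO elements.
    results = extract_pico(studies, model_choice)
    st.dataframe(results)
    st.download_button(
        "Download results as CSV",
        results.to_csv(index=False).encode("utf-8"),
        file_name="pico_results.csv",
    )
```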

To make deployment straightforward, I containerized the app with Docker, so collaborators can get up and running quickly. I have also invested substantial effort in the GitHub repo [5], providing thorough documentation to encourage further contributions or adaptation to new domains.

Lessons Learned

This project showcases the full journey of developing a real-world extraction pipeline—from setting clear objectives and benchmarking existing models, to fine-tuning them on specialized data and deploying a user-friendly application. Although models and data were available for fine-tuning, turning them into a truly useful tool proved more difficult than expected. Dealing with intricate, multi-word biomedical entities that were often only partially recognized highlighted the limits of one-size-fits-all solutions. The lack of abstraction in the extracted text also became an obstacle for anyone aiming to identify global trends. Moving forward, more focused approaches and pipeline optimizations are needed rather than relying on a simple prêt-à-porter solution.

Figure 1. Sample output from the Streamlit app running BioMobileBERT and BioELECTRA for PICO extraction (image by author).

If you’re interested in extending this work, or in adapting the approach to other biomedical tasks, I invite you to explore the repository [5] and contribute. Just fork the project, and happy coding!

References

  • [1] S. Alrowili and V. Shanker, “BioM-Transformers: Building Large Biomedical Language Models with BERT, ALBERT and ELECTRA,” in Proceedings of the 20th Workshop on Biomedical Language Processing, D. Demner-Fushman, K. B. Cohen, S. Ananiadou, and J. Tsujii, Eds., Online: Association for Computational Linguistics, June 2021, pp. 221–227. doi: 10.18653/v1/2021.bionlp-1.24.
  • [2] BIDS-Xu-Lab, section_specific_annotation_of_PICO (Jupyter Notebook), Clinical NLP Lab, Aug. 23, 2025. Accessed: Sept. 13, 2025. [Online]. Available: https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO
  • [3] J. Lee et al., “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics, vol. 36, no. 4, pp. 1234–1240, Feb. 2020, doi: 10.1093/bioinformatics/btz682.
  • [4] O. Rohanian, M. Nouriborji, S. Kouchaki, and D. A. Clifton, “On the effectiveness of compact biomedical transformers,” Bioinformatics, vol. 39, no. 3, p. btad103, Mar. 2023, doi: 10.1093/bioinformatics/btad103.
  • [5] ElenJ, biomed-extractor (Jupyter Notebook), Sept. 13, 2025. Accessed: Sept. 13, 2025. [Online]. Available: https://github.com/ElenJ/biomed-extractor