Over the years, Large Language Models (LLMs) have emerged as a groundbreaking technology with immense potential to revolutionize various facets of healthcare. Models such as GPT-3, GPT-4, and Med-PaLM 2 have demonstrated remarkable capabilities in understanding and generating human-like text, making them valuable tools for tackling complex medical tasks and improving patient care. They have notably shown promise in several medical applications, such as medical question-answering (QA), dialogue systems, and text generation. Moreover, with the exponential growth of electronic health records (EHRs), medical literature, and patient-generated data, LLMs could help healthcare professionals extract valuable insights and make informed decisions.
However, despite the immense potential of LLMs in healthcare, there are significant and specific challenges that need to be addressed.
When models are used for recreational conversational purposes, errors have few repercussions; this is not the case in the medical domain, however, where incorrect explanations and answers can have severe consequences for patient care and outcomes. The accuracy and reliability of the information provided by language models can be a matter of life or death, as it could potentially affect healthcare decisions, diagnoses, and treatment plans.
For instance, when given a medical query (see the example below), GPT-3 incorrectly recommended tetracycline for a pregnant patient, despite correctly explaining its contraindication due to potential harm to the fetus. Acting on this incorrect recommendation could lead to bone growth problems in the newborn.
To fully utilize the power of LLMs in healthcare, it is crucial to develop and benchmark models using a setup specifically designed for the medical domain, one that takes into account the unique characteristics and requirements of healthcare data and applications. The development of methods to evaluate medical LLMs is not just of academic interest but of practical importance, given the real-life risks they pose in the healthcare sector.
The Open Medical-LLM Leaderboard aims to address these challenges and limitations by providing a standardized platform for evaluating and comparing the performance of various large language models on a diverse range of medical tasks and datasets. By offering a comprehensive assessment of each model's medical knowledge and question-answering capabilities, the leaderboard aims to foster the development of more effective and reliable medical LLMs.
This platform enables researchers and practitioners to identify the strengths and weaknesses of different approaches, drive further advancements in the field, and ultimately contribute to better patient care and outcomes.
Datasets, Tasks, and Evaluation Setup
The Medical-LLM Leaderboard includes a variety of tasks and uses accuracy as its primary evaluation metric (accuracy measures the percentage of correct answers provided by a language model across the various medical QA datasets).
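To make the metric concrete, here is a minimal sketch of multiple-choice accuracy computed on hypothetical predictions; it is an illustration only, not the leaderboard's actual evaluation code.

```python
# Minimal sketch: accuracy over multiple-choice QA (hypothetical labels).
def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of questions where the predicted choice matches the gold answer."""
    correct = sum(pred == gold for pred, gold in zip(predictions, references))
    return correct / len(references)

# The model gets 3 of 4 questions right, so accuracy is 0.75.
print(accuracy(["B", "C", "A", "D"], ["B", "C", "D", "D"]))
```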
MedQA
The MedQA dataset consists of multiple-choice questions from the United States Medical Licensing Examination (USMLE). It covers general medical knowledge and includes 11,450 questions in the development set and 1,273 questions in the test set. Each question has 4 or 5 answer choices, and the dataset is designed to assess the medical knowledge and reasoning skills required for medical licensure in the United States.
MedMCQA
MedMCQA is a large-scale multiple-choice QA dataset derived from Indian medical entrance examinations (AIIMS/NEET). It covers 2.4k healthcare topics and 21 medical subjects, with over 187,000 questions in the development set and 6,100 questions in the test set. Each question has 4 answer choices and is accompanied by an explanation. MedMCQA evaluates a model's general medical knowledge and reasoning capabilities.
PubMedQA
PubMedQA is a closed-domain QA dataset in which each question can be answered by looking at an associated context (a PubMed abstract). It consists of 1,000 expert-labeled question-answer pairs. Each question is accompanied by a PubMed abstract as context, and the task is to provide a yes/no/maybe answer based on the information in the abstract. The dataset is split into 500 questions for development and 500 for testing. PubMedQA assesses a model's ability to comprehend and reason over scientific biomedical literature.
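To inspect these QA datasets locally, you can load them with the Hugging Face datasets library. The sketch below loads the expert-labeled PubMedQA split; the dataset ID and config name are assumptions based on the commonly used Hub version, so check the Hub for the exact IDs the leaderboard relies on.

```python
from datasets import load_dataset

# Expert-labeled PubMedQA (1,000 question/abstract/answer triples).
# The dataset ID and config name are assumptions; verify them on the Hub.
pubmedqa = load_dataset("pubmed_qa", "pqa_labeled")

example = pubmedqa["train"][0]
print(example["question"])        # the biomedical question
print(example["context"])         # the associated PubMed abstract sections
print(example["final_decision"])  # gold answer: "yes", "no", or "maybe"
```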
MMLU Subsets (Medicine and Biology)
The MMLU benchmark (Measuring Massive Multitask Language Understanding) includes multiple-choice questions from various domains. For the Open Medical-LLM Leaderboard, we focus on the subsets most relevant to medical knowledge:
- Clinical Knowledge: 265 questions assessing clinical knowledge and decision-making skills.
- Medical Genetics: 100 questions covering topics related to medical genetics.
- Anatomy: 135 questions evaluating the knowledge of human anatomy.
- Professional Medicine: 272 questions assessing knowledge required for medical professionals.
- College Biology: 144 questions covering college-level biology concepts.
- College Medicine: 173 questions assessing college-level medical knowledge.
Each MMLU subset consists of multiple-choice questions with 4 answer options and is designed to evaluate a model's understanding of specific medical and biological domains.
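These subsets correspond to named configurations of the MMLU dataset on the Hub. As a sketch (the "cais/mmlu" repository and config names reflect the widely used Hub version, not necessarily the leaderboard's internal setup), a single subset can be loaded as follows:

```python
from datasets import load_dataset

# Load one medical MMLU subset; other configs include "medical_genetics",
# "anatomy", "professional_medicine", "college_biology", and "college_medicine".
clinical = load_dataset("cais/mmlu", "clinical_knowledge", split="test")

sample = clinical[0]
print(sample["question"])  # question text
print(sample["choices"])   # list of 4 answer options
print(sample["answer"])    # index (0-3) of the correct option
```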
The Open Medical-LLM Leaderboard thus offers a robust assessment of a model's performance across various aspects of medical knowledge and reasoning.
Insights and Evaluation
The Open Medical-LLM Leaderboard evaluates the performance of various large language models (LLMs) on a diverse set of medical question-answering tasks. Here are our key findings:
- Commercial models like GPT-4-base and Med-PaLM-2 consistently achieve high accuracy scores across the medical datasets, demonstrating strong performance in different medical domains.
- Open-source models, such as Starling-LM-7B, gemma-7b, Mistral-7B-v0.1, and Hermes-2-Pro-Mistral-7B, show competitive performance on certain datasets and tasks, despite having smaller sizes of around 7 billion parameters.
- Both commercial and open-source models perform well on tasks like comprehension and reasoning over scientific biomedical literature (PubMedQA) and applying clinical knowledge and decision-making skills (MMLU Clinical Knowledge subset).
Google's Gemini Pro demonstrates strong performance in various medical domains, particularly excelling in data-intensive and procedural tasks like Biostatistics, Cell Biology, and Obstetrics & Gynecology. However, it shows moderate to low performance in critical areas such as Anatomy, Cardiology, and Dermatology, revealing gaps that require further refinement for comprehensive medical application.
Submitting Your Model for Evaluation
To submit your model for evaluation on the Open Medical-LLM Leaderboard, follow these steps:
1. Convert Model Weights to Safetensors Format
First, convert your model weights to the safetensors format. Safetensors is a new format for storing weights that is safer and faster to load and use. Converting your model to this format will also allow the leaderboard to display the number of parameters of your model in the main table.
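If your checkpoint is still stored as PyTorch .bin files, one straightforward way to convert it is to reload the model and save it again with safe serialization enabled. This is a sketch using the standard Transformers API; the local paths and Hub ID are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths and repo ID -- replace them with your own.
model = AutoModelForCausalLM.from_pretrained("path/to/your-model")
tokenizer = AutoTokenizer.from_pretrained("path/to/your-model")

# safe_serialization=True writes the weights as .safetensors files.
model.save_pretrained("path/to/your-model-safetensors", safe_serialization=True)
tokenizer.save_pretrained("path/to/your-model-safetensors")

# Optionally push the converted checkpoint to the Hub.
model.push_to_hub("your-username/your-model-name", safe_serialization=True)
tokenizer.push_to_hub("your-username/your-model-name")
```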
2. Ensure Compatibility with AutoClasses
Before submitting your model, make sure that you can load your model and tokenizer using the AutoClasses from the Transformers library. Use the following code snippet to test compatibility:
from transformers import AutoConfig, AutoModel, AutoTokenizer

MODEL_HUB_ID = "your-username/your-model-name"  # replace with your model's Hub ID

config = AutoConfig.from_pretrained(MODEL_HUB_ID)
model = AutoModel.from_pretrained(MODEL_HUB_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_HUB_ID)
If this step fails, follow the error messages to debug your model before submitting it. It’s likely that your model has been improperly uploaded.
3. Make Your Model Public
Ensure that your model is publicly accessible. The leaderboard cannot evaluate models that are private or require special access permissions.
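If you originally pushed the repository as private, you can make it public either from the repository's Settings page on the Hub or programmatically. The snippet below is a sketch with a placeholder repo ID; the exact helper may differ slightly between huggingface_hub versions.

```python
from huggingface_hub import HfApi

api = HfApi()
# Make the (placeholder) model repository public so the leaderboard can access it.
api.update_repo_visibility(repo_id="your-username/your-model-name", private=False)
```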
4. Remote Code Execution (Coming Soon)
Currently, the Open Medical-LLM Leaderboard does not support models that require use_remote_code=True. However, the leaderboard team is actively working on adding this feature, so stay tuned for updates.
5. Submit Your Model via the Leaderboard Website
Once your model is in the safetensors format, compatible with the AutoClasses, and publicly accessible, you can submit it for evaluation using the "Submit here!" panel on the Open Medical-LLM Leaderboard website. Fill out the required information, such as the model name, description, and any additional details, and click the submit button.
The leaderboard team will process your submission and evaluate your model's performance on the various medical QA datasets. Once the evaluation is complete, your model's scores will be added to the leaderboard, allowing you to compare its performance with other submitted models.
What’s next? Expanding the Open Medical-LLM Leaderboard
The Open Medical-LLM Leaderboard is committed to expanding and adapting to meet the evolving needs of the research community and the healthcare industry. Key areas of focus include:
- Incorporating a wider range of medical datasets covering diverse aspects of healthcare, such as radiology, pathology, and genomics, through collaboration with researchers, healthcare organizations, and industry partners.
- Enhancing evaluation metrics and reporting capabilities by exploring additional performance measures beyond accuracy, such as pointwise scores and domain-specific metrics that capture the unique requirements of medical applications.
A few efforts are already underway in this direction. If you are interested in collaborating on the next benchmark we are planning to propose, please join our Discord community to learn more and get involved. We would love to collaborate and brainstorm ideas!
If you're passionate about the intersection of AI and healthcare, building models for the healthcare domain, and care about safety and hallucination issues for medical LLMs, we invite you to join our vibrant community on Discord.
Credits and Acknowledgments
Special thanks to everyone who helped make this possible, including Clémentine Fourrier and the Hugging Face team. I would like to thank Andreas Motzfeldt, Aryo Gema, and Logesh Kumar Umapathi for their discussion and feedback on the leaderboard during development. Sincere gratitude to Prof. Pasquale Minervini for his time, technical assistance, and for providing GPU support from the University of Edinburgh.
About Open Life Science AI
Open Life Science AI is a project that aims to revolutionize the application of artificial intelligence in the life science and healthcare domains. It serves as a central hub for medical models, datasets, benchmarks, and conference deadline tracking, fostering collaboration, innovation, and progress in the field of AI-assisted healthcare. We strive to establish Open Life Science AI as the premier destination for anyone interested in the intersection of AI and healthcare. We provide a platform for researchers, clinicians, policymakers, and industry experts to engage in dialogues, share insights, and explore the latest developments in the field.
Citation
If you find our evaluations useful, please consider citing our work:
Medical-LLM Leaderboard
@misc{Medical-LLM Leaderboard,
  author = {Ankit Pal, Pasquale Minervini, Andreas Geert Motzfeldt, Aryo Pradipta Gema and Beatrice Alex},
  title = {openlifescienceai/open_medical_llm_leaderboard},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = "\url{https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard}"
}


![Example of GPT-3 incorrectly recommending tetracycline for a pregnant patient](https://github.com/monk1337/research_assets/blob/main/huggingface_blog/gpt_medicaltest.png?raw=true)

![Model evaluation results on the Open Medical-LLM Leaderboard](https://github.com/monk1337/research_assets/blob/main/huggingface_blog/model_evals.png?raw=true)

![Subject-wise evaluation results](https://github.com/monk1337/research_assets/blob/main/huggingface_blog/subjectwise_eval.png?raw=true)

