HMS MGB Broad YLab

📜 Background

Recent advances in Large Language Models (LLMs) have demonstrated transformative potential in healthcare, yet concerns remain about their reliability and clinical validity across diverse clinical tasks, specialties, and languages. To support timely and trustworthy evaluation, and building upon our systematic review of global clinical text resources, we introduce BRIDGE, a multilingual benchmark comprising 87 real-world clinical text tasks that span nine languages and more than one million samples. Furthermore, we construct this leaderboard of LLMs in clinical text understanding by systematically evaluating 52 state-of-the-art LLMs (as of 2025/04/28).

This project is led and maintained by the team of Prof. Jie Yang and Prof. Kueiyu Joshua Lin at Harvard Medical School and Brigham and Women's Hospital.

πŸ† BRIDGE Leaderboard

BRIDGE features three leaderboards, each evaluating LLM performance on clinical text tasks under a distinct inference strategy:

- Zero-shot: the model answers directly from the task instruction and input.
- Chain-of-Thought (CoT): the model is prompted to reason step by step before giving its final answer.
- Five-shot: five completed examples are provided as in-context demonstrations.

In addition, BRIDGE offers multiple model filters and task filters to enable users to explore LLM performance across different clinical contexts, empowering researchers and clinicians to make informed decisions and track model advancements over time.


🌍 Key Features

More details can be found in our BRIDGE paper and systematic review.

🛠️ How to Evaluate Your Model on BRIDGE?

📂 Dataset Access

All fully open-access datasets in BRIDGE are available in BRIDGE-Open. To ensure the fairness of this leaderboard, we publicly release the following data for each task: five completed samples that serve as few-shot examples, and all test samples with their instructions and input information.
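
As an illustration, if a BRIDGE-Open task file is hosted on the Hugging Face Hub, it could be loaded roughly as in the sketch below. This is a minimal sketch only: the repository id, file name, and field names (`instruction`, `input`) are assumptions for illustration, so please check the BRIDGE-Open release for the actual identifiers.

```python
# Minimal sketch of loading one BRIDGE-Open task with the Hugging Face
# `datasets` library. The repo id, file name, and field names below are
# hypothetical placeholders; use the actual ones from the BRIDGE-Open release.
from datasets import load_dataset

data = load_dataset(
    "YLab-Open/BRIDGE-Open",         # hypothetical repository id
    data_files="example_task.json",  # hypothetical task file name
)

sample = data["train"][0]
print(sample["instruction"])  # released task instruction (assumed field name)
print(sample["input"])        # released input information (assumed field name)
```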

Due to the privacy and security requirements of clinical data, regulated-access datasets cannot be published directly. However, detailed descriptions of every task and its data source are available in our BRIDGE paper. Importantly, all 87 datasets were verified in our systematic review to be either fully open-access or publicly accessible upon reasonable request.

🔥 Result Submission and Model Evaluation

If you would like to see how an unevaluated model performs on BRIDGE, please follow these steps:

We will review and evaluate your submission and update the leaderboard accordingly.

Code Reference: For details on LLM inference, result extraction, and the evaluation scheme, please refer to our BRIDGE GitHub repo.
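
For orientation, a submission pipeline typically looks like the sketch below: run your model over the released test samples and collect predictions in a structured file. This is only an assumed outline, not the repository's actual interface; the file names, field names, and output format here are hypothetical, so follow the formats defined in the BRIDGE GitHub repo.

```python
# Hypothetical sketch of an inference-and-collection loop for one BRIDGE task.
# All file names and field names are assumptions; the authoritative inference,
# result-extraction, and evaluation scripts live in the BRIDGE GitHub repo.
import json

def run_model(prompt: str) -> str:
    """Replace with your model's generation call (API or local inference)."""
    raise NotImplementedError

with open("example_task_test.json", encoding="utf-8") as f:  # hypothetical file
    samples = json.load(f)

results = []
for i, s in enumerate(samples):
    # BRIDGE releases an instruction and input for each test sample;
    # the exact prompt template should follow the official repo.
    prompt = f"{s['instruction']}\n\n{s['input']}"
    results.append({"index": i, "prediction": run_model(prompt)})

with open("example_task_predictions.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```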

🚨 Important: Due to computational resource constraints, our team may not be able to test all proposed models, and results may appear on the leaderboard with some delay after submission.

📒 Updates

🤝 Contributing

We welcome and greatly value contributions and collaborations from the community! If you have clinical text datasets that you would like to add to the BRIDGE benchmark, please fill out the Google Form and let us know!

We are committed to expanding BRIDGE while strictly adhering to appropriate data use agreements and ethical guidelines. Let's work together to advance the responsible application of LLMs in medicine!

🚀 Donation

BRIDGE is a non-profit, researcher-led benchmark that requires substantial resources (e.g., high-performance GPUs and a dedicated team) to sustain. To support open and impactful academic research that advances clinical care, we welcome your contributions. Please contact Prof. Jie Yang at jyang66@bwh.harvard.edu to discuss donation opportunities.

📬 Contact Information

If you have any questions about BRIDGE or the leaderboard, feel free to contact us!

📚 Citation

If you find this leaderboard useful for your research and applications, please cite the following papers:

@article{BRIDGE-benchmark,
    title={BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text},
    author={Wu, Jiageng and Gu, Bowen and Zhou, Ren and Xie, Kevin and Snyder, Doug and Jiang, Yixing and Carducci, Valentina and Wyss, Richard and Desai, Rishi J and Alsentzer, Emily and Celi, Leo Anthony and Rodman, Adam and Schneeweiss, Sebastian and Chen, Jonathan H. and Romero-Brufau, Santiago and Lin, Kueiyu Joshua and Yang, Jie},
    year={2025},
    journal={arXiv preprint arXiv:2504.19467},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2504.19467},
}
@article{clinical-text-review,
    title={Clinical text datasets for medical artificial intelligence and large language models—a systematic review},
    author={Wu, Jiageng and Liu, Xiaocong and Li, Minghui and Li, Wanxin and Su, Zichang and Lin, Shixu and Garay, Lucas and Zhang, Zhiyun and Zhang, Yujie and Zeng, Qingcheng and Shen, Jie and Yuan, Changzheng and Yang, Jie},
    journal={NEJM AI},
    volume={1},
    number={6},
    pages={AIra2400012},
    year={2024},
    publisher={Massachusetts Medical Society}
}

If you use the datasets in BRIDGE, please also cite the original paper of each dataset; these references can be found in our BRIDGE paper.