Emerging generative artificial intelligence (AI) models exhibit remarkable abilities to process and produce text and other media at scale, and perform strongly across a wide range of clinical tasks. Harnessing these abilities could improve the accuracy, efficiency, and accessibility of healthcare for patients around the world. For a primer on the technology behind state-of-the-art generative AI and its potential applications in clinical settings, see these reviews in Nature Medicine and Ophthalmology Science. For reassurance that doctors are not about to be replaced by computers, see my piece in the Journal of the Royal Society of Medicine.
ChatGPT initially made headlines in the medical world for passing medical school examinations, and we trialled the same model on an examination aimed at fully qualified doctors, revealing the early strengths and weaknesses of this technology: JMIR Medical Education.
However, exam results are poor indicators of clinical performance without context. To establish the clinical reasoning and recall abilities of flagship LLMs, we recruited ophthalmologists at every stage of training to provide benchmark comparators. We found that the strongest LLM (GPT-4) matched consultant and attending ophthalmologists (E1-E5), indicating expert-level performance. The full study was published in PLOS Digital Health.
An enormous number of published studies try to establish the ability of generative AI to provide useful advice to clinicians and patients. This is a new genre of research, but a lack of reporting guidelines in this growing space means that many studies do not provide actionable or interpretable information for clinicians and researchers to build upon. In a systematic review of the early literature base (JAMA Network Open), we found that poor reporting is common: for example, 99.3% of studies failed to provide sufficient information to identify the AI model tested.
CHART is a multinational initiative developing a reporting guideline that caters to this new type of clinical research study. By involving clinicians, computer scientists, researchers, statisticians, and patients, CHART aims to provide an accessible and comprehensive tool that empowers researchers to design and report studies which provide useful information to inform clinical practice and development work. An introduction to the initiative was published in Nature Medicine.
Systematic review is the foundation of evidence-based medicine, as well as the gold standard for evidence synthesis across academic disciplines. Many of the processes involved in systematic review are text-based, rules-based, and repetitive, which makes them amenable to automation with large language models (LLMs).
Abstract screening is often one of the most time-consuming stages of systematic review: large numbers of study records must be checked against explicit inclusion/exclusion criteria to whittle down the number of studies that require full-text review. A multicentre team led from Oxford completed an extensive validation exercise of LLMs designed to automate abstract screening, across a representative sample of gold-standard systematic reviews from The Cochrane Library. Our findings (Journal of the American Medical Informatics Association) indicate that LLMs have excellent potential to improve the efficiency of systematic review, and we demonstrate how best to combine LLM and human researcher decisions 'in series' and 'in parallel', as sketched below. Our code is freely available so others can make use of our approach!
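To make the ensemble logic concrete, here is a minimal Python sketch of the two combination rules. It is illustrative only, not the released pipeline: `screen_abstract`, `ask_llm`, and the prompt wording are hypothetical stand-ins.

```python
# Illustrative sketch of LLM-assisted abstract screening (not the released pipeline).

def screen_abstract(abstract: str, criteria: str, ask_llm) -> bool:
    """Apply explicit inclusion/exclusion criteria to one abstract.

    `ask_llm` is any callable that sends a prompt to a model and returns
    its text response; the prompt wording here is a hypothetical stand-in.
    """
    prompt = (
        "You are screening records for a systematic review.\n"
        f"Inclusion/exclusion criteria:\n{criteria}\n\n"
        f"Abstract:\n{abstract}\n\n"
        "Answer INCLUDE or EXCLUDE only."
    )
    return ask_llm(prompt).strip().upper().startswith("INCLUDE")


def in_series(llm_vote: bool, human_vote: bool) -> bool:
    """Include a record only if BOTH screeners vote to include
    (favours specificity: fewer irrelevant records reach full-text review)."""
    return llm_vote and human_vote


def in_parallel(llm_vote: bool, human_vote: bool) -> bool:
    """Include a record if EITHER screener votes to include
    (favours sensitivity: relevant records are less likely to be missed)."""
    return llm_vote or human_vote
```

The trade-off mirrors conventional dual screening: 'in series' cuts reviewer workload at some cost to sensitivity, while 'in parallel' protects sensitivity at the cost of extra full-text review.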
A growing volume of scan requests is placing increasing pressure on radiologists, who are in scarce supply around the world, including within higher-income countries. With researchers from the University of Birmingham, Queen Mary University of London, Guangdong Institute of Technology, and Meta, we have developed μ²LLM: a small multimodal large language model that integrates guided questions to preserve critical details when interpreting 3-dimensional computed tomography (CT) scans. Model training is enhanced with direct preference optimisation (DPO), as used to develop early reasoning models such as DeepSeek-R1; a sketch of the objective follows below.
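μ²LLM's training code is not reproduced here, but the standard DPO objective it builds on is compact enough to sketch. In this minimal PyTorch version the tensor names are my own: each holds the summed log-probability of a full report under the trainable policy or the frozen reference model.

```python
import torch.nn.functional as F
from torch import Tensor

def dpo_loss(policy_chosen: Tensor, policy_rejected: Tensor,
             ref_chosen: Tensor, ref_rejected: Tensor,
             beta: float = 0.1) -> Tensor:
    """Standard DPO objective (Rafailov et al., 2023).

    Each input is a batch of summed log-probabilities of a preferred
    ('chosen') or dispreferred ('rejected') response. The loss rewards
    the policy for ranking chosen responses above rejected ones, with
    beta controlling how far it may drift from the reference model.
    """
    chosen_margin = beta * (policy_chosen - ref_chosen)
    rejected_margin = beta * (policy_rejected - ref_rejected)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```

For CT reporting, the preference pairs might contrast a radiologist-approved report against one containing omissions or hallucinated findings, so the gradient directly penalises the failure modes that matter clinically.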
μ²LLM consistently outperforms larger baseline models across multiple CT datasets, and its small size makes it feasible to deploy locally, without having to share sensitive imaging data with external server providers or model hosts. The model is multilingual, with consistent performance when reporting scans in English and in Mandarin Chinese.
Our findings will be presented at MICCAI 2025.
As AI applications proliferate in healthcare, it is imperative that patients are safeguarded from new risks. However, regulation should not stymie innovation and progress, so stakeholders' priorities must be balanced carefully. A Partnership for Oversight, Leadership, and Accountability in Regulating Intelligent Systems–Generative Models in Medicine (POLARIS-GM) has been set up to bring technical and ethical expertise together to help regulators navigate the rapidly evolving landscape of generative AI. Our initial statement can be found in Nature Medicine.