Generative artificial intelligence in medicine
Emerging generative artificial intelligence (AI) models exhibit remarkable abilities to process and produce text and other media at scale, and perform strongly across a wide range of clinical tasks. Harnessing these abilities could improve the accuracy, efficiency, and accessibility of healthcare for patients around the world. For a primer on the technology behind state-of-the-art generative AI and its potential applications in clinical settings, see these reviews in Nature Medicine or Ophthalmology Science. For reassurance that doctors are not about to be replaced by computers, see my piece in the Journal of the Royal Society of Medicine.
Assessing the clinical potential of LLMs
ChatGPT initially made headlines in the medical world for passing medical school examinations. We trialled the same model on a test aimed at fully qualified doctors, which revealed the early strengths and weaknesses of this technology: JMIR Medical Education.
However, examination results are poor indicators of clinical performance without context. To establish the clinical reasoning and recall ability of flagship large language models (LLMs), we recruited ophthalmologists at every stage of training to provide benchmark comparators. We found that the strongest LLM (GPT-4) matched the consultant and attending ophthalmologists (E1-E5), indicating expert-level performance. The full study was published in PLOS Digital Health.
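To make the comparison concrete, here is a minimal Python sketch of how model and clinician performance can be scored on a shared question set. All question stems, answers, and scores below are placeholders for illustration; this is not the study's dataset or analysis code.

```python
# Minimal sketch (placeholder data, not study code): score an LLM and
# clinician comparators on the same multiple-choice question set.

from dataclasses import dataclass

@dataclass
class MCQ:
    stem: str      # question text
    answer: str    # correct option letter

def accuracy(predictions: list[str], questions: list[MCQ]) -> float:
    """Fraction of questions answered with the correct option letter."""
    correct = sum(p == q.answer for p, q in zip(predictions, questions))
    return correct / len(questions)

questions = [
    MCQ("Placeholder stem 1", answer="A"),
    MCQ("Placeholder stem 2", answer="B"),
    MCQ("Placeholder stem 3", answer="C"),
]

# In a benchmarking design like this, answers are elicited from the LLM
# and from clinicians at each career stage on identical questions;
# here they are hard-coded placeholders.
responses = {
    "LLM": ["A", "B", "C"],
    "Trainee": ["A", "B", "A"],
    "Consultant": ["A", "B", "C"],
}

for name, answers in responses.items():
    print(f"{name}: {accuracy(answers, questions):.2f}")
```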
Chatbot Assessment Reporting Tool (CHART)
An enormous number of published studies seek to establish the ability of generative AI to provide useful advice to clinicians and patients. This is a new genre of research, but the lack of reporting guidelines in this growing space means that many studies do not provide actionable or interpretable information for clinicians and researchers to build upon. In a systematic review of the early literature base (JAMA Network Open), we found that poor reporting is common: for example, 99.3% of studies failed to provide sufficient information to identify the AI model tested.
CHART is a multinational initiative directed towards developing a reporting guideline that caters to this new type of clinical research study. By involving clinicians, computer scientists, researchers, statisticians, and patients, CHART aims to provide an accessible and comprehensive tool that empowers researchers to design and report studies capable of informing clinical practice and development work. An introduction to the initiative was published in Nature Medicine.
Automating systematic review
Systematic review is the foundation of evidence-based medicine and the gold standard for evidence synthesis across academic disciplines. Many of its constituent processes are text-based, rules-based, and repetitive, which makes them amenable to automation with LLMs.
Abstract screening is often one of the most time-consuming stages of systematic review: large numbers of study records must be checked against explicit inclusion/exclusion criteria to whittle down the set of studies requiring more detailed review. A multicentre team, led from Oxford, completed an extensive validation exercise for LLMs designed to automate abstract screening across a representative sample of gold-standard systematic reviews from The Cochrane Library. Our findings (arXiv) indicate that LLMs have excellent potential to improve the efficiency of systematic review, and we demonstrate how best to combine LLM and human researcher decisions 'in series' and 'in parallel'. Our code is freely available so others can make use of our approach!
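As a concrete illustration of what screening 'in series' and 'in parallel' can mean, here is a short Python sketch. The prompt wording, the model name ("gpt-4o"), and the exact combination rules are illustrative assumptions rather than the code released with the paper; human_review stands in for a human screener's include/exclude decision.

```python
# Illustrative sketch only: the prompt, model name, and the precise
# 'series'/'parallel' rules below are assumptions, not the paper's code.

from openai import OpenAI  # assumes the official OpenAI Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_includes(abstract: str, criteria: str, model: str = "gpt-4o") -> bool:
    """Ask the model to apply explicit inclusion/exclusion criteria to a
    study abstract and return a binary include/exclude decision."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You screen abstracts for a systematic review. "
                        "Reply with exactly INCLUDE or EXCLUDE."},
            {"role": "user",
             "content": f"Criteria:\n{criteria}\n\nAbstract:\n{abstract}"},
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("INCLUDE")

def screen_in_series(abstract, criteria, human_review) -> bool:
    # One reading of 'in series': the LLM screens first, and only the
    # records it includes are passed to a human, so a study advances
    # only if both agree. This cuts human workload and raises specificity.
    return llm_includes(abstract, criteria) and human_review(abstract)

def screen_in_parallel(abstract, criteria, human_review) -> bool:
    # One reading of 'in parallel': LLM and human screen independently,
    # and a record advances if either includes it. This raises
    # sensitivity, so fewer eligible studies are missed.
    return llm_includes(abstract, criteria) or human_review(abstract)
```

The design choice mirrors diagnostic-test combination: requiring agreement ('in series') trades sensitivity for specificity and reviewer time, while accepting either decision ('in parallel') does the reverse, letting review teams pick the operating point that suits their tolerance for missed studies.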