Following my recent post on whether or not AI will replace pathologists and radiologists, I thought it would be a nice segue to talk about how AI-generated advice is perceived by clinicians, and to combine it with the new “5-Minute Paper” format. Setting aside the question of algorithm accuracy itself, there is a risk that doctors will simply “turn off their brains” and trust the automated advice, assuming it to be superior to human experience or intuition. So today I’m going to walk through a recent study that tested this possibility.
The Study:
Gaube, S., Suresh, H., Raue, M. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. npj Digit. Med. 4, 31 (2021)
What is this study about?
This study provided doctors with x-rays and diagnostic advice labeled as coming from either an AI system or a human, and evaluated how the perceived source affected clinicians’ assessment of the quality and reliability of that information. The researchers also tested whether the advice affected participants’ diagnostic accuracy. This line from the paper’s Introduction sums up the key issue:
“AI systems will only be able to provide real clinical benefit if the physicians using them are able to balance trust and skepticism. If physicians do not trust the technology, they will not use it, but blind trust in the technology can lead to medical errors.”
Who did the research?
This was a multi-institutional effort by computer scientists, physicians, and psychologists from around the world, based at Boston Medical Center, Beth Israel Medical Center, MIT, LMU Munich, the University of Regensburg (Germany), and the University of Toronto (Canada).
Who paid for it?
The paper itself does not declare a funding source or any financial conflicts of interest. The acknowledgements note that one author was supported by a research scholarship from the Konrad Adenauer Foundation, while another received funding from several sources, including Microsoft Research, a Canada Research Chair, an NSERC Discovery Grant, and a CIFAR AI Chair at the Vector Institute.
What did they do?
Two groups of physicians from the USA and Canada were shown 8 chest x-ray cases, each with a summary of findings and a primary diagnosis, as depicted in Figure 1 above
The “high-expertise” group was composed of 138 radiologists
The “low-expertise” group was composed of 127 physicians trained in internal medicine / emergency medicine
All x-ray images, case vignettes, radiographic findings and diagnoses were selected and edited by a panel of 3 expert radiologists
Advice was labeled as coming from either a human or an AI system, as follows (a toy code sketch of this label swap appears after this list):
AI: “The findings and primary diagnoses were generated by CHEST-AI, a well-trained, deep-learning-based artificial intelligence (AI) model with a performance record (regarding diagnostic sensitivity and specificity) on par with experts in the field”
Human: “The findings and primary diagnoses were generated by Dr. S. Johnson, an experienced radiologist with a performance record (regarding diagnostic sensitivity and specificity) on par with experts in the field”
Participants were asked to view the case information and DICOM images in a web platform, rate the quality of the advice provided, and render their own diagnosis
This experiment was pre-registered and complied with all applicable ethics and regulatory requirements
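To make the source-labeling manipulation concrete, here is a toy Python sketch (not the authors’ experiment code; the function and variable names are my own illustration) showing how the same findings and diagnosis could be presented with only the attribution line swapped between conditions, using the label text quoted above:

```python
# Toy sketch of the source-label manipulation: identical advice text,
# only the attribution changes between the "AI" and "human" conditions.
# (Illustrative only; nothing here beyond the quoted labels comes from the study materials.)

AI_LABEL = (
    "The findings and primary diagnoses were generated by CHEST-AI, a well-trained, "
    "deep-learning-based artificial intelligence (AI) model with a performance record "
    "(regarding diagnostic sensitivity and specificity) on par with experts in the field"
)

HUMAN_LABEL = (
    "The findings and primary diagnoses were generated by Dr. S. Johnson, an experienced "
    "radiologist with a performance record (regarding diagnostic sensitivity and "
    "specificity) on par with experts in the field"
)

def present_case(source: str, findings: str, diagnosis: str) -> str:
    """Build the advice text shown with a case; `source` is 'AI' or 'human'."""
    label = AI_LABEL if source == "AI" else HUMAN_LABEL
    return f"{label}\n\nFindings: {findings}\nPrimary diagnosis: {diagnosis}"

# Same hypothetical case, two different attributions:
print(present_case("AI", "Right lower lobe airspace opacity.", "Pneumonia"))
print(present_case("human", "Right lower lobe airspace opacity.", "Pneumonia"))
```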
Data availability:
All study data and analyses are open source and available here
What did they find?
Both experts and non-experts rated accurate advice as high quality; only the radiologists rated the inaccurate advice as significantly lower quality
Radiologists rated advice labeled as coming from a human as higher quality than “AI” advice overall, regardless of accuracy (“algorithmic aversion”), while there was no significant difference in the non-expert group
Both groups’ diagnoses were significantly more accurate when they were given accurate advice than when they were given inaccurate advice. However, accuracy did not vary significantly with the human vs AI label
The non-task-expert group of IM/ER doctors was more susceptible to inaccurate advice than the radiologists were (42% vs 28%), regardless of source (see the short code sketch after this list)
The diagnostic accuracy also varied substantially depending on the specifics of the case (see Figure 5 below).
Some cases, such as cases 3 and 8, exploited known pitfalls in radiographic interpretation and showed very wide gaps in performance
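To make the “susceptibility” comparison concrete, here is a minimal Python sketch of how the share of inaccurate-advice trials in which the final diagnosis followed the wrong advice might be computed from trial-level data. The column names (participant_group, advice_accurate, followed_advice, diagnosis_correct) and the toy rows are my own assumptions; the released study data may be structured differently:

```python
# Minimal sketch (not the authors' analysis code) of computing susceptibility
# to inaccurate advice and diagnostic accuracy from hypothetical trial-level data.
import pandas as pd

# Toy records: one row per (participant, case) trial. Real data would have many more rows.
trials = pd.DataFrame({
    "participant_group": ["radiologist", "radiologist", "IM/ER", "IM/ER"],
    "advice_accurate":   [False, True, False, True],
    "followed_advice":   [False, True, True, True],
    "diagnosis_correct": [True, True, False, True],
})

# Susceptibility: among trials with inaccurate advice, the fraction where the
# participant's final diagnosis followed that (wrong) advice, by expertise group.
inaccurate = trials[~trials["advice_accurate"]]
susceptibility = inaccurate.groupby("participant_group")["followed_advice"].mean()
print(susceptibility)  # with the real data, this is where the ~42% vs ~28% gap shows up

# Diagnostic accuracy split by whether the advice was accurate, mirroring the finding above.
accuracy = trials.groupby(["participant_group", "advice_accurate"])["diagnosis_correct"].mean()
print(accuracy)
```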
What are the take home messages?
The study authors concluded that inaccurate diagnostic advice from any source (AI or human) risked biasing clinicians with less task expertise, while task experts were less susceptible to incorrect advice (and also showed evidence of “algorithmic aversion”):
“This observed over-reliance has important implications for automated advice systems. While physicians are currently able to ask for advice from colleagues, they typically ask for advice after their initial review of the case. Clinical support systems based on AI or more traditional methods could prime physicians to search for confirmatory information in place of conducting a thorough and critical evaluation. If the underlying model has a higher diagnostic accuracy than the physicians using it, patient outcomes may improve overall. However, for high-risk settings like diagnostic decision-making, over-reliance on advice can be dangerous and steps should be taken to minimize it, especially when the advice is inaccurate”
In my opinion, this study does a great job of demonstrating proof of concept for the risk of over-reliance on automated clinical decision support systems, and it should inform their design and implementation in the future. I would be very interested to see similar studies for bloodwork, biopsies, and EKGs; it seems likely that a similar phenomenon would play out.