In most doctors’ offices these days, you’ll find a pattern: Everybody’s Googling, all the time. Physicians search for clues to a diagnosis, or for reminders on the best treatment plans. Patients scour WebMD, tapping in their symptoms and doomscrolling a long list of possible problems.
But those constant searches leave something to be desired. Doctors don’t have the time to sift through pages of results, and patients don’t have the knowledge to digest medical research. Everybody has trouble finding the most reliable information.
Optimists believe artificial intelligence could help solve those problems, but the bots might not be ready for prime time. In a recent paper, Dr. Gary Franklin, a University of Washington research professor of environmental & occupational health sciences and of neurology in the UW School of Medicine, described a troubling experience with Google’s Gemini chatbot. When Franklin asked Gemini for information on the outcomes of a specific procedure – a decompressive brachial plexus surgery – the bot gave a detailed answer that cited two medical studies, neither of which existed.
Franklin wrote that it’s “buyer beware when it comes to using AI Chatbots for the purposes of extracting accurate scientific information or evidence-based guidance.” He recommended that AI experts develop specialized chatbots that pull information only from verified sources.
One expert working toward a solution is Lucy Lu Wang, a UW assistant professor in the Information School who focuses on making AI better at understanding and relaying scientific information. Wang has developed tools to extract important information from medical research papers, verify scientific claims, and make scientific images accessible to blind and low-vision readers.
UW News sat down with Franklin and Wang to discuss how AI could enhance health care, what’s standing in the way, and whether there’s a downside to democratizing medical research.
Each of you has studied the possibilities and perils of AI in health care, including the experiences of patients who ask chatbots for medical information. In a best-case scenario, how do you envision AI being used in health and medicine?
Gary Franklin: Doctors use Google a lot, but they also rely on services like UpToDate, which provide really great summaries of medical information and research. Most doctors have zero time and just want to be able to read something very quickly that is well documented. So from a physician’s perspective trying to find truthful answers, trying to make my practice more efficient, trying to coordinate things better — if this technology could meaningfully contribute to any of those things, then it would be unbelievably great.
I’m not sure how much doctors will use AI, but for many years, patients have been coming in with questions about what they found on the internet, like on WebMD. AI is just the next step of patients doing this, getting some guidance about what to do with the advice they’re getting. As an example, if a patient sees a surgeon who’s overly aggressive and says they need a big procedure, the patient could ask an AI tool what the broader literature might recommend. And I have concerns about that.
Lucy Lu Wang: I’ll take this question from the clinician’s perspective, and then from the patient’s perspective.
From the clinician’s perspective, I agree with what Gary said. Clinicians want to look up information very quickly because they’re so taxed and there’s limited time to treat patients. And you can imagine if the tools that we have, these chatbots, were actually very good at searching for information and very good at citing accurately, that they could become a better replacement for a type of tool like UpToDate, right? Because UpToDate is good, it’s human-curated, but it doesn’t always contain the most fine-grained information you might be looking for.
These tools could also potentially help clinicians with patient communication, because there’s not always enough time to follow up or explain things in a way that patients can understand. It’s an add-on part of the job for clinicians, and that’s where I think language models and these tools, in an ideal world, could be really beneficial.
Lastly, on the patient’s side, it would be really amazing to develop these tools that help with patient education and help increase the overall health literacy of the population, beyond what WebMD or Google does. These tools could engage patients with their own health and health care more than before.
Zooming out from the individual to the systemic, do you see any ways AI could make health systems as a whole function more smoothly?
GF: One thing I’m curious about is whether these tools can be used to help with coordination across the health care system and between physicians. Right now, that coordination is horrible. There was a book called “Crossing the Quality Chasm” that argued the main problem in American medicine is poor coordination across specialties, or between primary care and anybody else. It’s still horrible, because there’s no function in the medical field that actually does that. So that’s another question: Is there a role here for this kind of technology in coordinating health care?
LLW: There’s been a lot of work on tools that can summarize a patient’s medical history in their clinical notes, and that could be one way to perform this kind of communication between specialties. There’s another component, too: If patients can directly interact with the system, we can construct a better timeline of the patient’s experiences and how that relates to their clinical medical care.
We’ve done qualitative research with health care seekers that suggests there are lots of types of questions that people are less willing to ask their clinical provider, but much more willing to put into one of these models. So the models themselves are potentially addressing unmet needs that patients aren’t willing to directly share with their doctors.
What’s standing in the way of these best-case scenarios?
LLW: I think there are both technical challenges and socio-technical challenges. In terms of technical challenges, a lot of these models’ training doesn’t currently make them effective for tasks like scientific search and summarization.
First, these current chatbots are mostly trained to be general-purpose tools, so they’re meant to be OK at everything, but not great at anything. And I think there will be more targeted development towards these more specific tasks, things like scientific search with citations that Gary mentioned before. The current training methods tend to produce models that are instruction-following, and have a very large positive response bias in their outputs. That can lead to things like generating answers with citations that support the answer, even if those citations don’t exist in the real world. These models are also trained to be overconfident in their responses. If the way the model communicates is positive and overconfident, then it’s going to lead to lots of problems in a domain like health care.
And then, of course, there are socio-technical problems: for example, maybe these models should be developed with the specific goal of supporting scientific search. People are, in fact, working toward these things and have demonstrated good preliminary results.
GF: So are the folks in your field pretty confident that that can be overcome in a fairly short time?
LLW: I think the citation problem has already been overcome in research demonstration cases. If we, for example, hook up an LLM to PubMed search and allow it only to cite conclusions based on articles that are indexed in PubMed, then actually the models are very faithful to citations that are retrieved from that search engine. But if you use Gemini and ChatGPT, those are not always hooked up to those research databases.
GF: The problem is that a person trying to search using those tools doesn’t know that.
LLW: Right, that’s a problem. People tend to trust these things because, as an example, we now have AI-generated answers at the top of Google search, and people have historically trusted Google search to only index documents that people have written, maybe putting the ones that are more trustworthy at the top. But that AI-generated response can be full of misinformation. What’s happening is that some people are losing trust in traditional search as a consequence. It’s going to be hard to build back that trust, even if we improve the technology.
We’re really at the beginning of this technology. It took a long time for us to develop meaningful resources on the internet, things like Wikipedia or PubMed. Right now, these chatbots are general-purpose tools, but there are already starting to be mixtures of models underneath. And in the future, they’re going to get better at routing people’s queries to the correct expert models, whether that’s to the model hooked up to PubMed or to trusted documents published by various associations related to health care. And I think that’s likely where we’re headed in the next couple of years.
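For readers curious what "only cite what the search returned" might look like in practice, here is a minimal sketch of the idea Wang describes: compare the citations in a model's answer against the set of article IDs a search actually retrieved, and flag anything else. This is an illustration, not the method used in any study mentioned here. The `[PMID:...]` citation format, the function names, and the ID numbers are all hypothetical placeholders, and the sketch makes no real call to PubMed or to any chatbot.

```python
import re

# Assumption for this sketch: the model is asked to cite sources
# inline as [PMID:<digits>]. Real systems may use other formats.
CITATION_PATTERN = re.compile(r"\[PMID:(\d+)\]")


def extract_cited_ids(answer: str) -> list[str]:
    """Return cited IDs in order of first appearance, without duplicates."""
    seen: list[str] = []
    for pmid in CITATION_PATTERN.findall(answer):
        if pmid not in seen:
            seen.append(pmid)
    return seen


def check_citations(answer: str, retrieved_ids: set[str]) -> dict[str, list[str]]:
    """Split cited IDs into those the search returned and those it did not."""
    cited = extract_cited_ids(answer)
    return {
        "supported": [p for p in cited if p in retrieved_ids],
        "unsupported": [p for p in cited if p not in retrieved_ids],
    }


# Placeholder IDs standing in for the results of a literature search.
retrieved = {"11111111", "22222222"}

# Placeholder answer text; the second citation was never retrieved.
answer = (
    "Outcomes were mixed [PMID:11111111], "
    "with one report of benefit [PMID:99999999]."
)

report = check_citations(answer, retrieved)
print(report)
```

In this toy example, the second citation lands in the "unsupported" list because it never appeared in the search results, which is the kind of warning a patient or doctor using a general-purpose chatbot would not otherwise see.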
Trust and reliability issues aside, are there any potential downsides to deploying these tools widely? I can see a potential problem with people using chatbots to self-diagnose when it might be preferable to see a provider.
LLW: You think of a resource like WebMD: Was that a net positive or net negative? Before its existence, patients really did have a hard time finding any information at all. And of course, there’s limited face time with clinicians where people actually get to ask those questions. So for every patient who wrongly self-diagnoses on WebMD, there are probably also hundreds of patients who found a quick answer to a question. I think that with these models, it’s going to be similar. They’re going to help address some of the gaps in clinical care where we don’t currently have enough resources.
For more information or to reach the researchers, email Alden Woods at acwoods@uw.edu.