By Mary Carpenter
“ChatGPT . . . interacts in a conversational way,” according to the blog from its creator, OpenAI. “The dialogue format makes it possible for ChatGPT to answer followup questions, admit its mistakes, challenge incorrect premises, and reject inappropriate requests.” (ChatGPT-4 is the more expensive advanced version used in medical settings.)
DEBILITATING pain and other symptoms in 12-year-old Alex (last name withheld for privacy)—with no diagnosis after three years and visits to 17 doctors—sent Alex’s mother searching via ChatGPT. “No matter how many doctors . . . specialists would only address their individual areas of expertise,” according to Today. As Alex’s mother put it: “Nobody will even give you a clue about what the diagnosis could be.”
At the heart of most medical mysteries is the search for a diagnosis: consulting one specialist after another before finding a doctor who can figure out the problem. That’s an obvious opportunity for ChatGPT-4 because it can access so much information and “can process our natural human language and generate a response,” according to the Lighthouse Guild. As with AI’s other uses in medicine—reading scans, educating medical students—the technology has risks and limitations and still requires physician involvement at almost every step.
In 2019, the difficulties of coordinating expertise from a range of specialists inspired the Netflix series Diagnosis, based on the New York Times column of the same name by Yale internist Lisa Sanders. The first episode described 10 years of debilitating symptoms, including fatigue, in the patient, Angel; the hypothesis, sent from a medical student in Turin, Italy, that advised metabolic gene testing; Angel’s trip to Italy for tests; and finally the diagnosis of a rare muscle-weakness syndrome, leading to successful treatment.
Muscle weakness was also the diagnosis for a new disease, named Mitchell syndrome for the 12-year-old athlete who had won a coveted slot for consideration by NIH’s Undiagnosed Diseases Network (UDN). Started in 2013, the UDN has accepted 1,500 out of 4,500 applicants, made diagnoses for 30% and, in the process, identified 50 new disorders.
“Why can’t I get a diagnosis?” was the desperate question DC-area 5K runner Nancy Chiancone typed into her computer. For Chiancone, the Undiagnosed Diseases Program (within the UDN) found an answer in just five weeks. But the NIH program, inaccessible to so many patients, has recently lost funding.
Diagnosis is the number one “pro” of medical uses for AI technology, according to FutureLearn. The technology’s ability to analyze data “much faster than humans are able to, and often more accurately . . . can help medical professionals reach a diagnosis a lot more quickly.”
High on the list of “cons,” in addition to security risks, are risks of inaccuracies. When struggling to find an answer, ChatGPT can hallucinate, or fabricate: for example, it may list research paper titles that sound real, along with authors who may have written about the topic, when these papers may not exist. Practitioners will need time to learn to recognize these errors.
In another example of unreliability, ChatGPT may come up with a different answer each time the same question is posed, because the technology produces “text from scratch by gleaning from its datasets and creating a new answer,” according to CNET. In this way, chatbots are different from traditional search engines like Google, “which aim to elevate the most relevant links.”
In addition, AI fails to consider “social, historical, and economic factors [that] can also influence the specific care an individual needs,” according to FutureLearn. “While AI is more than capable of allocating treatment based on the diagnosis, it isn’t yet capable of considering other social variables that may influence a medical professional’s decision.”
Training poses additional challenges—both training of medical staff to use the technology and training of the “AI tools themselves . . . with curated data in order to perform properly,” according to FutureLearn. However, for a different kind of training, that of medical students, ChatGPT-4 has proved especially useful. As Beth Israel Deaconess internist Adam Rodman told the New York Times, “Doctors are terrible at teaching other doctors how we think.”
Experienced doctors often use pattern recognition to make a diagnosis, explained Rodman: “In medicine, it’s called an illness script: signs, symptoms and test results that doctors put together to tell a coherent story based on similar cases they know about or have seen themselves. If the illness script doesn’t help, doctors turn to other strategies, like assigning probabilities to various diagnoses that might fit. [And ChatGPT-4 can] create something that is remarkably similar to an illness script.”
AI can also perform “self-supervised learning.” Retinal scans are used to detect eye disease and risks for other health conditions, including heart failure and Parkinson’s disease. According to Nature, the AI tool RETFound can analyze millions of retinal images to teach itself to detect abnormalities in small vessels that are linked to disease.
For medical scans (X-rays, MRIs), ChatGPT-4 can improve “the accuracy and efficiency of radiological diagnoses by reducing interpretation variability and errors,” according to one review. Because medical scan readings are notoriously both dependent on individual radiologists’ expertise and susceptible to scheduling delays, ChatGPT’s “ability to analyze and interpret medical images in real time can help reduce interpretation variability and errors, improve workflow efficiency, and ultimately enhance patient outcomes.”
One study reported in the Journal of the American Medical Association (JAMA) found ChatGPT-4 “did better than most doctors on weekly diagnostic challenges [that had been] published in The New England Journal of Medicine,” Rodman said. But researchers at Mass General Brigham who assessed “an entire clinical encounter with a patient [from first evaluation to] final diagnosis” determined ChatGPT’s success rate at 71.7%, “at the level of someone who has just graduated from medical school, such as an intern or resident.”
That 71% success rate, however, was based on comparisons with diagnoses made by the country’s top doctors, a very different challenge from evaluating patients who have no diagnosis at all. For the young patient Alex, Michigan pediatric neurosurgeon Holly Gilmer explained that his eventual diagnosis—tethered-cord syndrome, caused when spinal cord tissue has formed attachments to the spine—can be missed because young children have trouble describing their symptoms. But after Alex’s mother typed in every one of Alex’s medical evaluation reports, ChatGPT quickly found the right answer.
For me, a lifetime of reading medical mysteries—often endless tales of patients who suffer from terrible symptoms while frantically searching for a diagnosis—offered motivation to approach the complicated high-tech world of AI. After investigating uses of ChatGPT in medicine, I listened to a podcast of the Wired article “What OpenAI really wants,” and understood enough to find it fascinating, though also to worry about how much AI will succeed in truly improving medical care for patients everywhere.
—Mary Carpenter regularly reports on topical subjects in health and medicine.