ChatGPT Bot Flunks Gastroenterology Exam

Diana Swift

May 23, 2023

ChatGPT, a popular artificial intelligence language-processing model, failed a gastroenterology self-assessment test several times in a recent study.

Versions 3 and 4 of the chatbot scored only 65% and 62%, respectively, on the American College of Gastroenterology (ACG) Self-Assessment Test. The minimum passing grade is 70%.


"You might expect a physician to score 99%, or at least 95%," lead author Arvind J. Trindade, MD, regional director of endoscopy at Northwell Health (Central Region) in New Hyde Park, New York, told Medscape Medical News in an interview.

The study was published online May 22 in the American Journal of Gastroenterology.

Trindade and colleagues undertook the study amid growing reports of students using the tool across many academic areas, including law and medicine, and growing interest in the chatbot's potential in medical education.

"I saw gastroenterology students typing questions into it. I wanted to know how accurate it was in gastroenterology — if it was going to be used in medical education and patient care," said Trindade, who is also an associate professor at Feinstein Institutes for Medical Research in Manhasset, New York. "Based on our research, ChatGPT should not be used for medical education in gastroenterology at this time, and it has a way to go before it should be implemented into the healthcare field."

Poor Showing

The researchers tested the two versions of ChatGPT on both the 2021 and 2022 online ACG Self-Assessment Test, a multiple-choice exam designed to gauge how well a trainee would do on the American Board of Internal Medicine Gastroenterology board examination.

Questions that involved image selection were excluded from the study. For those that remained, the questions and answer choices were copied and pasted directly into ChatGPT, which returned answers and explanations. The corresponding answer was selected on the ACG website based on the chatbot's response.

Of the 455 questions posed, ChatGPT-3 correctly answered 296, and ChatGPT-4 got 284 right. There was no discernible pattern in the type of question that the chatbot answered incorrectly; the errors spanned questions on surveillance timing for various disease states, diagnosis, and pharmaceutical regimens.
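As a rough check on the arithmetic, the reported counts can be converted to percentages and compared with the 70% passing grade. The sketch below is illustrative only: the variable and function names are hypothetical, and it assumes the 70% threshold applies to the 455 questions that remained after image-based questions were excluded, which the article does not state explicitly.

```python
# Counts reported in the article; names here are illustrative, not from the study.
TOTAL_QUESTIONS = 455
PASSING_PERCENT = 70  # minimum passing grade on the ACG Self-Assessment Test
CORRECT = {"ChatGPT-3": 296, "ChatGPT-4": 284}


def score_percent(n_correct: int, total: int = TOTAL_QUESTIONS) -> float:
    """Return the share of correctly answered questions as a percentage."""
    return 100.0 * n_correct / total


def min_correct_to_pass(total: int = TOTAL_QUESTIONS,
                        passing_percent: int = PASSING_PERCENT) -> int:
    """Smallest number of correct answers that reaches the passing percentage.

    Uses integer ceiling division to avoid floating-point rounding.
    Assumption: the threshold is applied to the included questions only.
    """
    return (passing_percent * total + 99) // 100


if __name__ == "__main__":
    needed = min_correct_to_pass()
    for version, n_correct in CORRECT.items():
        pct = score_percent(n_correct)
        shortfall = needed - n_correct
        print(f"{version}: {n_correct}/{TOTAL_QUESTIONS} = {pct:.1f}% "
              f"({shortfall} answers short of {needed})")
```

Under that assumption, 319 correct answers would have been needed to pass, so the two versions fell 23 and 35 answers short, respectively.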

The reasons for the tool's poor performance could lie with the large language model underpinning ChatGPT, the researchers write. The model was trained on freely available information — not specifically on medical literature and not on materials that require paid journal subscriptions — to be a general-purpose interactive program.

Additionally, the chatbot may use information from a variety of sources, including non- or quasi-medical sources, or out-of-date sources, which can lead to errors, they note. ChatGPT-3 was last updated in June 2021 and ChatGPT-4 in September 2021.

"ChatGPT does not have an intrinsic understanding of an issue," Trindade said. "Its basic function is to predict the next word in a string of text to produce an expected response, regardless of whether such a response is factually correct or not."

Previous Research

In a previous study, ChatGPT was able to pass parts of the US Medical Licensing Examination (USMLE).

The chatbot may have performed better on the USMLE because the information tested on the exam may have been more widely available for ChatGPT's language training, Trindade said. "In addition, the threshold for passing [the USMLE] is lower with regard to the percentage of questions correctly answered," he said.

ChatGPT seems to fare better at helping to inform patients than it does on medical exams. The chatbot provided generally satisfactory answers to common patient queries about colonoscopy in one study and about hepatocellular carcinoma and liver cirrhosis in another study.

For ChatGPT to be valuable in medical education, "future versions would need to be updated with medical resources such as journal articles, society guidelines, and medical databases, such as UpToDate," Trindade said. "With directed medical training in gastroenterology, it may be a future tool for education or patient use in this field, but not currently as it is now. Before it can be used in gastroenterology, it should be validated."

That said, he noted, medical education has evolved from being based on textbooks and print journals to include internet-based journal data and practice guidelines on specialty websites. If properly primed, resources such as ChatGPT may be the next logical step.

This study received no funding. Trindade is a consultant for Pentax Medical, Boston Scientific, Lucid Diagnostics, and Exact Sciences and receives research support from Lucid Diagnostics.

Am J Gastroenterol. Published online May 22, 2023.

Diana Swift is a freelance medical journalist based in Toronto.
