Some doctors see LLMs as a boon for medical literacy. The typical patient might struggle to navigate the vast landscape of online medical information, and in particular to differentiate high-quality sources from polished but factually dubious websites. LLMs can do this job for them, at least in theory. Treating patients who had searched for their symptoms on Google required "a lot of attacking patient anxiety [and] reducing misinformation," says Marc Succi, an associate professor at Harvard Medical School and a practicing radiologist. But now, he says, "you see patients with a university education, a high school education, asking questions at the level of something an early med student might ask."
The release of ChatGPT Health, and Anthropic's subsequent announcement of new health integrations for Claude, indicate that the AI giants are increasingly willing to acknowledge and encourage health-related uses of their models. Such uses certainly come with risks, given LLMs' well-documented tendencies to agree with users and to make up information rather than admit ignorance.
But those risks must be weighed against the potential benefits. There's an analogy here to autonomous vehicles: When policymakers consider whether to allow Waymo in their city, the key metric is not whether its cars are ever involved in accidents but whether they cause less harm than the status quo of relying on human drivers. If Dr. ChatGPT is an improvement over Dr. Google (and early evidence suggests it might be), it could lessen the enormous burden of medical misinformation and unnecessary health anxiety that the web has created.
Pinning down the effectiveness of a chatbot such as ChatGPT or Claude for consumer health, however, is difficult. "It's exceedingly difficult to evaluate an open-ended chatbot," says Danielle Bitterman, the clinical lead for data science and AI at the Mass General Brigham health-care system. Large language models score well on medical licensing examinations, but those exams use multiple-choice questions that don't reflect how people use chatbots to look up medical information.
Sirisha Rambhatla, an assistant professor of management science and engineering at the University of Waterloo, attempted to close that gap by evaluating how GPT-4o responded to licensing exam questions when it didn't have access to a list of possible answers. Health workers who evaluated the responses scored only about half of them as entirely correct. But multiple-choice exam questions are designed to be tricky enough that the answer options don't give them away entirely, and they're still a fairly distant approximation of the sort of thing a user would type into ChatGPT.
A different study, which tested GPT-4o on more realistic prompts submitted by human volunteers, found that it answered medical questions appropriately about 85% of the time. When I spoke with Amulya Yadav, an associate professor at Pennsylvania State University who runs the Responsible AI for Social Emancipation Lab and led the study, he made it clear that he wasn't personally a fan of patient-facing medical LLMs. But he freely admits that, technically speaking, they appear up to the task; after all, he says, human doctors misdiagnose patients 10% to 15% of the time. "If I look at it dispassionately, it seems like the world is gonna change, whether I like it or not," he says.
For people seeking medical information online, Yadav says, LLMs do appear to be a better option than Google. Succi, the radiologist, also concluded that LLMs can be a better alternative to web search when he compared GPT-4's responses to questions about common chronic medical conditions with the information presented in Google's knowledge panel, the information box that sometimes appears on the right side of the search results.
Since Yadav's and Succi's studies appeared online, in the first half of 2025, OpenAI has released multiple new versions of GPT, and it's reasonable to expect that GPT-5.2 would perform even better than its predecessors. But the studies do have important limitations: They focus on straightforward, factual questions, and they examine only brief interactions between users and chatbots or web search tools. Some of the weaknesses of LLMs, most notably their sycophancy and tendency to hallucinate, might be more likely to rear their heads in more extensive conversations and with people who are dealing with more complex problems. Reeva Lederman, a professor at the University of Melbourne who studies technology and health, notes that patients who don't like the diagnosis or treatment recommendations they receive from a doctor might seek out another opinion from an LLM, and the LLM, if it's sycophantic, might encourage them to reject their doctor's advice.
Some studies have found that LLMs will hallucinate and exhibit sycophancy in response to health-related prompts. For example, one study showed that GPT-4 and GPT-4o will happily accept and run with incorrect drug information included in a user's query. In another, GPT-4o frequently concocted definitions for fake syndromes and lab tests mentioned in the user's prompt. Given the abundance of medically dubious diagnoses and treatments floating around the web, these patterns of LLM behavior could contribute to the spread of medical misinformation, particularly if people see LLMs as trustworthy.
OpenAI has reported that the GPT-5 series of models is markedly less sycophantic and less prone to hallucination than its predecessors, so the results of those studies may not apply to ChatGPT Health. The company also evaluated the model that powers ChatGPT Health on its responses to health-specific questions, using its publicly available HealthBench benchmark. HealthBench rewards models that express uncertainty when appropriate, recommend that users seek medical attention when necessary, and refrain from causing users unnecessary stress by telling them their condition is more serious than it actually is. It's reasonable to assume that the model underlying ChatGPT Health exhibited those behaviors in testing, though Bitterman notes that some of the prompts in HealthBench were generated by LLMs, not users, which could limit how well the benchmark translates to the real world.
