Singhal, the OpenAI health lead, notes that the company’s current GPT-5 series of models, which had not yet been released when the original HealthBench study was conducted, do a much better job of soliciting additional information than their predecessors. Nevertheless, OpenAI has reported that GPT-5.4, the current flagship, is actually worse at seeking context than GPT-5.2, an earlier version.
Ideally, Bean says, health chatbots would be subjected to controlled tests with human users, as they were in his study, before being released to the general public. That would be a heavy lift, particularly given how fast the AI world moves and how long human studies can take. Bean’s own study used GPT-4o, which came out almost a year ago and is now outdated.
Earlier this month, Google released a study that meets Bean’s standards. In the study, patients discussed medical concerns with the company’s Articulate Medical Intelligence Explorer (AMIE), a medical LLM chatbot that is not yet available to the public, before meeting with a human physician. Overall, AMIE’s diagnoses were just as accurate as physicians’, and none of the conversations raised major safety concerns for researchers.
Despite the encouraging results, Google isn’t planning to release AMIE anytime soon. “While the research has advanced, there are significant limitations that have to be addressed before real-world translation of systems for diagnosis and treatment, including further research into equity, fairness, and safety testing,” wrote Alan Karthikesalingam, a research scientist at Google DeepMind, in an email. Google did recently reveal that Health100, a health platform it’s building in partnership with CVS, will include an AI assistant powered by its flagship Gemini models, though that tool will presumably not be intended for diagnosis or treatment.
Rodman, who led the AMIE study with Karthikesalingam, doesn’t think such extensive, multiyear studies are necessarily the best approach for chatbots like ChatGPT Health and Copilot Health. “There’s a lot of reasons that the clinical trial paradigm doesn’t always work in generative AI,” he says. “And that’s where this benchmarking conversation comes in. Are there benchmarks [from] a trusted third party that we can agree are meaningful, that the labs can hold themselves to?”
The key there is “third party.” No matter how extensively companies evaluate their own products, it’s tough to trust their conclusions completely. Not only does a third-party evaluation bring impartiality, but when there are many third parties involved, it also helps protect against blind spots.
OpenAI’s Singhal says he’s strongly in favor of external evaluation. “We try our best to support the community,” he says. “Part of why we put out HealthBench was actually to give the community and other model developers an example of what a good evaluation looks like.”
Given how expensive it is to produce a high-quality evaluation, he says, he’s skeptical that any individual academic laboratory would be able to produce what he calls “the one evaluation to rule them all.” But he does speak highly of efforts that academic groups have made to bring preexisting and novel evaluations together into comprehensive evaluation suites, such as Stanford’s MedHELM framework, which tests models on a wide range of medical tasks. Currently, OpenAI’s GPT-5 holds the top MedHELM score.
