Imagine going to the doctor with a baffling set of symptoms. Getting the proper diagnosis quickly is crucial, but sometimes even experienced physicians face challenges piecing together the puzzle. Sometimes it won’t be anything serious at all; other times a deep investigation might be required. No wonder AI systems are making progress here, as we have already seen them assisting more and more on tasks that require reasoning over documented patterns. But Google seems to have just taken a very strong leap in the direction of making “AI doctors” actually happen.
AI’s “intromission” into medicine isn’t entirely new; algorithms (including many AI-based ones) have been aiding clinicians and researchers in tasks such as image analysis for years. More recently we saw anecdotal and also some documented evidence that AI systems, particularly Large Language Models (LLMs), can assist doctors in their diagnoses, with some claims of nearly comparable accuracy. But this case is different, because the new work from Google Research introduces an LLM specifically trained on datasets relating clinical observations to diagnoses. While this is just a starting point and many challenges and considerations lie ahead, as I’ll discuss, the fact is clear: a powerful new AI-powered player is entering the world of medical diagnosis, and we had better get prepared for it. In this article I’ll mainly focus on how this new system works, calling out along the way various considerations that arise, some discussed in Google’s paper in Nature and others debated in the relevant communities: medical doctors, insurance companies, policy makers, etc.
Meet Google’s Superb New AI System for Medical Diagnosis
The advent of sophisticated LLMs, which as you surely know are AI systems trained on vast datasets to “understand” and generate human-like text, represents a considerable shift of gears in how we process, analyze, condense, and generate information (at the end of this article I have listed some other articles related to all that; go check them out!). The newest models in particular bring a new capability: engaging in nuanced, text-based reasoning and conversation, making them potential partners in complex cognitive tasks like diagnosis. In fact, the new work from Google that I discuss here is “just” another point in a rapidly growing field exploring how these advanced AI tools can understand and contribute to clinical workflows.
The study we’re looking into here was published in peer-reviewed form in the prestigious journal Nature, sending ripples through the medical community. In their article “Towards accurate differential diagnosis with large language models”, Google Research presents a specialized kind of LLM called AMIE (for Articulate Medical Intelligence Explorer), trained specifically on clinical data with the goal of assisting medical diagnosis or even running fully autonomously. The authors of the study tested AMIE’s ability to generate a list of possible diagnoses (what doctors call a “differential diagnosis”) for hundreds of complex, real-world medical cases published as difficult case reports.
Here’s the paper with full technical details:
https://www.nature.com/articles/s41586-025-08869-4
The Surprising Results
The findings were striking. When AMIE worked alone, just analyzing the text of the case reports, its diagnostic accuracy was significantly higher than that of experienced physicians working without assistance! AMIE included the correct diagnosis in its top-10 list almost 60% of the time, compared to about 34% for the unassisted doctors.
Very intriguingly, and in favor of the AI system, AMIE alone slightly outperformed doctors who were assisted by AMIE itself! While doctors using AMIE improved their accuracy significantly compared to using standard tools like Google searches (reaching over 51% accuracy), the AI on its own still edged them out slightly on this specific metric for these difficult cases.
Another “point of awe” for me is that in this study comparing AMIE to human experts, the AI system only analyzed the text-based descriptions from the case reports used to test it. However, the human clinicians had access to the full reports, that is, the same text descriptions available to AMIE plus images (like X-rays or pathology slides) and tables (like lab results). The fact that AMIE outperformed unassisted clinicians even without this multimodal information is on one hand remarkable, and on the other hand underscores an obvious area for future development: integrating and reasoning over multiple data types (text, imaging, possibly also raw genomics and sensor data) is a key frontier for medical AI to truly mirror comprehensive clinical assessment.
AMIE as a Super-Specialized LLM
So, how does an AI like AMIE achieve such impressive results, performing better than human experts, some of whom may have spent years diagnosing diseases?
At its core, AMIE builds upon the foundational technology of LLMs, similar to models like GPT-4 or Google’s own Gemini. However, AMIE isn’t just a general-purpose chatbot with medical knowledge layered on top. It was specifically optimized for clinical diagnostic reasoning. As described in more detail in the Nature paper, this involved:
- Specialized training data: Fine-tuning the base LLM on an enormous corpus of medical literature that includes diagnoses.
- Instruction tuning: Training the model to follow specific instructions related to generating differential diagnoses, explaining its reasoning, and interacting helpfully within a clinical context.
- Reinforcement Learning from Human Feedback: Potentially using feedback from clinicians to further refine the model’s responses for accuracy, safety, and helpfulness.
- Reasoning enhancement: Techniques designed to improve the model’s ability to logically connect symptoms, history, and potential conditions, similar to those used during the reasoning steps in very powerful models such as Google’s own Gemini 2.5 Pro!
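To make the instruction-tuning idea above a bit more concrete, here is a minimal sketch of how one could fine-tune a base causal LLM on case-to-differential pairs. This is my own illustration, not Google’s code: the model name, prompt format, and toy data are placeholder assumptions, and the real AMIE pipeline is far more elaborate than this.

```python
# A minimal, hypothetical sketch (not Google's code) of instruction tuning for
# differential diagnosis: each example pairs a case-style prompt with the target
# ranked differential, and a causal LM is fine-tuned on the combined text.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

MODEL_NAME = "some-base-llm"  # placeholder; any causal LM checkpoint would do here

def build_example(case_text: str, differential: list[str]) -> str:
    """Format one supervised example: instruction + case, then the target list."""
    target = "\n".join(f"{i + 1}. {dx}" for i, dx in enumerate(differential))
    return (
        "Instruction: Read the case and produce a ranked differential diagnosis.\n"
        f"Case: {case_text}\n"
        f"Differential:\n{target}"
    )

class DiagnosisDataset(torch.utils.data.Dataset):
    """Tokenizes formatted case/differential pairs for standard causal-LM training."""
    def __init__(self, cases, differentials, tokenizer, max_len=1024):
        texts = [build_example(c, d) for c, d in zip(cases, differentials)]
        # Note: GPT-style tokenizers may need tokenizer.pad_token = tokenizer.eos_token
        self.enc = tokenizer(texts, truncation=True, max_length=max_len,
                             padding="max_length", return_tensors="pt")

    def __len__(self):
        return self.enc["input_ids"].shape[0]

    def __getitem__(self, idx):
        ids = self.enc["input_ids"][idx]
        return {"input_ids": ids,
                "attention_mask": self.enc["attention_mask"][idx],
                "labels": ids.clone()}  # next-token prediction over the full text

# Illustrative usage (commented out; needs a real checkpoint and a GPU to run):
# tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
# ds = DiagnosisDataset(["45-year-old with fever and a spreading rash ..."],
#                       [["Measles", "Drug reaction", "Scarlet fever"]], tokenizer)
# Trainer(model=model,
#         args=TrainingArguments(output_dir="out", num_train_epochs=1),
#         train_dataset=ds).train()
```

On top of such supervised tuning would come the instruction-following, feedback, and reasoning stages listed above, which is precisely where the domain-specific gains reported in the paper come from.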
Note that the paper itself indicates that AMIE outperformed GPT-4 on automated evaluations for this task, highlighting the benefits of domain-specific optimization. Notably too, and on the negative side, the paper doesn’t compare AMIE’s performance against other general LLMs, not even Google’s own “smart” models like Gemini 2.5 Pro. That’s quite disappointing, and I can’t understand how the reviewers of this paper missed it!
Importantly, AMIE’s implementation is designed to support interactive usage, so that clinicians can ask it questions to probe its reasoning, a key difference from traditional diagnostic systems.
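To illustrate what that interactive mode might look like in practice, here is a purely hypothetical sketch of a clinician probing a diagnostic LLM in a back-and-forth loop. The ask_model() function is a made-up stand-in; there is no public AMIE API, so this only shows the shape of the interaction, not real integration code.

```python
# Hypothetical sketch of a clinician probing a diagnostic LLM interactively.
def ask_model(conversation: list[dict]) -> str:
    # Placeholder: a real system would call the diagnostic LLM backend here.
    raise NotImplementedError("plug in an actual model endpoint")

def clinician_session(case_text: str) -> None:
    conversation = [
        {"role": "system", "content": "You are a diagnostic assistant. "
                                      "Explain your reasoning when asked."},
        {"role": "user", "content": f"Case: {case_text}\nGive a ranked differential."},
    ]
    while True:
        reply = ask_model(conversation)
        conversation.append({"role": "assistant", "content": reply})
        print(reply)
        follow_up = input("Follow-up question (blank to stop): ").strip()
        if not follow_up:
            break
        conversation.append({"role": "user", "content": follow_up})
```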
Measuring Performance
Measuring performance and accuracy of the produced diagnoses isn’t trivial, and it should be interesting for you, reader with a Data Science mindset. In their work, the researchers didn’t just assess AMIE in isolation; rather, they employed a randomized controlled setup in which AMIE was compared against unassisted clinicians, clinicians assisted by standard search tools (like Google, PubMed, etc.), and clinicians assisted by AMIE itself (who could also use search tools, though they did so less often).
The evaluation of the data produced in the study involved multiple metrics beyond simple accuracy, most notably top-n accuracy (which asks: was the correct diagnosis in the top 1, 3, 5, or 10?), quality scores (how close was the list to the final diagnosis?), appropriateness, and comprehensiveness, with the latter two rated by independent specialist physicians blinded to the source of the diagnostic lists.
This broad evaluation provides a more robust picture than a single accuracy number, and the comparison against both unassisted performance and standard tools helps quantify the actual added value of the AI.
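As a quick illustration of the top-n metric mentioned above, the snippet below checks whether the correct diagnosis appears among the first n entries of each ranked differential. The naive exact-string matching is my own simplification; the study relied on expert raters and more tolerant automated matching.

```python
# Minimal sketch of top-n accuracy for ranked differential diagnoses.
def top_n_accuracy(differentials: list[list[str]], truths: list[str], n: int) -> float:
    """Fraction of cases whose true diagnosis appears in the first n predictions."""
    hits = sum(
        any(truth.lower() == dx.lower() for dx in ddx[:n])
        for ddx, truth in zip(differentials, truths)
    )
    return hits / len(truths)

# Toy example: one correct diagnosis ranked third, one missed entirely.
preds = [["Sarcoidosis", "Tuberculosis", "Lymphoma"],
         ["Gout", "Septic arthritis", "Pseudogout"]]
gold = ["Lymphoma", "Lyme arthritis"]
print(top_n_accuracy(preds, gold, n=3))   # 0.5
print(top_n_accuracy(preds, gold, n=10))  # 0.5
```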
Why Does AI Do So Well at Diagnosis?
Like other specialized medical AIs, AMIE was trained on vast amounts of medical literature, case studies, and clinical data. These systems can process complex information, identify patterns, and recall obscure conditions far faster and more comprehensively than a human brain juggling countless other tasks. AMIE, in particular, was specifically optimized for the kind of reasoning doctors use when diagnosing, akin to other reasoning models but in this case specialized for diagnosis.
For the particularly tough “diagnostic puzzles” used in the study (sourced from the prestigious New England Journal of Medicine), AMIE’s ability to sift through possibilities without human biases might give it an edge. As an observer noted in the vast discussion that this paper triggered over social media, it is impressive that the AI excelled not only on easy cases but also on some quite difficult ones.
AI Alone vs. AI + Doctor
The finding that AMIE alone slightly outperformed the AMIE-assisted human experts is puzzling. Logically, adding a skilled doctor’s judgment to a powerful AI should yield the best results (as previous studies with other AI tools have in fact shown). And indeed, doctors with AMIE did significantly better than doctors without it, producing more comprehensive and accurate diagnostic lists. But AMIE alone still worked slightly better than doctors assisted by it.
Why the slight edge for the AI alone in this study? As highlighted by some health workers on social media, this small difference probably doesn’t mean that doctors make the AI worse, or the other way around. Instead, it probably suggests that, not being accustomed to the system, the doctors haven’t yet figured out the best way to collaborate with AI systems that possess more raw analytical power than humans for specific tasks and goals, just like we don’t interact perfectly with a regular LLM when we need its help.
Again closely paralleling how we interact with regular LLMs, it may well be that doctors initially stick too closely to their own ideas (an “anchoring bias”), or that they don’t yet know how to best “interrogate” the AI to get the most useful insights. It’s all a new kind of teamwork we need to learn: human with machine.
Hold On — Is AI Replacing Doctors Tomorrow?
Absolutely not, of course. And it’s crucial to understand the limitations:
- Diagnostic “puzzles” vs. real patients: The study presenting AMIE used written case reports, that is, condensed, pre-packaged information, very different from the raw inputs doctors work with during their interactions with patients. Real medicine involves talking to patients, understanding their history, performing physical exams, interpreting non-verbal cues, building trust, and managing ongoing care: things AI cannot do, at least not yet. Medicine also involves human connection, empathy, and navigating uncertainty, not just processing data. Think for instance of placebo effects, phantom pain, physical tests, etc.
- AI isn’t perfect: LLMs can still make mistakes or “hallucinate” information, a major problem. So even if AMIE were to be deployed (which it won’t be, for now!), it would need very close oversight from expert professionals.
- This is just one specific task: Generating a diagnostic list is only one part of a physician’s job, and the rest of a visit to the doctor of course has many other components and stages, none of them handled by such a specialized system and potentially very difficult to achieve, for the reasons discussed.
Back-to-Back: Towards conversational diagnostic artificial intelligence
Even more surprisingly, in the same issue of Nature, right after the article on AMIE, Google Research published another paper showing that in diagnostic conversations (that is, not just the evaluation of symptoms but actual dialogue between the patient and the doctor or AMIE) the model ALSO outperforms physicians! Thus, somehow, while the first paper found objectively better diagnoses by AMIE, the second paper shows better communication of the results with the patient (in terms of quality and empathy) by the AI system!
And the results aren’t by a small margin: in 159 simulated cases, specialist physicians rated the AI superior to primary care physicians on 30 out of 32 metrics, while test patients preferred AMIE on 25 of 26 measures.
This second paper is here:
https://www.nature.com/articles/s41586-025-08866-7
Seriously: Medical Associations Must Pay Attention NOW
Despite the many limitations, this study and others like it are a loud wake-up call. Specialized AI is rapidly evolving and demonstrating capabilities that can augment, and in some narrow tasks even surpass, human experts.
Medical associations, licensing boards, educational institutions, policy makers, insurers, and, why not, everybody in this world who might potentially be the subject of an AI-based health investigation, need to get acquainted with this, and the topic must be placed high on the agenda of governments.
AI tools like AMIE and future ones could help doctors diagnose complex conditions faster and more accurately, potentially improving patient outcomes, especially in areas lacking specialist expertise. They could also help to quickly diagnose and dismiss healthy or low-risk patients, reducing the burden on doctors who must evaluate more serious cases. Of course, all this could improve the chances of solving health issues for patients with more complex problems, while at the same time lowering costs and waiting times.
Like in many other fields, the role of the physician will evolve, ultimately thanks to AI. Perhaps AI could handle more of the initial diagnostic heavy lifting, freeing up doctors for patient interaction, complex decision-making, and treatment planning, potentially also easing burnout from excessive paperwork and rushed appointments, as some hope. As someone noted in the social media discussions of this paper, not every doctor finds it pleasant to see four or more patients an hour while doing all the associated paperwork.
In order to move forward with the imminent application of systems like AMIE, we need guidelines. How should these tools be integrated safely and ethically? How do we ensure patient safety and avoid over-reliance? Who is responsible when an AI-assisted diagnosis is wrong? Nobody has clear, consensus answers to these questions yet.
Of course, then, doctors must be trained on how to use these tools effectively, understanding their strengths and weaknesses, and learning what will essentially be a new kind of human-AI collaboration. This development will have to happen with medical professionals on board, not by imposing it on them.
Last, since it always comes back to the table: how do we ensure these powerful tools don’t worsen existing health disparities but instead help bridge gaps in access to expertise?
Conclusion
The goal isn’t to replace doctors but to empower them. Clearly, AI systems like AMIE offer incredible potential as highly knowledgeable assistants, in everyday medicine and particularly in complex settings such as disaster areas, pandemics, or remote and isolated places like ships at sea, spaceships, or extraterrestrial colonies. But realizing that potential safely and effectively requires the medical community to engage proactively, critically, and urgently with this rapidly advancing technology. The future of diagnosis is likely AI-collaborative, so we need to start figuring out the rules of engagement today.
References
The article presenting AMIE:
Towards accurate differential diagnosis with large language models
And here the results of AMIE’s evaluation by test patients:
Towards conversational diagnostic artificial intelligence