

Not All AI is Built to Diagnose

Kim Campo | March 9, 2026

Pictured: Milan Toma

Artificial intelligence (AI) is rapidly transforming healthcare. AI systems can now detect diabetic eye disease from retinal photos and analyze CT images for signs of early-stage lung cancers and stroke.

Right now, at hospitals across the country and throughout the world, specialized algorithms are quietly assisting physicians, prioritizing urgent scans and flagging subtle irregularities that might otherwise go unnoticed. These specialized AI tools, often trained on millions of precisely categorized medical images, are increasingly integrated into real clinical practice.

At the same time, another form of AI has captured the public’s attention: large language models (LLMs). These widely accessible systems, such as ChatGPT and Claude, can analyze both text and images. In theory, these capabilities should make them well-suited for medical tasks, but are general-use AI platforms reliable when it comes to medical diagnosis?

A new study led by College of Osteopathic Medicine (NYITCOM) Associate Professor Milan Toma, Ph.D., suggests otherwise. In the study, published in a scholarly journal, Toma and his co-authors, who include NYITCOM Senior Development Security Operations Engineer Mihir Matalia and medical student Sungjoon Hong, tested the reliability of some of the world’s most advanced multimodal LLMs (GPT-5, Gemini 3 Pro, Llama 4 Maverick, Grok 4, and Claude Opus 4.5 Extended).

The researchers provided each AI model with the same CT brain scan showing clear intracranial pathology. Then, they asked the models to analyze the image like a radiologist: identifying the imaging technique used, the location of the pathology in the brain, the primary diagnosis, key features, and potential alternative diagnoses. Overall, the findings revealed a 20 percent rate of fundamental diagnostic error across the AI models, along with concerning variability in interpretation and assessment.

The researchers used this CT brain scan, showing an ischemic stroke on the left side, as the standardized test case for all five AI models.

At first, the models produced promising results, with all five correctly identifying the image as a CT brain scan. Four models also detected the key finding: an ischemic stroke near the left middle cerebral artery. However, one made a fundamental error, misclassifying the stroke as a hemorrhage on the opposite side of the brain. In a real clinical setting, this error could significantly impact a patient’s health, as ischemic strokes and hemorrhagic strokes require different treatments.

Even among the four AI models that reached the correct diagnosis, explanations differed greatly. Some offered varying interpretations of when the stroke first occurred; others disagreed on alternative diagnoses, additional brain regions affected, and the presence of calcification. The researchers then introduced a twist: they asked each AI model to grade the others’ diagnostic explanations. This cross-evaluation exposed additional inconsistencies, with some models grading more harshly than others. One model even concluded that the findings showed chronic brain abnormalities rather than an acute stroke and, as a result, systematically penalized the others’ responses.

In recent years, Toma has published more than 30 peer-reviewed studies on AI in medical diagnostics and healthcare, as well as two books on the topic.

“Our research highlights a critical distinction in the AI landscape. Most successful medical AI tools are task-specific algorithms, trained on large datasets of labeled medical images and validated for very specific diagnostic tasks,” says Toma. “However, large language models are not optimized for diagnostics; they are built for linguistics and conversation. Accordingly, they generate explanations that sound authoritative, even when their underlying interpretation is wrong or inconsistent.”

Toma and his co-authors concluded that the future of healthcare AI will likely combine specialized diagnostic systems with language models. While LLMs may be useful for clinical documentation, summarizing reports, or communicating with patients, oversight from a medical expert remains non-negotiable for all diagnostic interpretations.
