Microsoft has developed an AI-enabled diagnostic system called the Microsoft AI Diagnostic Orchestrator (MAI-DxO), which, according to a recent experiment, can diagnose complex medical cases with more than four times the accuracy of human doctors. When paired with OpenAI's o3 model, MAI-DxO achieved 80% diagnostic accuracy, compared with the 20% average of generalist physicians, while also reducing diagnostic costs by 20% relative to physicians and by 70% relative to using the o3 model alone. When configured for maximum accuracy, MAI-DxO reached 85.5%, and these performance improvements generalized across models from OpenAI as well as Gemini, Claude, Grok, DeepSeek, and Llama.
The Microsoft team evaluated MAI-DxO using 304 real-world case studies from the New England Journal of Medicine. The system not only correctly diagnosed 85.5% of the cases but also used fewer resources than a group of experienced physicians. In the study, 21 practicing physicians from the UK and U.S., each with five to twenty years of clinical experience, were given the same tasks and achieved a mean accuracy of just 20%. The researchers pointed out that while medical specialists are highly knowledgeable in their own areas, no single doctor can master every complex medical case, whereas an AI system can draw on knowledge from across medical disciplines simultaneously.
The team explained that the MAI-Dx Orchestrator effectively transforms any language model into a virtual panel of clinicians: it asks follow-up questions, orders tests, delivers diagnoses, checks costs, and verifies its own reasoning before deciding how to proceed. This advanced reasoning capability, the authors suggested, could fundamentally change how healthcare operates.
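To make that workflow concrete, the sketch below shows what such an orchestration loop might look like in Python. It is a minimal illustration, not Microsoft's implementation: the `ask_model` stub, the role names, the flat per-test cost, and the budget figure are all assumptions made purely for this example; a real system would call an actual model API and use genuine cost data.

```python
from dataclasses import dataclass, field


def ask_model(role: str, context: str) -> str:
    """Hypothetical stand-in for a role-prompted call to the underlying model (e.g. o3).

    A real orchestrator would send a role-specific prompt to the model's API;
    this stub only echoes the request so the sketch runs end to end.
    """
    return f"[{role}] response given: {context[:60]}"


@dataclass
class CaseState:
    presentation: str                              # initial patient presentation
    findings: list = field(default_factory=list)   # follow-up answers and test results so far
    spend: float = 0.0                             # assumed running cost of ordered tests, in USD


# Illustrative "virtual panel": each role mirrors one behaviour described above.
PANEL_ROLES = ["ask_followup_questions", "order_tests", "check_costs", "verify_reasoning"]

ASSUMED_TEST_COST = 150.0   # flat per-test cost, an assumption purely for this sketch


def orchestrate(case: CaseState, budget: float = 2000.0, rounds: int = 3) -> str:
    """Cycle through the panel roles a few times, then commit to a diagnosis."""
    for _ in range(rounds):
        context = f"{case.presentation} | " + "; ".join(case.findings)
        for role in PANEL_ROLES:
            if role == "order_tests":
                if case.spend + ASSUMED_TEST_COST > budget:
                    continue                        # skip a test that would exceed the assumed budget
                case.spend += ASSUMED_TEST_COST
            case.findings.append(ask_model(role, context))
    # After the panel rounds, ask the model to commit to a final diagnosis.
    return ask_model("final_diagnosis", "; ".join(case.findings))


if __name__ == "__main__":
    case = CaseState(presentation="Adult patient with fever, sore throat, and unilateral neck swelling")
    print(orchestrate(case))
    print(f"Simulated test spend: ${case.spend:.0f}")
```

The design idea the sketch tries to capture is that a single underlying language model is prompted in several distinct roles, so that cost control and self-verification happen inside the diagnostic loop rather than after a diagnosis has already been committed.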
However, the researchers acknowledged several limitations of their experiment. The case mix was unrealistic: it was based on complex, teaching-focused cases from the NEJM and did not include healthy individuals or mild conditions, leaving uncertainty about the AI's performance in routine, everyday scenarios and its tendency to produce false positives. The study also lacked real-world constraints such as patient discomfort, wait times, insurance limitations, test availability, and delays in receiving results. Additionally, the cost evaluation relied on simplified U.S. averages without accounting for regional, institutional, or payer differences. The comparison was also limited to internal medicine and primary care physicians, excluding specialists, and the participating doctors were not allowed to use internet resources, even though in practice physicians routinely consult guidelines, colleagues, and other tools.
Despite these limitations, the researchers noted that their findings indicate significant potential for accuracy gains, particularly for clinicians working in remote or under-resourced areas, and offer insight into how language models could augment medical expertise to improve outcomes even in well-resourced settings.