OpenAI has unveiled HealthBench, a comprehensive benchmark for evaluating artificial intelligence models in healthcare settings using realistic conversations and physician judgment. The benchmark comprises 5,000 simulated medical interactions that test how AI systems respond to healthcare queries.
"The 5,000 conversations in HealthBench simulate interactions between AI models and individual users or clinicians. The task for a model is to provide the best possible response to the user's last message," OpenAI stated in their announcement.
The benchmark was developed in collaboration with 262 physicians from 60 countries, representing 26 medical specialties and proficiency in 49 languages. Each conversation in the dataset comes with a physician-created rubric of specific criteria for evaluating AI responses, 48,562 unique rubric criteria in total.
OpenAI explained that these conversations were created through "synthetic generation and human adversarial testing" and span various medical contexts. The evaluation process is rigorous: "Every model response is graded against a set of physician-written rubric criteria specific to that conversation," with each criterion weighted according to physician judgment of its importance.
The benchmark uses GPT-4.1 to score responses against these criteria, generating an overall score that reflects how well the model meets physician expectations. HealthBench covers seven core themes: expertise-tailored communication, response depth, emergency referrals, health data tasks, global health, responding under uncertainty, and context seeking.
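To make the grading mechanism concrete, the sketch below shows one plausible way rubric-based scoring of this kind could be implemented: each conversation carries physician-written criteria with importance weights, a grader model judges whether each criterion is met, and the overall score is the weighted fraction of criteria satisfied. All class, field, and function names here are illustrative assumptions, not OpenAI's actual schema or grading code, and the published methodology may differ in detail.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    description: str   # physician-written criterion, e.g. "recommends emergency care"
    weight: float      # physician-judged importance of this criterion


@dataclass
class HealthConversation:
    messages: list[dict]            # prior turns between the user/clinician and the model
    rubric: list[RubricCriterion]   # criteria specific to this conversation


def judge_criterion(response: str, criterion: RubricCriterion) -> bool:
    """Stand-in for the grader-model call (OpenAI reports using GPT-4.1 as the grader).

    A real evaluation would prompt the grader model to decide whether the response
    satisfies the criterion; here a crude substring check stands in for that call.
    """
    return criterion.description.lower() in response.lower()


def score_response(response: str, conversation: HealthConversation) -> float:
    """Weighted fraction of rubric criteria the response satisfies."""
    total = sum(c.weight for c in conversation.rubric)
    earned = sum(c.weight for c in conversation.rubric if judge_criterion(response, c))
    return earned / total if total else 0.0


# Toy example: a two-criterion rubric where only the higher-weighted criterion is met.
convo = HealthConversation(
    messages=[{"role": "user", "content": "I have crushing chest pain and shortness of breath."}],
    rubric=[
        RubricCriterion("call emergency services", weight=10.0),
        RubricCriterion("ask about symptom onset", weight=4.0),
    ],
)
print(score_response("You should call emergency services immediately.", convo))  # ~0.71
```

The per-conversation score could then be averaged across the dataset, or within each of the seven themes, to produce the aggregate numbers OpenAI describes.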
"Evaluations like HealthBench are part of our ongoing efforts to understand model behavior in high-impact settings and help ensure progress is directed toward real-world benefit," OpenAI noted. Their initial findings indicate that while large language models have improved significantly and can outperform experts in some benchmark examples, "even the most advanced systems still have substantial room for improvement," particularly in seeking context for unclear queries.
The benchmark and its evaluation tooling have been made publicly available on GitHub to support broader research and development in healthcare AI safety.
This development comes amid OpenAI CEO Sam Altman's involvement in Project Stargate, a $500 billion initiative focused on AI infrastructure announced earlier this year by President Donald Trump. However, recent reports suggest the project faces delays due to economic uncertainty and tariff concerns affecting data center costs.