HealthBench Consensus

HealthBench Consensus Leaderboard

1 model
Rank | Context | Cost | License
1 | 400K | $5.00 / $30.00 | (none listed)

About this benchmark

What is HealthBench Consensus?

HealthBench Consensus is a HealthBench subset focused on questions where physician-created rubric criteria have especially high agreement, measuring healthcare performance and safety on consensus-evaluable conversations.

HealthBench Consensus is a text benchmark evaluating models on healthcare tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the average and the leading score are both 0.947.

Current leaders

GPT-5.5 Instant from OpenAI currently leads the HealthBench Consensus leaderboard with a score of 0.947. It is the only model evaluated so far.

1. GPT-5.5 Instant (OpenAI): 94.7%

Source paper

Title
HealthBench: Evaluating Large Language Models Towards Improved Human Health
Authors
Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, and 8 others
Published
Abstract

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo's 16% to GPT-4o's 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.

FAQ

Common questions about the HealthBench Consensus benchmark and leaderboard.

What is the HealthBench Consensus benchmark?

HealthBench Consensus is a HealthBench subset focused on questions where physician-created rubric criteria have especially high agreement, measuring healthcare performance and safety on consensus-evaluable conversations.

What is the HealthBench Consensus leaderboard?

The HealthBench Consensus leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model, GPT-5.5 Instant by OpenAI, with a score of 0.947. Because it is the only entry, the average score is also 0.947.

What is the highest HealthBench Consensus score?

The highest HealthBench Consensus score is 0.947, achieved by GPT-5.5 Instant from OpenAI.

How many models are evaluated on HealthBench Consensus?

1 model has been evaluated on the HealthBench Consensus benchmark, with 0 verified results and 1 self-reported result.

Where can I find the HealthBench Consensus paper?

HealthBench Consensus is introduced in the HealthBench paper, available at https://arxiv.org/abs/2505.08775. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does HealthBench Consensus cover?

HealthBench Consensus is categorized under healthcare. The benchmark evaluates text models.

What's the difference between HealthBench Consensus and HealthBench?

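HealthBench Consensus is a variant of HealthBench that, per the source paper, includes 34 particularly important dimensions of model behavior validated via physician consensus. See the HealthBench leaderboard for the broader benchmark and per-model comparison.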

Which model offers the best value on HealthBench Consensus?

Among models scoring within 10% of the leader, the cheapest is GPT-5.5 Instant from OpenAI, at $5.00 per million input tokens with a score of 0.947. Since it is the only model evaluated, it is the best value by default.
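
The value rule in this answer (keep models within 10% of the leader, then pick the cheapest) can be sketched in a few lines of Python. This is only an illustration, not the leaderboard's actual implementation: the `Entry` type, the `best_value` function, and the reading of "within 10%" as relative to the leader's score are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Entry:
    """Hypothetical leaderboard row (names are illustrative only)."""
    model: str
    score: float        # benchmark score on the 0-1 scale
    input_price: float  # USD per million input tokens


def best_value(entries: Sequence[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest entry whose score is within `tolerance` of the leader."""
    if not entries:
        return None
    top = max(e.score for e in entries)
    # Assumption: "within 10%" is relative, i.e. score >= 90% of the top score.
    eligible = [e for e in entries if e.score >= top * (1 - tolerance)]
    # Cheapest first; ties on price go to the higher score.
    return min(eligible, key=lambda e: (e.input_price, -e.score))


# The single row currently on this leaderboard.
leaderboard = [Entry("GPT-5.5 Instant", 0.947, 5.00)]
print(best_value(leaderboard).model)
```

With one entry the rule trivially returns that entry. It only becomes informative once several models fall inside the tolerance band.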

How recent are the HealthBench Consensus leaderboard results?

The HealthBench Consensus leaderboard was last updated in June 2026 and currently includes 1 evaluated model.