HealthBench

An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated by 262 physicians against 48,562 unique rubric criteria spanning a range of health contexts and behavioral dimensions.
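The rubric-based evaluation described above can be sketched in a few lines. This is an illustrative simplification, not the official implementation: per the HealthBench paper, a grader marks each rubric criterion as met or unmet, criteria carry point values (negative for undesirable behaviors), and the example score is the points achieved divided by the total possible positive points, clipped to [0, 1]. The `Criterion` class and `example_score` function names are our own.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (illustrative, not the official schema)."""
    points: int   # positive = desired behavior, negative = undesired
    met: bool     # grader's judgment for this response


def example_score(criteria: list[Criterion]) -> float:
    """Score one conversation: achieved points / possible positive points, clipped to [0, 1]."""
    achieved = sum(c.points for c in criteria if c.met)
    possible = sum(c.points for c in criteria if c.points > 0)
    if possible == 0:
        return 0.0
    return min(max(achieved / possible, 0.0), 1.0)


# Example: two desired criteria (one met) and one triggered penalty.
crits = [Criterion(5, True), Criterion(5, False), Criterion(-3, True)]
print(example_score(crits))  # (5 - 3) / 10 = 0.2
```

A model's overall benchmark score is then an aggregate of these per-example scores.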

Progress Over Time

[Interactive timeline showing how model performance on HealthBench has evolved over time, with a state-of-the-art frontier line and markers distinguishing open from proprietary models]

HealthBench Leaderboard

4 models
Rank  Parameters  Context  Cost (input / output, per 1M tokens)  License
1     1.0T        —        —                                     —
2     117B        131K     $0.09 / $0.45                         —
3     —           128K     $1.75 / $14.00                        —
4     21B         131K     $0.05 / $0.20                         —

FAQ

Common questions about HealthBench

What is HealthBench?
An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated by 262 physicians against 48,562 unique rubric criteria spanning a range of health contexts and behavioral dimensions.

Where can I read the HealthBench paper?
The HealthBench paper is available at https://arxiv.org/abs/2505.08775. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the HealthBench leaderboard?
The leaderboard ranks 4 AI models by their performance on the benchmark. Currently, Kimi K2-Thinking-0905 by Moonshot AI leads with a score of 0.580; the average score across all models is 0.530.

What is the highest HealthBench score?
The highest score is 0.580, achieved by Kimi K2-Thinking-0905 from Moonshot AI.

How many models have been evaluated?
4 models have been evaluated on HealthBench, with 0 verified results and 4 self-reported results.

What kind of benchmark is HealthBench?
HealthBench is categorized under healthcare and evaluates text models.