HealthBench
An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated against 48,562 unique rubric criteria written by 262 physicians, spanning a range of health contexts and behavioral dimensions.
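To make the rubric-based setup concrete, here is a minimal sketch of how such scoring typically works: each conversation is graded against its rubric criteria, each criterion carries a point value (negative points can penalize unsafe behavior), and the example score is the points achieved divided by the maximum achievable positive points, clipped to [0, 1]. The criterion names and point values below are illustrative, not taken from the dataset.

```python
def score_example(criteria, met):
    """Rubric-style scoring sketch.

    criteria: dict mapping criterion name -> point value (may be negative)
    met: set of criterion names the grader judged as met
    Returns achieved points / max achievable positive points, clipped to [0, 1].
    """
    achieved = sum(pts for name, pts in criteria.items() if name in met)
    max_points = sum(pts for pts in criteria.values() if pts > 0)
    return max(0.0, min(1.0, achieved / max_points))

# Hypothetical rubric for one conversation:
rubric = {
    "advises seeking emergency care": 8,
    "asks about symptom duration": 3,
    "gives a specific medication dose unprompted": -5,  # penalized if met
}

print(score_example(rubric, {"advises seeking emergency care"}))  # 8/11 ≈ 0.727
```

A model's overall benchmark score is then an aggregate of these per-example scores.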
Progress Over Time
[Interactive timeline of model performance on HealthBench over time, showing the state-of-the-art frontier and distinguishing open from proprietary models.]
HealthBench Leaderboard
4 models
| # | Organization | Params | Context | Cost (in / out) | License |
|---|---|---|---|---|---|
| 1 | Moonshot AI | 1.0T | — | — | — |
| 2 | OpenAI | 117B | 131K | $0.09 / $0.45 | — |
| 3 | OpenAI | — | 128K | $1.75 / $14.00 | — |
| 4 | OpenAI | 21B | 131K | $0.05 / $0.20 | — |
FAQ
Common questions about HealthBench
What is HealthBench?
An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated against 48,562 unique rubric criteria written by 262 physicians, spanning a range of health contexts and behavioral dimensions.
Where can I read the HealthBench paper?
The HealthBench paper is available at https://arxiv.org/abs/2505.08775. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How do models rank on HealthBench?
The HealthBench leaderboard ranks 4 AI models by their performance on this benchmark. Currently, Kimi K2-Thinking-0905 by Moonshot AI leads with a score of 0.580. The average score across all models is 0.530.
What is the highest HealthBench score?
The highest HealthBench score is 0.580, achieved by Kimi K2-Thinking-0905 from Moonshot AI.
How many models have been evaluated on HealthBench?
Four models have been evaluated on the HealthBench benchmark, with 0 verified results and 4 self-reported results.
What category does HealthBench fall under?
HealthBench is categorized under healthcare. The benchmark evaluates text models.