HealthBench

An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated by 262 physicians against 48,562 unique rubric criteria spanning a range of health contexts and behavioral dimensions.
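The rubric-based evaluation described above can be sketched in a few lines. This is an illustrative simplification, not the official implementation: per the HealthBench paper, a grader marks each rubric criterion as met or unmet, criteria carry point values (negative for undesirable behaviors), and the example score is the points achieved divided by the total possible positive points, clipped to [0, 1]. The `Criterion` class and `example_score` function names are our own.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion (illustrative, not the official schema)."""
    points: int   # positive = desired behavior, negative = undesired
    met: bool     # grader's judgment for this response


def example_score(criteria: list[Criterion]) -> float:
    """Score one conversation: achieved points / possible positive points, clipped to [0, 1]."""
    achieved = sum(c.points for c in criteria if c.met)
    possible = sum(c.points for c in criteria if c.points > 0)
    if possible == 0:
        return 0.0
    return min(max(achieved / possible, 0.0), 1.0)


# Example: two desired criteria (one met) and one triggered penalty.
crits = [Criterion(5, True), Criterion(5, False), Criterion(-3, True)]
print(example_score(crits))  # (5 - 3) / 10 = 0.2
```

A model's overall benchmark score is then an aggregate of these per-example scores.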

Progress Over Time

[Interactive timeline showing how model performance on HealthBench has evolved over time, with a state-of-the-art frontier line and markers distinguishing open from proprietary models]

HealthBench Leaderboard

4 models
Rank  Parameters  Context  Cost (input / output, per 1M tokens)  License
1     1.0T        —        —                                     —
2     117B        131K     $0.09 / $0.45                         —
3     —           128K     $1.75 / $14.00                        —
4     21B         131K     $0.05 / $0.20                         —

FAQ

Common questions about HealthBench

What is HealthBench?
An open-source benchmark for measuring the performance and safety of large language models in healthcare. It consists of 5,000 multi-turn conversations evaluated by 262 physicians against 48,562 unique rubric criteria spanning a range of health contexts and behavioral dimensions.

Where can I read the HealthBench paper?
The HealthBench paper is available at https://arxiv.org/abs/2505.08775. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the HealthBench leaderboard?
The leaderboard ranks 4 AI models by their performance on the benchmark. Currently, Kimi K2-Thinking-0905 by Moonshot AI leads with a score of 0.580; the average score across all models is 0.530.

What is the highest HealthBench score?
The highest score is 0.580, achieved by Kimi K2-Thinking-0905 from Moonshot AI.

How many models have been evaluated?
4 models have been evaluated on HealthBench, with 0 verified results and 4 self-reported results.

What kind of benchmark is HealthBench?
HealthBench is categorized under healthcare and evaluates text models.