CheXpert CXR
CheXpert is a large dataset of 224,316 chest radiographs from 65,240 patients. It covers 14 medical observations whose labels, including explicit uncertainty labels, are extracted from radiology reports. It serves as a benchmark for developing and evaluating automated chest radiograph interpretation models.
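As a rough illustration of what the uncertainty labels mean for anyone training or scoring on the dataset, the sketch below maps a raw label to a binary target under a few simple policies. The numeric encoding (1.0 positive, 0.0 negative, -1.0 uncertain, blank for unmentioned) is assumed from the public CheXpert label files and is not stated on this page. The function name `resolve_label` and the policy names are hypothetical.

```python
from typing import Optional

# Assumed encoding from the public CheXpert label files (not stated on this page).
POSITIVE, NEGATIVE, UNCERTAIN = 1.0, 0.0, -1.0


def resolve_label(value: Optional[float], policy: str = "ones") -> Optional[float]:
    """Map one raw label to a binary target under a simple uncertainty policy.

    policy:
      "ones"   - treat uncertain as positive
      "zeros"  - treat uncertain as negative
      "ignore" - return None so the caller can mask the label out
    """
    if value is None:
        # Blank (unmentioned) observations are treated as negative here;
        # this is a common convention, not the only possible choice.
        return NEGATIVE
    if value == UNCERTAIN:
        if policy == "ones":
            return POSITIVE
        if policy == "zeros":
            return NEGATIVE
        if policy == "ignore":
            return None
        raise ValueError(f"unknown policy: {policy!r}")
    return value


# Example: one study with four of the 14 observations.
raw = {"Atelectasis": -1.0, "Cardiomegaly": 1.0, "Edema": None, "Consolidation": 0.0}
targets = {name: resolve_label(v, policy="ones") for name, v in raw.items()}
```

Which policy works best can differ per observation, so it is usually worth treating the choice as a tunable setting rather than a fixed rule.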
MedGemma 4B IT from Google currently leads the CheXpert CXR leaderboard with a score of 0.481 (48.1%). It is the only model evaluated so far.
CheXpert CXR Leaderboard
| Rank | Model | Organization | Size | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | MedGemma 4B IT | Google | 4B | 0.481 | — | — | |
More evaluations to explore
Related benchmarks in the same category
MMLU-Pro is a more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding the multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
MMLU (Massive Multitask Language Understanding) is a benchmark that tests knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.
Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.
MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. It contains 11.5K multimodal questions collected from college exams, quizzes, and textbooks, covering six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering) across 30 subjects and 183 subfields.
MMMU-Pro is a more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering out questions answerable from text alone, augmenting the candidate options, and introducing a vision-only input setting. Model performance is significantly lower (16.8-26.9%) than on the original MMMU, providing a more rigorous evaluation that closely mimics real-world scenarios.
MathVista evaluates mathematical reasoning of foundation models in visual contexts. It consists of 6,141 examples derived from 28 existing multimodal datasets and 3 newly created datasets (IQTest, FunctionQA, and PaperQA), combining challenges from diverse mathematical and visual tasks to assess models' ability to understand complex figures and perform rigorous reasoning.