GPQA Chemistry

Name: GPQA Chemistry Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

Progress Over Time

[Interactive timeline of model performance on GPQA Chemistry; legend: state-of-the-art frontier, open models, proprietary models]

GPQA Chemistry Leaderboard

1 model

Rank	Model	Organization	Score	Context	Cost	License
1	o1	OpenAI	0.647	—	—	—

About this benchmark

What is GPQA Chemistry?

Chemistry subset of GPQA, containing challenging multiple-choice questions written by domain experts in chemistry. These Google-proof questions require graduate-level knowledge and reasoning.

GPQA Chemistry is a text benchmark evaluating models on reasoning and chemistry tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the current average and the leading score are both 0.647.
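The summary figures above follow from simple arithmetic on the 0–1 scores. The sketch below shows one way to derive the leader, the average, and the percentage form. The `scores` mapping and the helper names `summarize` and `as_percent` are hypothetical and are not part of any LLM Stats API. The only data point is the o1 score quoted on this page.

```python
from statistics import mean

# Hypothetical in-memory copy of the leaderboard; the single entry is the
# o1 score reported on this page (0-1 scale).
scores = {"o1": 0.647}


def summarize(scores):
    """Return (leader, leader_score, average) for a {model: score} mapping."""
    leader = max(scores, key=scores.get)
    average = mean(list(scores.values()))
    return leader, scores[leader], average


def as_percent(score):
    """Format a 0-1 score as a percentage string with one decimal place."""
    return f"{score * 100:.1f}%"


leader, top, avg = summarize(scores)
print(leader, as_percent(top), as_percent(avg))
```

With only one model listed, the leader's score and the average coincide. The two would diverge as soon as a second entry is added to the mapping.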

Compare leaders on the best AI for reasoning and best AI for chemistry leaderboards.

Current leaders

o1 from OpenAI currently leads the GPQA Chemistry leaderboard with a score of 0.647. It is the only AI model evaluated so far.

o1 (OpenAI): 64.7%

Source paper

Title: GPQA: A Graduate-Level Google-Proof Q&A Benchmark
Authors: David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, and 4 others
Published: November 20, 2023
arXiv: 2311.12022

Abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

FAQ

Common questions about the GPQA Chemistry benchmark and leaderboard.

What is the GPQA Chemistry leaderboard?

The GPQA Chemistry leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: o1 by OpenAI, with a score of 0.647. With a single entry, the average score is also 0.647.

What is the highest GPQA Chemistry score?

The highest GPQA Chemistry score is 0.647, achieved by o1 from OpenAI.

How many models are evaluated on GPQA Chemistry?

1 model has been evaluated on the GPQA Chemistry benchmark, with 0 verified results and 1 self-reported result.

Where can I find the GPQA Chemistry paper?

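The source paper, "GPQA: A Graduate-Level Google-Proof Q&A Benchmark", is available at https://arxiv.org/abs/2311.12022. It covers the full GPQA dataset, of which GPQA Chemistry is a subset, and details the methodology, dataset construction, and evaluation criteria.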

What categories does GPQA Chemistry cover?

GPQA Chemistry is categorized under reasoning and chemistry. The benchmark evaluates text models.

How recent are the GPQA Chemistry leaderboard results?

The GPQA Chemistry leaderboard was last updated in June 2026 and currently includes 1 evaluated model.