OlympiadBench

Name: OlympiadBench Leaderboard — AI Model Scores
Creator: LLM Stats
License: https://llm-stats.com/legal/terms-of-service

OlympiadBench Leaderboard

1 model

Rank	Model	Organization	Score	Parameters	Context	Cost	License
1	QvQ-72B-Preview	Alibaba Cloud / Qwen Team	0.204	73B	—	—	

About this benchmark

What is OlympiadBench?

OlympiadBench is a challenging benchmark of Olympiad-level, bilingual, multimodal scientific problems, intended to promote progress toward AGI. It comprises 8,476 math and physics problems drawn from international and Chinese Olympiads and the Chinese college entrance exam, each with expert-level annotations for step-by-step reasoning. It includes both text-only and multimodal problems in English and Chinese.

OlympiadBench is a multimodal benchmark that evaluates models on math, multimodal, physics, reasoning, and vision tasks. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. Because only one model is listed, the current average and the leading score are the same: 0.204.
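
As a rough illustration of how the 0–1 scores on this page relate to the percentages shown elsewhere, the sketch below finds the leader, averages the listed scores, and formats both as percentages. The names `scores` and `to_percent` are hypothetical and chosen for this example; this is not LLM Stats' own code, and the only data used is the single score listed on this page.

```python
from statistics import mean

# Scores as listed on this page (0-1 scale). Only one model is listed.
# The dictionary layout is an assumption made for this sketch.
scores = {"QvQ-72B-Preview": 0.204}


def to_percent(score: float) -> str:
    """Format a 0-1 score as a percentage with one decimal place."""
    return f"{score * 100:.1f}%"


# The leader is the model with the highest score.
leader = max(scores, key=scores.get)

# With a single entry, the average equals the leader's score.
average = mean(list(scores.values()))

print(f"leader: {leader} ({to_percent(scores[leader])})")
print(f"average: {to_percent(average)}")
```

With one entry, the leader and the average coincide, which is why the page reports the same number for both.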

Current leaders

QvQ-72B-Preview from Alibaba Cloud / Qwen Team currently leads the OlympiadBench leaderboard with a score of 0.204. It is the only model evaluated so far.

QvQ-72B-Preview (Alibaba Cloud / Qwen Team): 20.4%

Source paper

Title: OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Authors: Chaoqun He, Renjie Luo, Yuzhuo Bai, Shengding Hu, and 10 others
Published: February 21, 2024
arXiv: 2402.14008

Abstract

Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark rigor and the intricacy of physical reasoning. Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for helping future AGI research endeavors. The data and evaluation code are available at https://github.com/OpenBMB/OlympiadBench.

FAQ

Common questions about the OlympiadBench benchmark and leaderboard.

What is the OlympiadBench leaderboard?

The OlympiadBench leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: QvQ-72B-Preview by Alibaba Cloud / Qwen Team, which leads with a score of 0.204. With a single entry, the average score is also 0.204.

What is the highest OlympiadBench score?

The highest OlympiadBench score is 0.204, achieved by QvQ-72B-Preview from Alibaba Cloud / Qwen Team.

How many models are evaluated on OlympiadBench?

One model has been evaluated on the OlympiadBench benchmark. Its result is self-reported; no results have been verified.

Where can I find the OlympiadBench paper?

The OlympiadBench paper is available at https://arxiv.org/abs/2402.14008. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does OlympiadBench cover?

OlympiadBench is categorized under math, multimodal, physics, reasoning, and vision. The benchmark evaluates multimodal models on problems in both English and Chinese.

What is the best open-source model on OlympiadBench?

QvQ-72B-Preview by Alibaba Cloud / Qwen Team is the top-ranked open-source model on OlympiadBench, with a score of 0.204 (rank #1).

How recent are the OlympiadBench leaderboard results?

The OlympiadBench leaderboard was last updated in July 2026 and currently includes 1 evaluated model.