Live Updates

LLM Updates Today

Real-time AI news, community insights, and new model releases. Your source for benchmarking and developments in artificial intelligence.

LLM News & Updates


Top community discussions and insights from this week

No posts available this week. Check back soon!

AI Updates Today: New LLM Models This Week


Recently released language models and their benchmark performance

No new models released this week. Check back soon!

New Open-Source LLM Models This Week


Recently released open-source language models

No new open-source models released this week. Check back soon!

Today's Highlights

Benchmarking and News About AI

Comprehensive AI model evaluation, real-world arena testing, and the latest research insights

Popular Benchmarks


GPQA

general · reasoning

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are designed to be Google-proof and extremely difficult: PhD-level domain experts reach about 65% accuracy.

130 models

MMLU

general · language

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains

81 models

MMLU-Pro

general · language

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.

69 models
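For a rough sense of how a 10-option benchmark like MMLU-Pro is scored, here is a minimal accuracy sketch; the record fields and the predict callable are hypothetical placeholders, not the official evaluation harness:

```python
# Minimal sketch of scoring a 10-option multiple-choice benchmark such as MMLU-Pro.
# MCQRecord fields and predict() are hypothetical placeholders, not an official schema.
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQRecord:
    question: str
    options: list[str]   # 10 answer options (A-J) instead of MMLU's 4
    answer: str          # gold label, e.g. "C"

def accuracy(records: list[MCQRecord],
             predict: Callable[[str, list[str]], str]) -> float:
    """Fraction of questions where predict(question, options) returns the gold letter."""
    correct = sum(1 for r in records if predict(r.question, r.options) == r.answer)
    return correct / len(records)

# With 10 options, random guessing lands near 10% instead of 25%,
# one reason reported scores drop relative to the original MMLU.
```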

MATH

math · reasoning

The MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution, and the problems span five difficulty levels (1-5) and seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

64 models
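A common way to grade MATH outputs is to extract the final \boxed{...} answer from a model's solution and exact-match it against the reference solution. The sketch below is simplified (no nested braces, no LaTeX normalization) and is not the dataset's official grader:

```python
import re

def extract_boxed(solution: str) -> str | None:
    """Return the last \\boxed{...} expression in a step-by-step solution.
    Simplified: assumes no nested braces inside the box."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", solution)
    return matches[-1].strip() if matches else None

def is_correct(model_output: str, reference_solution: str) -> bool:
    # Real harnesses also normalize equivalent LaTeX forms (e.g. \frac{1}{2} vs 1/2).
    pred = extract_boxed(model_output)
    gold = extract_boxed(reference_solution)
    return pred is not None and pred == gold
```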

HumanEval

code · reasoning

A benchmark that measures functional correctness for synthesizing programs from docstrings, consisting of 164 original programming problems assessing language comprehension, algorithms, and simple mathematics

63 models
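HumanEval results are typically reported as pass@k: the probability that at least one of k sampled programs passes all unit tests. Below is a sketch of the unbiased estimator introduced alongside HumanEval, computed from n samples per problem of which c pass; treat it as an illustration rather than a drop-in harness:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n-c, k) / C(n, k), written as a
    running product for numerical stability.
    n = samples generated for a problem, c = samples passing all unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.prod(1.0 - k / j for j in range(n - c + 1, n + 1))

# Example: 200 samples per problem, 30 of them pass -> estimated pass@10
print(round(pass_at_k(200, 30, 10), 3))
```

Averaging this value over all 164 problems gives the benchmark's reported pass@k.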

MMMU

general · multimodal

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. Contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering across 30 subjects and 183 subfields.

52 models

LiveCodeBench

code · general

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

51 models
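The contamination control LiveCodeBench describes (problems annotated with release dates) boils down to a date filter: only evaluate a model on problems published after its training cutoff. A minimal sketch with hypothetical problem records:

```python
from datetime import date

# Hypothetical problem records; only "release_date" matters for the filter.
problems = [
    {"id": "lc-3012",  "source": "LeetCode",   "release_date": date(2024, 9, 14)},
    {"id": "cf-1989D", "source": "CodeForces", "release_date": date(2024, 6, 2)},
]

def unseen_problems(problems: list[dict], training_cutoff: date) -> list[dict]:
    """Keep only problems released strictly after the model's training-data cutoff."""
    return [p for p in problems if p["release_date"] > training_cutoff]

eval_set = unseen_problems(problems, training_cutoff=date(2024, 7, 1))
print([p["id"] for p in eval_set])  # ['lc-3012']
```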

AIME 2025

math · reasoning

All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000 to 999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.

48 models
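Because every AIME answer is an integer from 000 to 999, grading reduces to pulling a final integer out of the model's response and exact-matching it against the reference answer. The helper below is a simplified, hypothetical grader, not an official one:

```python
import re

def normalize_aime_answer(text: str) -> int | None:
    """Take the last integer in a response and accept it only if it is
    a valid AIME answer (0-999). Simplified; ignores formatting conventions."""
    matches = re.findall(r"\d+", text)
    if not matches:
        return None
    value = int(matches[-1])
    return value if 0 <= value <= 999 else None

def aime_accuracy(responses: list[str], gold_answers: list[int]) -> float:
    """Exact-match accuracy over the 30 problems."""
    correct = sum(1 for r, g in zip(responses, gold_answers)
                  if normalize_aime_answer(r) == g)
    return correct / len(gold_answers)
```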