Evaluation for frontier models.

Pre-release evaluation on the standardized suite. Bespoke benchmarks for coding, long-context reasoning, and finance.

Pre-release evaluations.

Connect your model. We run the same benchmarks the public leaderboard uses, before launch.

Reasoning & knowledge

  • GPQA

    448 expert questions in biology, physics, and chemistry. PhDs reach 65%.

  • Multi-Challenge

    Multi-turn dialogue: instruction retention, inference memory, versioned editing, self-coherence.

  • ERQA

    400 embodied-reasoning questions: spatial, trajectory, action, state, multi-view.

  • OptimBench

    Constraint and optimization problems with verifiable solutions.

Long context

  • NoLiMa

    Long-context retrieval with minimal lexical overlap between the question and the needle.

  • LongBench v2

    503 questions across 8k–2M-word contexts: doc QA, in-context learning, code repos.

  • MRCR v2 (8-needle)

    Multi-round coreference with 8 needles in a single long conversation.
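
    To illustrate the lexical-overlap property in the NoLiMa entry above, here is a minimal sketch of a check that flags needles sharing content words with the question. The function names, the stopword list, and the example sentences are assumptions for illustration, not the benchmark's own filtering code.

    ```python
    import re

    # Hypothetical stopword list for illustration; a real filter would be larger.
    STOPWORDS = {
        "a", "an", "the", "of", "in", "to", "is", "was",
        "has", "been", "which", "who", "next", "and",
    }


    def content_words(text: str) -> set[str]:
        """Lowercase the text and return its non-stopword word tokens."""
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        return {t for t in tokens if t not in STOPWORDS}


    def lexical_overlap(query: str, needle: str) -> set[str]:
        """Return the content words shared by the query and the needle."""
        return content_words(query) & content_words(needle)


    query = "Which character has been to Dresden?"
    literal_needle = "Yuki has been to Dresden twice."
    latent_needle = "Yuki lives next to the Semper Opera House."

    # The literal needle can be found by string matching on "dresden".
    print(lexical_overlap(query, literal_needle))
    # The latent needle shares no content words, so the model has to
    # associate the landmark with the city to answer.
    print(lexical_overlap(query, latent_needle))
    ```

    A needle passes this check only when the returned set is empty. Exact token matching is a rough proxy: it would miss overlaps through inflected or derived forms.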

Multimodal

  • MMMU-Pro

    Multi-discipline multimodal understanding. Stricter than the original MMMU.

  • ScreenSpot Pro

    1,581 GUI grounding instructions across 23 professional apps and 3 OSes.

Factuality & domain

  • FACTS Grounding

    1,719 examples up to 32k tokens. Whether long-form answers stay grounded.

  • HealthBench Hard

    A 1,000-example hard subset of HealthBench's 5,000 healthcare conversations. Rubrics from 262 physicians.

Bespoke benchmarks.

You bring the application or the data. We build the benchmark.

Coding

Real repository tasks, graded by working engineers.

  • Agent traces
  • PR review
  • Bug repair
  • Refactors

Long-context reasoning

100k+ token contexts with adversarial needles.

  • Multi-doc QA
  • Coreference
  • Conclusion-following

Finance

Graded by analysts with capital-markets experience.

  • 10-Ks
  • Term sheets
  • Earnings calls
  • Tabular reasoning

How we run the annotation pipeline

Y Combinator
Hugging Face
Google
Harvard Medical

Send us a model.

We respond within a day.

Email the founders

Best for longer threads or attaching files.

[email protected]

Join the community

2,000+ researchers discussing models and benchmarks