Evaluation for frontier models.

Pre-release evaluation on the standardized suite. Bespoke benchmarks for coding, long-context reasoning, and finance.

Pre-release evaluations.

Connect your model. We run the same benchmarks the public leaderboard uses, before launch.

Reasoning & knowledge

GPQA
448 expert-written questions in biology, physics, and chemistry. PhD-level domain experts reach 65% accuracy.
Multi-Challenge
Multi-turn dialogue: instruction retention, inference memory, versioned editing, self-coherence.
ERQA
400 embodied-reasoning questions: spatial, trajectory, action, state, multi-view.
OptimBench
Constraint and optimization problems with verifiable solutions.

Long context

NoLiMa
Long-context retrieval with no lexical overlap between query and answer.
LongBench v2
503 questions across 8k–2M-word contexts: doc QA, in-context learning, code repos.
MRCR v2 (8-needle)
Multi-round coreference with 8 needles in a single long conversation.

Multimodal

MMMU-Pro
Multi-discipline multimodal understanding. Stricter than the original MMMU.
ScreenSpot Pro
1,581 GUI grounding instructions across 23 professional apps and 3 OSes.

Factuality & domain

FACTS Grounding
1,719 examples with source documents up to 32k tokens. Tests whether long-form answers stay grounded in the provided document.
HealthBench Hard
The 1,000 hardest examples from HealthBench's 5,000 multi-turn healthcare dialogues. Rubrics from 262 physicians.

Bespoke benchmarks.

You bring the application or the data. We build the benchmark.

Coding

Real repository tasks, graded by working engineers.

Agent traces
PR review
Bug repair
Refactors

Long-context reasoning

100k+ token contexts with adversarial needles.

Multi-doc QA
Coreference
Conclusion-following

Finance

Graded by analysts with capital-markets experience.

10-Ks
Term sheets
Earnings calls
Tabular reasoning

How we run the annotation pipeline

Send us a model.

We respond within a day.

Email the founders

Best for longer threads or attaching files.

[email protected]

Join the community

2,000+ researchers discussing models and benchmarks