Coding
Real repository tasks, graded by working engineers.
- Agent traces
- PR review
- Bug repair
- Refactors
Pre-release evaluation on the standardized suite, plus bespoke benchmarks for coding, long-context reasoning, and finance.
Connect your model. We run the same benchmarks the public leaderboard uses, before launch.
448 expert questions in biology, physics, and chemistry. PhDs reach 65%.
Multi-turn dialogue: instruction retention, inference memory, versioned editing, self-coherence.
400 embodied-reasoning questions: spatial, trajectory, action, state, multi-view.
Constraint and optimization problems with verifiable solutions.
Long-context retrieval with no lexical overlap between query and answer.
503 questions across 8k–2M-word contexts: doc QA, in-context learning, code repos.
Multi-round coreference with 8 needles in a single long conversation.
Multi-discipline multimodal understanding. Stricter than the original MMMU.
1,581 GUI grounding instructions across 23 professional apps and 3 OSes.
1,719 examples up to 32k tokens, testing whether long-form answers stay grounded in the provided context.
5,000 multi-turn healthcare dialogues. Rubrics from 262 physicians.
You bring the application or the data. We build the benchmark.
Real repository tasks, graded by working engineers.
100k+ token contexts with adversarial needles.
Graded by analysts with capital-markets experience.
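A minimal sketch of what an adversarial needle test looks like in practice: a single fact is planted at a random depth inside a long distractor context, and the model's answer is graded against the gold string. All names and sentences below are invented for illustration; real tests also paraphrase the question so it shares no keywords with the needle.

```python
import random

def build_context(needle: str, filler: str, total_sentences: int, seed: int = 0) -> str:
    """Plant the needle at a random position among filler sentences."""
    rng = random.Random(seed)
    sentences = [filler] * total_sentences
    sentences.insert(rng.randrange(total_sentences), needle)
    return " ".join(sentences)

def score(model_answer: str, gold: str) -> bool:
    """Exact-match grading after light normalization."""
    return model_answer.strip().lower() == gold.strip().lower()

context = build_context(
    needle="The courier left the parcel beneath the oak bench.",
    filler="The committee reviewed quarterly logistics figures.",
    total_sentences=5000,
)
assert "oak bench" in context
assert score(" Beneath the oak bench. ", "beneath the oak bench.")
```

Scaling `total_sentences` is how such harnesses reach 100k+ token contexts; the grading step is usually replaced by a rubric or an expert reviewer for open-ended answers.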
Send us a model.
We respond within a day.
2,000+ researchers discussing models and benchmarks