Better measurement for AI.

AI is the most consequential technology of our lifetime. What gets built with it depends on how well we understand what it can do.

Right now, most benchmarks are chosen by the people selling the models. Results are cherry-picked. Failures are omitted. Decisions worth millions are made on evidence that wouldn't survive peer review.

We think that's a problem worth fixing.

Without open, reproducible evaluation, we can't predict how these systems will reshape work, science, or public trust. The second-order effects stay invisible until it's too late.

We're two people. 200+ benchmarks. 500K+ comparisons. 200K+ users. 140+ countries. Frontier labs and Fortune 500 teams run evaluations on our infrastructure.

If this matters to you, we'd like to hear from you.

Co-founders, llm-stats