Rankings draw from 18 benchmarks across three skill axes: retrieval accuracy (does the model find the right sources?), source attribution (can each claim be traced back to a source?), and multi-document synthesis (can the model integrate evidence from multiple sources without contradiction?). Hallucination resistance is weighted heavily — a model that confidently cites fake papers is worse than one that admits it doesn't know.
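One way to picture the hallucination weighting is a per-claim scoring rule in which a fabricated citation costs more than an abstention. The sketch below is illustrative only: the outcome labels, the numeric values in `CLAIM_SCORES`, and the function name are assumptions, not the scoring rule actually used for these rankings.

```python
# Hypothetical per-claim scores. The exact values are assumptions, chosen only
# so that a fabricated citation is penalized more than an abstention.
CLAIM_SCORES = {
    "supported": 1.0,    # claim cites a real source that backs it
    "abstained": 0.0,    # model says it does not know
    "fabricated": -2.0,  # model cites a source that does not exist
}


def hallucination_adjusted_score(outcomes):
    """Return the mean per-claim score over a list of outcome labels."""
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    return sum(CLAIM_SCORES[o] for o in outcomes) / len(outcomes)


# A cautious model that abstains twice vs. a confident one that fabricates once.
cautious = hallucination_adjusted_score(
    ["supported", "abstained", "abstained", "supported"]
)
confident = hallucination_adjusted_score(
    ["supported", "fabricated", "supported", "supported"]
)
```

Under these assumed values, the cautious model ends up ahead even though it answers fewer questions, which matches the stated preference for admitting uncertainty over citing fake papers.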
Tests cover both web-grounded research (where the model searches the live web) and document-grounded research (RAG over a fixed corpus). These are different skills: web research rewards recency awareness and source quality judgment; corpus research rewards passage retrieval and long-context reasoning over the retrieved chunks.
Scores are normalized across benchmarks so that results reported on different scales can be compared. Where independent reproductions exist, they are weighted above self-reported scores from model cards. Research benchmarks in particular are easy to game with prompt engineering, so independent runs are more reliable.
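The normalization and weighting described above could look roughly like the following. This is a minimal sketch under stated assumptions: the text does not specify the normalization method or the weights, so min-max scaling, the `Result` type, and the 2:1 weight ratio are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    model: str
    benchmark: str
    score: float        # raw score on the benchmark's own scale
    independent: bool   # True if the score comes from an independent reproduction


# Hypothetical weights: the source only says independent runs count for more.
INDEPENDENT_WEIGHT = 2.0
SELF_REPORTED_WEIGHT = 1.0


def normalize(results):
    """Min-max scale scores to [0, 1] within each benchmark (assumed method)."""
    by_bench = {}
    for r in results:
        by_bench.setdefault(r.benchmark, []).append(r.score)
    normalized = []
    for r in results:
        lo, hi = min(by_bench[r.benchmark]), max(by_bench[r.benchmark])
        # If every model scored the same, place them all at the midpoint.
        scaled = 0.5 if hi == lo else (r.score - lo) / (hi - lo)
        normalized.append((r, scaled))
    return normalized


def aggregate(results):
    """Return each model's weighted mean of normalized scores."""
    totals = {}
    for r, scaled in normalize(results):
        w = INDEPENDENT_WEIGHT if r.independent else SELF_REPORTED_WEIGHT
        num, den = totals.get(r.model, (0.0, 0.0))
        totals[r.model] = (num + w * scaled, den + w)
    return {model: num / den for model, (num, den) in totals.items()}


# Made-up example data, not real benchmark results.
example = [
    Result("model-a", "bench-1", 80.0, independent=True),
    Result("model-b", "bench-1", 60.0, independent=False),
    Result("model-a", "bench-2", 0.5, independent=False),
    Result("model-b", "bench-2", 0.9, independent=False),
]
ranking = aggregate(example)
```

Scaling within each benchmark before averaging keeps a benchmark scored out of 100 from swamping one scored between 0 and 1. Other normalizations, such as z-scores or rank-based scaling, would slot into `normalize` without changing the weighting step.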