DeepSearchQA

DeepSearchQA is a benchmark for evaluating deep search and question-answering capabilities, testing models' ability to perform multi-hop reasoning and information retrieval across complex knowledge domains.

Claude Opus 4.6 from Anthropic currently leads the DeepSearchQA leaderboard with a score of 0.913 across 5 evaluated AI models.