TriviaQA

Progress Over Time

[Interactive timeline: model performance evolution on TriviaQA, showing the state-of-the-art frontier for open and proprietary models]

TriviaQA Leaderboard

18 models

Rank  Organization  Parameters  Context  Cost           License
1     Moonshot AI   1.0T
2                   27B
3                   1.0T        1.0M     $0.43 / $0.87
4                   24B
4                   24B
6                   24B
7                   8B
8                   9B
9     Mistral AI    675B
9                   14B
11                  12B
12                  2B
12                  8B
14                  8B
15                  8B
16                  8B
16                  2B
18                  3B

About this benchmark

What is TriviaQA?

A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents (six per question on average) that provide high quality distant supervision for answering the questions. The dataset features relatively complex, compositional questions with considerable syntactic and lexical variability, requiring cross-sentence reasoning to find answers.
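
To make the "question-answer-evidence triple" idea concrete, here is a minimal sketch of how one question with several evidence documents could be represented and expanded into triples. The class names, field names, and the sample question are hypothetical and are not taken from the official TriviaQA release; the substring check is only a simplified stand-in for distant supervision, where a document is treated as supporting evidence because it happens to contain the answer string.

```python
from dataclasses import dataclass, field


@dataclass
class EvidenceDocument:
    """One independently gathered document that may contain the answer."""
    source: str  # hypothetical label, e.g. "web" or "wikipedia"
    text: str


@dataclass
class TriviaExample:
    """A question-answer pair plus its evidence documents (hypothetical schema)."""
    question: str
    answer: str
    evidence: list[EvidenceDocument] = field(default_factory=list)

    def triples(self) -> list[tuple[str, str, str]]:
        """Expand into one (question, answer, evidence text) triple per document."""
        return [(self.question, self.answer, doc.text) for doc in self.evidence]

    def distantly_supported(self) -> list[EvidenceDocument]:
        """Documents whose text contains the answer string (simplified distant supervision)."""
        needle = self.answer.lower()
        return [doc for doc in self.evidence if needle in doc.text.lower()]


# Made-up sample data, for illustration only.
example = TriviaExample(
    question="Which planet is known as the Red Planet?",
    answer="Mars",
    evidence=[
        EvidenceDocument("wikipedia", "Mars is often called the Red Planet."),
        EvidenceDocument("web", "The fourth planet from the Sun has two moons."),
    ],
)

print(len(example.triples()))              # one triple per evidence document
print(len(example.distantly_supported()))  # documents that mention the answer
```

In this sketch the second document is about the right entity but never names it, which illustrates why answering can require reasoning beyond matching a single sentence.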

TriviaQA is a text benchmark evaluating models on reasoning and general tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. The current average is 0.736, with the leader at 0.851.

Current leaders

Among the 18 evaluated AI models, Kimi K2 Base from Moonshot AI currently leads the TriviaQA leaderboard with a score of 0.851.

1. Kimi K2 Base (Moonshot AI): 85.1%
2. Gemma 2 27B (Google): 83.7%
3. MiMo-V2.5-Pro (Xiaomi): 81.3%

Source paper

Title
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Authors
Mandar Joshi, Eunsol Choi, Daniel S. Weld, Luke Zettlemoyer
Published
May 2017 (arXiv:1705.03551)
Abstract

We present TriviaQA, a challenging reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents, six per question on average, that provide high quality distant supervision for answering the questions. We show that, in comparison to other recently introduced large-scale datasets, TriviaQA (1) has relatively complex, compositional questions, (2) has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and (3) requires more cross sentence reasoning to find answers. We also present two baseline algorithms: a feature-based classifier and a state-of-the-art neural network, that performs well on SQuAD reading comprehension. Neither approach comes close to human performance (23% and 40% vs. 80%), suggesting that TriviaQA is a challenging testbed that is worth significant future study. Data and code available at -- http://nlp.cs.washington.edu/triviaqa/

FAQ

Common questions about the TriviaQA benchmark and leaderboard.

What is the TriviaQA benchmark?

A large-scale reading comprehension dataset containing over 650K question-answer-evidence triples. TriviaQA includes 95K question-answer pairs authored by trivia enthusiasts and independently gathered evidence documents (six per question on average) that provide high quality distant supervision for answering the questions. The dataset features relatively complex, compositional questions with considerable syntactic and lexical variability, requiring cross-sentence reasoning to find answers.

What is the TriviaQA leaderboard?

The TriviaQA leaderboard ranks 18 AI models based on their performance on this benchmark. Currently, Kimi K2 Base by Moonshot AI leads with a score of 0.851. The average score across all models is 0.736.

What is the highest TriviaQA score?

The highest TriviaQA score is 0.851, achieved by Kimi K2 Base from Moonshot AI.

How many models are evaluated on TriviaQA?

18 models have been evaluated on the TriviaQA benchmark, with 0 verified results and 18 self-reported results.

Where can I find the TriviaQA paper?

The TriviaQA paper is available at https://arxiv.org/abs/1705.03551. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does TriviaQA cover?

TriviaQA is listed under the reasoning and general categories. It evaluates text models.

What is the best open-source model on TriviaQA?

Kimi K2 Base by Moonshot AI is the top-ranked open-source model on TriviaQA, with a score of 0.851 (rank #1).

Which model offers the best value on TriviaQA?

Among models scoring within 10% of the leader, MiMo-V2.5-Pro from Xiaomi is the cheapest, at $0.43 per million input tokens with a score of 0.813.
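
The selection rule above can be sketched in a few lines. This is an illustration under stated assumptions, not the site's actual method: "within 10% of the leader" is read as a relative threshold (score at least 90% of the top score), the function name `best_value` is hypothetical, and models whose price is not listed on this page are marked `None` and skipped.

```python
from typing import Optional

# (model, score on a 0-1 scale, USD per million input tokens, or None if not listed)
# Scores and the MiMo-V2.5-Pro price come from this page; other prices are not shown here.
MODELS: list[tuple[str, float, Optional[float]]] = [
    ("Kimi K2 Base", 0.851, None),
    ("Gemma 2 27B", 0.837, None),
    ("MiMo-V2.5-Pro", 0.813, 0.43),
]


def best_value(models, tolerance=0.10):
    """Return the cheapest priced model whose score is within `tolerance` of the leader.

    Assumption: the tolerance is relative, i.e. score >= leader * (1 - tolerance).
    Returns (name, score, price), or None if no qualifying model has a listed price.
    """
    leader_score = max(score for _, score, _ in models)
    cutoff = leader_score * (1 - tolerance)
    candidates = [
        (price, name, score)
        for name, score, price in models
        if score >= cutoff and price is not None
    ]
    if not candidates:
        return None
    price, name, score = min(candidates)  # lowest price wins; name breaks ties
    return name, score, price


print(best_value(MODELS))
```

With the leader at 0.851, the cutoff under this reading is about 0.766, so all three models listed here qualify on score and the choice comes down to the listed price.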

How recent are the TriviaQA leaderboard results?

The TriviaQA leaderboard was last updated in July 2026 and currently includes 18 evaluated models.