InfiniteBench/En.QA

InfiniteBench/En.QA is the English question-answering task of InfiniteBench, the first LLM benchmark whose average input length exceeds 100K tokens. The full suite comprises 12 tasks spanning diverse domains and is designed to evaluate long-context capabilities.
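
For readers who want to inspect the task data directly, the following is a minimal sketch of downloading and reading the En.QA split. The repository ID, file name, and field names are assumptions based on the public InfiniteBench release and should be checked against the official repo (github.com/OpenBMB/InfiniteBench).

    import json
    from huggingface_hub import hf_hub_download

    # Assumed dataset repo and file name for the En.QA task; verify against
    # the official InfiniteBench release before relying on them.
    path = hf_hub_download(
        repo_id="xinrongzhang2022/InfiniteBench",
        filename="longbook_qa_eng.jsonl",
        repo_type="dataset",
    )

    with open(path) as f:
        examples = [json.loads(line) for line in f]

    ex = examples[0]
    print(ex.keys())           # assumed schema: id, context, input, answer
    print(len(ex["context"]))  # book-length context, >100K tokens on average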

Llama 3.2 3B Instruct from Meta currently leads the InfiniteBench/En.QA leaderboard with a score of 0.198; it is the only AI model evaluated so far.

Paper: https://arxiv.org/abs/2402.13718

Meta's Llama 3.2 3B Instruct leads with 19.8%.

Progress Over Time

Interactive timeline showing model performance evolution on InfiniteBench/En.QA

InfiniteBench/En.QA Leaderboard

1 model

Model                    Context   Cost             License
Llama 3.2 3B Instruct    128K      $0.01 / $0.02    Open

FAQ

Common questions about InfiniteBench/En.QA.

What is the InfiniteBench/En.QA benchmark?

InfiniteBench/En.QA is the English question-answering task of InfiniteBench, the first LLM benchmark whose average input length exceeds 100K tokens. The full benchmark comprises 12 tasks spanning diverse domains and is designed to evaluate long-context capabilities.

What is the InfiniteBench/En.QA leaderboard?

The InfiniteBench/En.QA leaderboard ranks AI models by their performance on this benchmark; so far only one model has been evaluated. Currently, Llama 3.2 3B Instruct by Meta leads with a score of 0.198, which is also the average score across all evaluated models.

What is the highest InfiniteBench/En.QA score?

The highest InfiniteBench/En.QA score is 0.198, achieved by Llama 3.2 3B Instruct from Meta.

How many models are evaluated on InfiniteBench/En.QA?

One model has been evaluated on the InfiniteBench/En.QA benchmark, with 0 verified results and 1 self-reported result.

Where can I find the InfiniteBench/En.QA paper?

The InfiniteBench/En.QA paper is available at https://arxiv.org/abs/2402.13718. The paper details the methodology, dataset construction, and evaluation criteria.
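
En.QA answers are free-form, so scores are typically computed with a token-overlap metric rather than exact match. The snippet below is a hedged sketch of a standard token-level F1; it is not necessarily the exact normalization and scoring used by the official evaluation scripts, which are defined in the paper and repo.

    import re
    from collections import Counter

    def _tokens(s: str) -> list[str]:
        return re.findall(r"\w+", s.lower())

    def token_f1(prediction: str, reference: str) -> float:
        """Token-level F1 between a model answer and a reference answer."""
        pred, ref = _tokens(prediction), _tokens(reference)
        if not pred or not ref:
            return float(pred == ref)
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    # A verbose but correct answer gets partial credit:
    print(token_f1("He fled to Paris after the trial", "Paris"))  # 0.25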

What categories does InfiniteBench/En.QA cover?

InfiniteBench/En.QA falls under the long context category and evaluates text models.

More evaluations to explore

Related benchmarks in the same category

NoLiMa
long context
44 models
LVBench

LVBench is an extreme long video understanding benchmark designed to evaluate multimodal models on videos up to two hours in duration. It contains 6 major categories and 21 subcategories, with videos averaging five times longer than existing datasets. The benchmark addresses applications requiring comprehension of extremely long videos.

long context, multimodal
20 models
LongBench v2

LongBench v2 is a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. It consists of 503 challenging multiple-choice questions with contexts ranging from 8k to 2M words across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding.

long context
14 models
AA-LCR

Artificial Analysis Long Context Reasoning benchmark

long context
13 models
EgoSchema

A diagnostic benchmark for very long-form video-language understanding consisting of over 5,000 human-curated multiple-choice questions based on 3-minute video clips from Ego4D, covering a broad range of natural human activities and behaviors.

long context, video
9 models
MLVU

A comprehensive benchmark for multi-task long video understanding that evaluates multimodal large language models on videos ranging from 3 minutes to 2 hours across 9 distinct tasks including reasoning, captioning, recognition, and summarization.

long context, multimodal
9 models