
GeneBench

GeneBench is an evaluation focused on multi-stage scientific data analysis in genetics and quantitative biology. Tasks require reasoning about ambiguous or noisy data with minimal supervisory guidance, addressing realistic obstacles such as hidden confounders or QC failures, and correctly implementing and interpreting modern statistical methods.

Paper: https://cdn.openai.com/pdf/6dc7175d-d9e7-4b8d-96b8-48fe5798cd5b/oai_genebench_benchmark.pdf

Progress Over Time

[Interactive timeline showing model performance evolution on GeneBench, with the state-of-the-art frontier and open vs. proprietary models distinguished.]
GeneBench Leaderboard

2 models

Rank  Model                      Context  Cost                License
1     GPT-5.5 Pro (OpenAI)       1.0M     $30.00 / $180.00
2     OpenAI (model not listed)  1.0M     $5.00 / $30.00

FAQ

Common questions about GeneBench

What is GeneBench?
GeneBench is an evaluation focused on multi-stage scientific data analysis in genetics and quantitative biology. Tasks require reasoning about ambiguous or noisy data with minimal supervisory guidance, addressing realistic obstacles such as hidden confounders or QC failures, and correctly implementing and interpreting modern statistical methods.
Where can I find the GeneBench paper?
The GeneBench paper is available at https://cdn.openai.com/pdf/6dc7175d-d9e7-4b8d-96b8-48fe5798cd5b/oai_genebench_benchmark.pdf. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
How does the GeneBench leaderboard rank models?
The GeneBench leaderboard ranks 2 AI models based on their performance on this benchmark. Currently, GPT-5.5 Pro by OpenAI leads with a score of 0.332. The average score across all models is 0.291.
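Assuming the reported average is the unweighted arithmetic mean over both leaderboard entries (an assumption; the page does not say how the average is computed), the unlisted second score can be recovered from the two reported figures:

```python
# Recover the second model's score from the reported mean of two entries.
# Assumption: the "average score" is the plain arithmetic mean.
top_score = 0.332    # GPT-5.5 Pro, reported above
mean_score = 0.291   # reported average across all models
n_models = 2

second_score = mean_score * n_models - top_score
print(round(second_score, 3))  # → 0.25
```

Under that assumption, the second-ranked model would have scored 0.250.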
What is the highest GeneBench score?
The highest GeneBench score is 0.332, achieved by GPT-5.5 Pro from OpenAI.
How many models have been evaluated on GeneBench?
2 models have been evaluated on the GeneBench benchmark, with 0 verified results and 2 self-reported results.
What categories does GeneBench cover?
GeneBench is categorized under agents, reasoning, and science. The benchmark evaluates text models.