BixBench

BixBench is a benchmark for real-world bioinformatics and computational biology data analysis. It evaluates AI models on multi-step scientific workflows that require code execution, statistical reasoning, and biological domain knowledge to interpret experimental data.

Paper

The BixBench paper is available at https://arxiv.org/abs/2503.00096.

Progress Over Time

Interactive timeline showing model performance evolution on BixBench, plotting open and proprietary models against the state-of-the-art frontier.
BixBench Leaderboard

1 model evaluated.

Rank  Model    Organization  Score  Context  Cost (input / output)
1     GPT-5.5  OpenAI        0.805  1.0M     $5.00 / $30.00

FAQ

Common questions about BixBench

What is BixBench?
BixBench is a benchmark for real-world bioinformatics and computational biology data analysis. It evaluates AI models on multi-step scientific workflows that require code execution, statistical reasoning, and biological domain knowledge to interpret experimental data.

Where can I find the BixBench paper?
The BixBench paper is available at https://arxiv.org/abs/2503.00096. It details the benchmark methodology, dataset creation, and evaluation criteria.

Which model leads the BixBench leaderboard?
The leaderboard currently ranks 1 AI model. GPT-5.5 by OpenAI leads with a score of 0.805, which is also the average score across all listed models.

What is the highest BixBench score?
The highest BixBench score is 0.805, achieved by GPT-5.5 from OpenAI.

How many models have been evaluated on BixBench?
1 model has been evaluated on the BixBench benchmark, with 0 verified results and 1 self-reported result.

What categories does BixBench fall under?
BixBench is categorized under agents, reasoning, and science. The benchmark evaluates text models.