SciCode

Progress Over Time

[Interactive timeline: model performance on SciCode over time. Legend: state-of-the-art frontier, open models, proprietary models.]

SciCode Leaderboard

18 models
Rank  Organization               Parameters  Context  Cost per 1M tokens (input / output)
1     ByteDance                  -           -        -
2     -                          -           1.0M     $2.50 / $15.00
3     -                          -           -        -
4     Alibaba Cloud / Qwen Team  -           1.0M     $1.25 / $3.75
5     Moonshot AI                1.0T        262K     $0.75 / $3.50
6     Alibaba Cloud / Qwen Team  -           1.0M     $0.32 / $1.28
7     Moonshot AI                1.0T        -        -
8     -                          1.0T        -        -
9     -                          550B        -        -
10    -                          120B        -        -
11    Zhipu AI                   355B        -        -
12    -                          230B        1.0M     $0.30 / $1.20
13    -                          30B         -        -
14    Inception                  -           128K     $0.25 / $0.75
14    -                          218B        -        -
16    Zhipu AI                   106B        -        -
17    MiniMax                    230B        1.0M     $0.30 / $1.20
18    -                          32B         262K     $0.06 / $0.24
About this benchmark

What is SciCode?

SciCode is a research coding benchmark curated by scientists that challenges language models to code solutions for scientific problems. It contains 338 subproblems decomposed from 80 challenging main problems across 16 natural science sub-fields including mathematics, physics, chemistry, biology, and materials science. Problems require knowledge recall, reasoning, and code synthesis skills.

SciCode is a text benchmark that evaluates models on math, physics, reasoning, biology, chemistry, and code tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. The current average is 0.453, with the leader at 0.598.

Current leaders

Seed 2.1 Pro from ByteDance currently leads the SciCode leaderboard with a score of 0.598, the highest among the 18 evaluated models.

1. Seed 2.1 Pro (ByteDance): 59.8%
2. Gemini 3.1 Pro (Google): 59.0%
3. Seed 2.1 Turbo (ByteDance): 57.8%
Top open-weight model: Kimi K2.6, rank #5, 52.2%
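
The list above shows scores as percentages, while the rest of the page quotes the same results on the 0–1 scale (for example, 0.598 for Seed 2.1 Pro). A minimal Python sketch of the conversion follows; the function name `format_score` is hypothetical and is not part of any LLM Stats tooling.

```python
def format_score(score: float) -> str:
    """Render a 0-1 leaderboard score as a percentage with one decimal place."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be on the 0-1 scale")
    # The '%' presentation type multiplies by 100 and appends a percent sign.
    return f"{score:.1%}"


# Scores quoted on this page for the listed models.
LEADERS = {
    "Seed 2.1 Pro": 0.598,
    "Gemini 3.1 Pro": 0.590,
    "Seed 2.1 Turbo": 0.578,
    "Kimi K2.6": 0.522,
}

for model, score in LEADERS.items():
    print(f"{model}: {format_score(score)}")
```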

Source paper

Title: SciCode: A Research Coding Benchmark Curated by Scientists
Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, and 26 others
Published: July 2024 (arXiv:2407.13168)
Abstract

Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode demonstrates both contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.

FAQ

Common questions about the SciCode benchmark and leaderboard.

What is the SciCode benchmark?

SciCode is a research coding benchmark curated by scientists that challenges language models to code solutions for scientific problems. It contains 338 subproblems decomposed from 80 challenging main problems across 16 natural science sub-fields including mathematics, physics, chemistry, biology, and materials science. Problems require knowledge recall, reasoning, and code synthesis skills.

What is the SciCode leaderboard?

The SciCode leaderboard ranks 18 AI models based on their performance on this benchmark. Currently, Seed 2.1 Pro by ByteDance leads with a score of 0.598. The average score across all models is 0.453.

What is the highest SciCode score?

The highest SciCode score is 0.598, achieved by Seed 2.1 Pro from ByteDance.

How many models are evaluated on SciCode?

18 models have been evaluated on the SciCode benchmark, with 0 verified results and 18 self-reported results.

Where can I find the SciCode paper?

The SciCode paper is available at https://arxiv.org/abs/2407.13168. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does SciCode cover?

SciCode is categorized under math, physics, reasoning, biology, chemistry, and code. The benchmark evaluates text models.

What is the best open-source model on SciCode?

Kimi K2.6 by Moonshot AI is the top-ranked open-source model on SciCode, with a score of 0.522 (rank #5).

Which model offers the best value on SciCode?

Among models scoring within 10% of the leader, Gemini 3.1 Pro from Google is the cheapest, at $2.50 per million input tokens with a score of 0.590.
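
The selection rule in this answer can be sketched in Python as follows. This is an illustration under stated assumptions, not the site's actual implementation: "within 10% of the leader" is read as a relative threshold (score at least 0.9 times the leader's score), the names `Entry` and `best_value` are hypothetical, and only models whose scores are quoted on this page are included. Input prices that the page does not state are set to `None` and skipped.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    model: str
    score: float                  # 0-1 scale
    input_price: Optional[float]  # USD per 1M input tokens; None if not listed


def best_value(entries: List[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest priced entry scoring within `tolerance` of the leader.

    Assumption: the tolerance is relative to the leader's score.
    """
    if not entries:
        return None
    leader_score = max(e.score for e in entries)
    threshold = leader_score * (1.0 - tolerance)
    candidates = [
        e for e in entries
        if e.score >= threshold and e.input_price is not None
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e.input_price)


# Scores and prices as quoted on this page; unknown prices are None.
ENTRIES = [
    Entry("Seed 2.1 Pro", 0.598, None),
    Entry("Gemini 3.1 Pro", 0.590, 2.50),
    Entry("Seed 2.1 Turbo", 0.578, None),
    Entry("Kimi K2.6", 0.522, 0.75),
]

pick = best_value(ENTRIES)
print(pick.model if pick else "no priced candidate")
```

Under the relative reading the cutoff is 0.598 × 0.9 = 0.5382, so Kimi K2.6 (0.522) falls outside the window despite its lower price, which is consistent with the answer naming Gemini 3.1 Pro.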

How recent are the SciCode leaderboard results?

The SciCode leaderboard was last updated in July 2026 and currently includes 18 evaluated models.