SciCode
SciCode Leaderboard
| Rank | Organization | Parameters | Context | Cost | License |
|---|---|---|---|---|---|
| 1 | ByteDance | — | — | — | |
| 2 | Google | — | 1.0M | $2.50 / $15.00 | |
| 3 | ByteDance | — | — | — | |
| 4 | Alibaba Cloud / Qwen Team | — | 1.0M | $1.25 / $3.75 | |
| 5 | Moonshot AI | 1.0T | 262K | $0.75 / $3.50 | |
| 6 | Alibaba Cloud / Qwen Team | — | 1.0M | $0.32 / $1.28 | |
| 7 | Moonshot AI | 1.0T | — | — | |
| 8 | Moonshot AI | 1.0T | — | — | |
| 9 | | 550B | — | — | |
| 10 | | 120B | — | — | |
| 11 | Zhipu AI | 355B | — | — | |
| 12 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | |
| 13 | Cohere | 30B | — | — | |
| 14 | Inception | — | 128K | $0.25 / $0.75 | |
| 14 | Cohere | 218B | — | — | |
| 16 | Zhipu AI | 106B | — | — | |
| 17 | MiniMax | 230B | 1.0M | $0.30 / $1.20 | |
| 18 | | 32B | 262K | $0.06 / $0.24 | |
What is SciCode?
SciCode is a research coding benchmark curated by scientists that challenges language models to code solutions for scientific problems. It contains 338 subproblems decomposed from 80 challenging main problems across 16 natural science sub-fields including mathematics, physics, chemistry, biology, and materials science. Problems require knowledge recall, reasoning, and code synthesis skills.
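The decomposition into main problems and subproblems can be sketched as a small data structure. This is a minimal illustration, not the benchmark's own evaluation harness: the names `MainProblem` and `solve_rates` are hypothetical, the outcomes are made up, and the rule that a main problem counts as solved only when all of its subproblems pass is an assumption stated here rather than something this page specifies.

```python
from dataclasses import dataclass, field


@dataclass
class MainProblem:
    """A main problem and the pass/fail outcome of each of its subproblems."""
    problem_id: str
    subproblem_passed: list[bool] = field(default_factory=list)

    def solved(self) -> bool:
        # Assumption: solved only when every subproblem passes its tests.
        return bool(self.subproblem_passed) and all(self.subproblem_passed)


def solve_rates(problems: list[MainProblem]) -> tuple[float, float]:
    """Return (subproblem pass rate, main-problem solve rate), each in [0, 1]."""
    total_sub = sum(len(p.subproblem_passed) for p in problems)
    passed_sub = sum(sum(p.subproblem_passed) for p in problems)
    sub_rate = passed_sub / total_sub if total_sub else 0.0
    main_rate = sum(p.solved() for p in problems) / len(problems) if problems else 0.0
    return sub_rate, main_rate


# Hypothetical outcomes for three main problems (not real SciCode results).
example = [
    MainProblem("p1", [True, True, True]),
    MainProblem("p2", [True, False, True, True]),
    MainProblem("p3", [False, False]),
]
sub_rate, main_rate = solve_rates(example)
print(f"subproblems: {sub_rate:.3f}, main problems: {main_rate:.3f}")
```

Under this assumed rule, the main-problem rate can sit well below the subproblem rate, because a single failed subproblem marks the whole main problem as unsolved.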
SciCode is a text-only benchmark covering math, physics, reasoning, biology, chemistry, and code tasks. LLM Stats tracks 18 models on this benchmark, scored on a 0–1 scale. Rounded to one decimal place, the current average is 0.5 and the leading score is 0.6.
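As a rough sketch of how such summary figures might be derived, the snippet below takes a mapping of model names to scores on the 0–1 scale, picks the leader, and rounds both numbers to one decimal place. The model names and all scores except 0.598 (the leading score reported on this page) are hypothetical placeholders, not leaderboard data.

```python
from statistics import mean

# Hypothetical scores on the 0-1 scale; only 0.598 is taken from this page.
scores = {"leader": 0.598, "model_b": 0.52, "model_c": 0.41}

# The leader is the entry with the highest score.
leader_name, leader_score = max(scores.items(), key=lambda kv: kv[1])
average = mean(list(scores.values()))

# Rounding to one decimal place turns 0.598 into 0.6.
print(leader_name, round(leader_score, 1))
print(round(average, 1))
```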
Current leaders
Seed 2.1 Pro from ByteDance currently leads the SciCode leaderboard, scoring 0.598 and ranking first among the 18 evaluated AI models.
Source paper
- Title: SciCode: A Research Coding Benchmark Curated by Scientists
- Authors: Minyang Tian, Luyu Gao, Shizhuo Dylan Zhang, Xinan Chen, and 26 others
- Published: arXiv:2407.13168
Abstract
Since language models (LMs) now outperform average humans on many challenging tasks, it has become increasingly difficult to develop challenging, high-quality, and realistic evaluations. We address this issue by examining LMs' capabilities to generate code for solving real scientific research problems. Incorporating input from scientists and AI researchers in 16 diverse natural science sub-fields, including mathematics, physics, chemistry, biology, and materials science, we created a scientist-curated coding benchmark, SciCode. The problems in SciCode naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains 338 subproblems decomposed from 80 challenging main problems. It offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only 4.6% of the problems in the most realistic setting. We believe that SciCode demonstrates both contemporary LMs' progress towards becoming helpful scientific assistants and sheds light on the development and evaluation of scientific AI in the future.