STEM
A comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (Science, Technology, Engineering, Mathematics), designed to test neural models' vision-language STEM skills based on the K-12 curriculum. Unlike existing datasets that focus on expert-level ability, this dataset covers fundamental skills designed around educational standards.
Progress Over Time
Interactive timeline showing model performance evolution on STEM
STEM Leaderboard
1 model
| Rank | Model | Organization | Params | Score | Context | Cost | License |
|---|---|---|---|---|---|---|---|
| 1 | Qwen2.5-Coder 7B Instruct | Alibaba Cloud / Qwen Team | 7B | 0.340 | — | — | — |
FAQ
Common questions about STEM
The STEM paper is available at https://arxiv.org/abs/2402.17205. This paper provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.
The STEM leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Qwen2.5-Coder 7B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.340. With only one model evaluated, the average score equals this top score, 0.340.
The highest STEM score is 0.340, achieved by Qwen2.5-Coder 7B Instruct from Alibaba Cloud / Qwen Team.
1 model has been evaluated on the STEM benchmark, with 0 verified results and 1 self-reported result.
STEM is categorized under math, multimodal, and reasoning. The benchmark evaluates multimodal models.