STEM

A comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (Science, Technology, Engineering, Mathematics), designed to test neural models' vision-language STEM skills based on the K-12 curriculum. Unlike existing datasets that focus on expert-level ability, this dataset covers fundamental skills designed around educational standards.

Paper: https://arxiv.org/abs/2402.17205

Progress Over Time

[Interactive timeline of model performance on STEM over time; legend: state-of-the-art frontier, open vs. proprietary models]

STEM Leaderboard

1 model evaluated.

Rank  Model                       Organization               Parameters  Score
1     Qwen2.5-Coder 7B Instruct   Alibaba Cloud / Qwen Team  7B          0.340

FAQ

Common questions about STEM

What is STEM?
STEM is a comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning Science, Technology, Engineering, and Mathematics, designed to test neural models' vision-language STEM skills based on the K-12 curriculum. Unlike existing datasets that focus on expert-level ability, it covers fundamental skills designed around educational standards.

Where can I find the STEM paper?
The STEM paper is available at https://arxiv.org/abs/2402.17205. It provides detailed information about the benchmark methodology, dataset creation, and evaluation criteria.

How are models ranked on the STEM leaderboard?
The STEM leaderboard ranks 1 AI model based on its performance on this benchmark. Currently, Qwen2.5-Coder 7B Instruct by Alibaba Cloud / Qwen Team leads with a score of 0.340, which is also the average score across all models.

What is the highest STEM score?
The highest STEM score is 0.340, achieved by Qwen2.5-Coder 7B Instruct from Alibaba Cloud / Qwen Team.

How many models have been evaluated?
1 model has been evaluated on the STEM benchmark, with 0 verified results and 1 self-reported result.

How is STEM categorized?
STEM is categorized under math, multimodal, and reasoning. The benchmark evaluates multimodal models.