STEM

Progress Over Time

Interactive timeline showing model performance evolution on STEM

STEM Leaderboard

1 model

Rank 1: Alibaba Cloud / Qwen Team, 7B

About this benchmark

What is STEM?

STEM is a comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (science, technology, engineering, and mathematics). It is designed to test neural models' vision-language STEM skills against the K-12 curriculum. Unlike existing datasets that focus on expert-level ability, it covers fundamental skills designed around educational standards.

STEM is a multimodal benchmark that evaluates models in four categories: math, multimodal, reasoning, and vision. LLM Stats tracks 1 model on this benchmark, scored on a 0–1 scale. With a single entry, the average and the leading score are both 0.340.
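
As a rough illustration of how the summary figures relate to each other, the sketch below recomputes the leader and the average from a list of scores on the 0–1 scale and converts the leading score to the percentage form shown in the rankings. The `entries` list and the `summarize` helper are hypothetical names introduced for this example; they are not part of any LLM Stats API.

```python
from statistics import mean

# Hypothetical in-memory copy of the leaderboard; scores use the 0-1 scale.
entries = [
    {"model": "Qwen2.5-Coder 7B Instruct", "score": 0.340},
]


def summarize(rows):
    """Return the leading model, the mean score, and the leader's score as a percentage."""
    leader = max(rows, key=lambda row: row["score"])
    return {
        "leader": leader["model"],
        "average": mean(row["score"] for row in rows),
        "leader_percent": f"{leader['score'] * 100:.1f}%",
    }


summary = summarize(entries)
print(summary)
```

With only one entry, the average necessarily equals the leading score, which is why both figures match on this page.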

Current leaders

Qwen2.5-Coder 7B Instruct from Alibaba Cloud / Qwen Team currently leads the STEM leaderboard with a score of 0.340. It is the only model evaluated so far.

1. Qwen2.5-Coder 7B Instruct, Alibaba Cloud / Qwen Team: 34.0%

Source paper

Title
Measuring Vision-Language STEM Skills of Neural Models
Authors
Jianhao Shen, Ye Yuan, Srbuhi Mirzoyan, Ming Zhang, and 1 other
Published
Abstract

We introduce a new challenge to test the STEM skills of neural models. The problems in the real world often require solutions, combining knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, our dataset requires the understanding of multimodal vision-language information of STEM. Our dataset features one of the largest and most comprehensive datasets for the challenge. It includes 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark. Results show that the recent model advances only help master a very limited number of lower grade-level skills (2.5% in the third grade) in our dataset. In fact, these models are still well below (averaging 54.7%) the performance of elementary students, not to mention near expert-level performance. To understand and increase the performance on our dataset, we teach the models on a training split of our dataset. Even though we observe improved performance, the model performance remains relatively low compared to average elementary students. To solve STEM problems, we will need novel algorithmic innovations from the community.

FAQ

Common questions about the STEM benchmark and leaderboard.

What is the STEM benchmark?

STEM is a comprehensive multimodal benchmark dataset with 448 skills and 1,073,146 questions spanning all STEM subjects (science, technology, engineering, and mathematics). It is designed to test neural models' vision-language STEM skills against the K-12 curriculum. Unlike existing datasets that focus on expert-level ability, it covers fundamental skills designed around educational standards.

What is the STEM leaderboard?

The STEM leaderboard ranks AI models by their performance on this benchmark. It currently lists 1 model: Qwen2.5-Coder 7B Instruct by Alibaba Cloud / Qwen Team, with a score of 0.340. Because it is the only entry, the average score is also 0.340.

What is the highest STEM score?

The highest STEM score is 0.340, achieved by Qwen2.5-Coder 7B Instruct from Alibaba Cloud / Qwen Team.

How many models are evaluated on STEM?

1 model has been evaluated on the STEM benchmark. Its result is self-reported; no results have been verified.

Where can I find the STEM paper?

The STEM paper is available at https://arxiv.org/abs/2402.17205. The paper details the methodology, dataset construction, and evaluation criteria.

What categories does STEM cover?

STEM is categorized under math, multimodal, reasoning, and vision. The benchmark evaluates multimodal models.

What is the best open-source model on STEM?

Qwen2.5-Coder 7B Instruct by Alibaba Cloud / Qwen Team is the top-ranked open-source model on STEM, with a score of 0.340 (rank #1).

How recent are the STEM leaderboard results?

The STEM leaderboard was last updated in July 2026 and currently includes 1 evaluated model.