SWE-Bench Multimodal

SWE-Bench Multimodal extends SWE-Bench to evaluate language models on software engineering tasks that involve visual inputs such as screenshots, UI mockups, and diagrams alongside code understanding.

Progress Over Time

[Interactive timeline: model performance over time on SWE-Bench Multimodal; series shown: state-of-the-art frontier, open, proprietary]
SWE-Bench Multimodal Leaderboard

1 model listed.

Rank 1: Claude Mythos Preview (Anthropic). Cost: $25.00 / $125.00. Context and license: not listed.

FAQ

Common questions about SWE-Bench Multimodal

The SWE-Bench Multimodal leaderboard currently ranks one AI model. Claude Mythos Preview by Anthropic leads with a score of 0.590; as the only entry, its score is also the average across all models.
The highest SWE-Bench Multimodal score is 0.590, achieved by Claude Mythos Preview from Anthropic.
One model has been evaluated on the SWE-Bench Multimodal benchmark, with zero verified results and one self-reported result.
SWE-Bench Multimodal is categorized under agents, code, multimodal, reasoning, and vision. The benchmark evaluates multimodal models.
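For context on how a SWE-Bench-style score such as 0.590 is typically derived, it is the fraction of task instances whose generated patch resolves the issue (i.e., passes the instance's tests). Below is a minimal sketch of that calculation; the instance data and field names are illustrative, not the official evaluation harness.

```python
# Minimal sketch of SWE-Bench-style scoring: the score is the fraction
# of task instances marked resolved. Field names and the sample data are
# illustrative, not the official harness.

def resolution_rate(results):
    """Return the fraction of instances whose patch resolved the issue."""
    if not results:
        return 0.0
    resolved = sum(1 for r in results if r["resolved"])
    return resolved / len(results)

# Hypothetical per-instance outcomes for a 100-instance evaluation run,
# with 59 instances resolved.
results = [{"instance_id": f"repo__issue-{i}", "resolved": i < 59}
           for i in range(100)]

print(f"{resolution_rate(results):.3f}")  # 0.590
```

A leaderboard entry like the 0.590 above would then simply be this rate reported over the benchmark's full instance set.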