SWE-Bench Multimodal
SWE-Bench Multimodal extends SWE-Bench to evaluate language models on software engineering tasks that involve visual inputs such as screenshots, UI mockups, and diagrams alongside code understanding.
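For a concrete sense of what a task instance looks like, here is a minimal sketch of loading the benchmark from the Hugging Face Hub with the `datasets` library. The dataset id `princeton-nlp/SWE-bench_Multimodal` and the field names below are assumptions based on the original SWE-bench release; consult the dataset card for the actual schema.

```python
# Minimal sketch: inspect a SWE-bench Multimodal task instance.
# ASSUMPTIONS: the Hub dataset id "princeton-nlp/SWE-bench_Multimodal"
# and the field names below are unverified here; check the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Multimodal", split="test")

example = ds[0]
print(example["instance_id"])        # repo/issue identifier
print(example["problem_statement"])  # issue text; may reference attached images
print(example["image_assets"])       # screenshot/mockup URLs tied to the issue
```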
Progress Over Time
Interactive timeline showing model performance evolution on SWE-Bench Multimodal (legend: state-of-the-art frontier; open vs. proprietary models).
SWE-Bench Multimodal Leaderboard
1 model
| Rank | Model | Organization | Context | Cost | License | Score |
|---|---|---|---|---|---|---|
| 1 | Claude Mythos Preview | Anthropic | — | $25.00 / $125.00 | — | 0.590 |
FAQ
Common questions about SWE-Bench Multimodal
What is SWE-Bench Multimodal?
SWE-Bench Multimodal extends SWE-Bench to evaluate language models on software engineering tasks that involve visual inputs such as screenshots, UI mockups, and diagrams alongside code understanding.
How are models ranked on the SWE-Bench Multimodal leaderboard?
The SWE-Bench Multimodal leaderboard ranks 1 AI model by its performance on this benchmark. Currently, Claude Mythos Preview by Anthropic leads with a score of 0.590; with a single model evaluated, the average score is likewise 0.590.
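As a sanity check, here is a tiny sketch of how the rank and average above follow from the leaderboard data; the single entry mirrors the one self-reported score, and the `(model, score)` tuple layout is purely illustrative.

```python
# Sketch: derive rank and average from (model, score) pairs.
# The single entry mirrors the leaderboard above; layout is illustrative.
entries = [("Claude Mythos Preview", 0.590)]

ranked = sorted(entries, key=lambda e: e[1], reverse=True)
average = sum(score for _, score in entries) / len(entries)

print(ranked[0][0], ranked[0][1])  # Claude Mythos Preview 0.59
print(round(average, 3))           # 0.59 -- equals the top score with one model
```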
What is the highest score on SWE-Bench Multimodal?
The highest SWE-Bench Multimodal score is 0.590, achieved by Claude Mythos Preview from Anthropic.
How many models have been evaluated on SWE-Bench Multimodal?
1 model has been evaluated on the SWE-Bench Multimodal benchmark, with 0 verified results and 1 self-reported result.
What categories does SWE-Bench Multimodal fall under?
SWE-Bench Multimodal is categorized under agents, code, multimodal, reasoning, and vision. The benchmark evaluates multimodal models.