CharadesSTA
Charades-STA is a benchmark for temporal activity localization via natural language queries, extending the Charades dataset with sentence-level temporal annotations. It contains 12,408 training and 3,720 testing segment-sentence pairs, each pairing a natural language description with precise temporal boundaries for the activity it describes.
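Systems on this task are usually scored by temporal intersection-over-union (IoU) between a predicted segment and the ground-truth segment, aggregated as Recall@1 at a fixed IoU threshold (0.5 and 0.7 are common). The sketch below is a minimal illustration, assuming the commonly distributed `video_id start end##sentence` annotation format; the function names are illustrative, not the official evaluation code.

```python
# Minimal sketch of Charades-STA-style scoring (assumed annotation
# format and metric; not the official evaluation code).

def parse_annotation(line: str) -> tuple[str, float, float, str]:
    """Split one annotation line into (video_id, start, end, sentence)."""
    span, sentence = line.strip().split("##", 1)
    video_id, start, end = span.split()
    return video_id, float(start), float(end), sentence

def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.5):
    """Fraction of queries whose top prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Example: a prediction of [24.0, 31.0] against ground truth [24.3, 30.4]
print(temporal_iou((24.0, 31.0), (24.3, 30.4)))  # ~0.87, a hit at IoU >= 0.5
```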
Qwen3 VL 235B A22B Instruct from Alibaba Cloud / Qwen Team currently leads the CharadesSTA leaderboard with a score of 0.648 across 12 evaluated AI models. It is followed by Qwen3 VL 235B A22B Thinking and Qwen3 VL 30B A3B Instruct, which tie at 63.5%.
CharadesSTA Leaderboard
| Rank | Model | Organization | Params | Context | Cost (input / output) | License |
|---|---|---|---|---|---|---|
| 1 | Qwen3 VL 235B A22B Instruct | Alibaba Cloud / Qwen Team | 236B | 262K | $0.30 / $1.49 | — |
| 2 | Qwen3 VL 235B A22B Thinking | Alibaba Cloud / Qwen Team | 236B | 262K | $0.45 / $3.49 | — |
| 2 | Qwen3 VL 30B A3B Instruct | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $0.70 | — |
| 4 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 5 | — | Alibaba Cloud / Qwen Team | 31B | 262K | $0.20 / $1.00 | — |
| 6 | — | Alibaba Cloud / Qwen Team | 33B | — | — | — |
| 7 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.18 / $2.09 | — |
| 8 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $1.00 | — |
| 9 | — | Alibaba Cloud / Qwen Team | 9B | 262K | $0.08 / $0.50 | — |
| 10 | — | Alibaba Cloud / Qwen Team | 4B | 262K | $0.10 / $0.60 | — |
| 11 | — | Alibaba Cloud / Qwen Team | 34B | — | — | — |
| 12 | — | Alibaba Cloud / Qwen Team | 8B | — | — | — |
More evaluations to explore
Related benchmarks in the same category
- **MMLU-Pro.** A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
- **MMLU.** The Massive Multitask Language Understanding benchmark, testing knowledge across 57 diverse subjects including STEM, the humanities, the social sciences, and professional domains.
- **Humanity's Last Exam (HLE).** A multi-modal academic benchmark with 2,500 questions across mathematics, the humanities, and the natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions.
- **MMMU.** The Massive Multi-discipline Multimodal Understanding benchmark, designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. It contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines (Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering) across 30 subjects and 183 subfields.
- **MMMU-Pro.** A more robust multi-discipline multimodal understanding benchmark that enhances MMMU through a three-step process: filtering out questions answerable from text alone, augmenting candidate options, and introducing a vision-only input setting. Model accuracy drops by 16.8-26.9 points relative to the original MMMU, providing a more rigorous evaluation that more closely mimics real-world scenarios.
- **MMLU-Redux.** An improved version of the MMLU benchmark featuring manually re-annotated questions that identify and correct errors in the original dataset, providing more reliable evaluation metrics by addressing quality issues in the original MMLU.