YC-Bench
YC-Bench evaluates agents on long-horizon, open-ended business and investment decision-making. The reported metric is the final assets (fund value, in US dollars) accumulated by the agent over the course of the simulation.
MiniMax M3 from MiniMax currently leads the YC-Bench leaderboard with final assets of $2,100,000. It is the only model evaluated so far.
What YC-Bench measures
YC-Bench is a text benchmark that evaluates large language models on finance and agentic tasks. LLM Stats tracks 1 model on this benchmark, which has a maximum possible score of $10,000,000. Because only one model has been reported, the current average and the leading score are the same: $2,100,000.
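The summary figures above follow from simple arithmetic over the reported scores. The sketch below shows one way to derive them; the dictionary layout and the `summarize` helper are illustrative assumptions, not part of YC-Bench or LLM Stats, and the "fraction of maximum" is an extra convenience figure rather than a metric the leaderboard reports.

```python
# Scores as reported on this page; the data layout is an illustrative assumption.
reported_scores = {"MiniMax M3": 2_100_000.0}
MAX_SCORE = 10_000_000.0  # maximum possible score stated on this page


def summarize(scores, max_score):
    """Return the leader, the mean score, and the leader's share of the maximum."""
    leader = max(scores, key=scores.get)
    mean = sum(scores.values()) / len(scores)
    return {
        "leader": leader,
        "leader_score": scores[leader],
        "mean_score": mean,
        "leader_fraction_of_max": scores[leader] / max_score,
    }


summary = summarize(reported_scores, MAX_SCORE)
print(summary)
```

With a single reported model, the mean and the leader's score coincide, and the leader sits at roughly a fifth of the maximum possible score.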
Compare leaders on the best AI for finance and best AI for agents leaderboards.
YC-Bench Leaderboard
| Rank | Model | Organization | Score | Context | Cost | License |
|---|---|---|---|---|---|---|
| 1 | MiniMax M3 | MiniMax | 2,100,000 | 1.0M | $0.60 / $2.40 | — |
More evaluations to explore
Related benchmarks in the same category
A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. It features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to the original MMLU.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects, including STEM, humanities, social sciences, and professional domains.
BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.
Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.
SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25,957 questions covering 13 broad disciplinary areas including Engineering, Medicine, Science, and Law, with specialized fields in light industry, agriculture, and service-oriented domains. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to create challenging questions that assess graduate-level knowledge and reasoning capabilities.
Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.