YC-Bench

YC-Bench evaluates agents on long-horizon, open-ended business and investment decision-making. The reported metric is the final assets (fund value, in US dollars) accumulated by the agent over the course of the simulation.

MiniMax M3 from MiniMax currently leads the YC-Bench leaderboard with a score of $2,100,000; it is the only model evaluated so far.

About this benchmark

What YC-Bench measures

YC-Bench is a text benchmark that evaluates large language models on finance and agent tasks. LLM Stats tracks 1 model on this benchmark, with a maximum possible score of $10,000,000. The current average across reported models is $2,100,000, which equals the leader's score because only one model has been reported.

Compare leaders on the best AI for finance and best AI for agents leaderboards.

Progress Over Time

[Interactive timeline of model performance on YC-Bench. Legend: state-of-the-art frontier, open, proprietary.]

YC-Bench Leaderboard

1 model

Rank  Model                 Context  Cost           License
1     MiniMax M3 (MiniMax)  1.0M     $0.60 / $2.40  -

FAQ

Common questions about YC-Bench.

What is the YC-Bench benchmark?

YC-Bench evaluates agents on long-horizon, open-ended business and investment decision-making. The reported metric is the final assets (fund value, in US dollars) accumulated by the agent over the course of the simulation.

What is the YC-Bench leaderboard?

The YC-Bench leaderboard ranks AI models by their performance on this benchmark; 1 model is currently listed. MiniMax M3 by MiniMax leads with a score of $2,100,000. With a single entry, the average score across all models is also $2,100,000.

What is the highest YC-Bench score?

The highest YC-Bench score is $2,100,000, achieved by MiniMax M3 from MiniMax.

How many models are evaluated on YC-Bench?

1 model has been evaluated on the YC-Bench benchmark, with 0 verified results and 1 self-reported result.

What categories does YC-Bench cover?

YC-Bench is categorized under finance and agents. The benchmark evaluates text models.

What is the best open-source model on YC-Bench?

MiniMax M3 by MiniMax is the top-ranked open-source model on YC-Bench, with a score of $2,100,000 (rank #1).

Which model offers the best value on YC-Bench?

Among models scoring within 10% of the leader, MiniMax M3 from MiniMax is the cheapest, at $0.60 per million input tokens with a score of $2,100,000.
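
The selection rule in this answer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` type, the `best_value` function, and the `tolerance` parameter are hypothetical names, scores are assumed to be positive, and "within 10% of the leader" is read as a score of at least 90% of the top score. How LLM Stats actually computes this is not described on the page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    model: str
    score: float        # final assets in US dollars
    input_price: float  # US dollars per million input tokens


def best_value(entries: list[Entry], tolerance: float = 0.10) -> Optional[Entry]:
    """Return the cheapest entry whose score is within `tolerance` of the leader's."""
    if not entries:
        return None
    top_score = max(e.score for e in entries)
    # Assumption: scores are positive, so the cutoff is a fraction of the top score.
    cutoff = top_score * (1 - tolerance)
    candidates = [e for e in entries if e.score >= cutoff]
    return min(candidates, key=lambda e: e.input_price)


# The single entry currently on the leaderboard, using the figures quoted above.
leaderboard = [Entry("MiniMax M3", 2_100_000.0, 0.60)]
print(best_value(leaderboard).model)  # prints "MiniMax M3"
```

With only one model listed, the rule trivially returns the leader; it becomes informative once several models fall inside the 10% band.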

How recent are the YC-Bench leaderboard results?

The YC-Bench leaderboard was last updated in June 2026 and currently includes 1 evaluated model.

More evaluations to explore

Related benchmarks in the same category

MMLU-Pro

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

finance
124 models
MMLU

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.

finance
99 models
BrowseComp

BrowseComp is a benchmark comprising 1,266 questions that challenge AI agents to persistently navigate the internet in search of hard-to-find, entangled information. The benchmark measures agents' ability to exercise persistence in information gathering, demonstrate creativity in web navigation, and find concise, verifiable answers. Despite the difficulty of the questions, BrowseComp is simple and easy-to-use, as predicted answers are short and easily verifiable against reference answers.

agents
48 models
Terminal-Bench 2.0

Terminal-Bench 2.0 is an updated benchmark for testing AI agents' tool use ability to operate a computer via terminal. It evaluates how well models can handle real-world, end-to-end tasks autonomously, including compiling code, training models, setting up servers, system administration, security tasks, data science workflows, and cybersecurity vulnerabilities.

agents
44 models
SuperGPQA

SuperGPQA is a comprehensive benchmark that evaluates large language models across 285 graduate-level academic disciplines. The benchmark contains 25,957 questions covering 13 broad disciplinary areas including Engineering, Medicine, Science, and Law, with specialized fields in light industry, agriculture, and service-oriented domains. It employs a Human-LLM collaborative filtering mechanism with over 80 expert annotators to create challenging questions that assess graduate-level knowledge and reasoning capabilities.

finance
31 models
MMLU-ProX

Extended version of MMLU-Pro providing additional challenging multiple-choice questions for evaluating language models across diverse academic and professional domains. Built on the foundation of the Massive Multitask Language Understanding benchmark framework.

finance
30 models