AI Trends

AI statistics and LLM growth trends visualized. Track AI model performance, pricing evolution, and the race between nations and organizations in generative AI.

Act 1

The Landscape

What's happening in AI right now? A snapshot of the global race for artificial intelligence, from key metrics to the animated competition between nations.

Key Metrics

A snapshot of where AI stands today across performance, cost, and capability.

The AI Arms Race

Watch the global competition unfold as countries race to release AI models. Hit play to see cumulative releases animate through time.

When competition intensifies, innovation accelerates. This pattern repeats across every technological frontier.

Geographic clustering of research creates feedback loops: talent attracts capital, capital funds infrastructure, infrastructure enables breakthroughs. The distribution of model releases maps directly onto these concentrations of capability.
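
As a rough illustration of how a cumulative-release view like this can be built, here is a minimal sketch using pandas; the countries, dates, and column names are hypothetical placeholders, not the site's actual data or schema.

    import pandas as pd

    # Hypothetical release log: one row per model launch (illustrative only).
    releases = pd.DataFrame({
        "country": ["US", "US", "China", "US", "China", "France"],
        "release_date": pd.to_datetime([
            "2023-03-14", "2023-11-06", "2024-01-10",
            "2024-05-13", "2024-06-17", "2024-07-18",
        ]),
    })

    # Count releases per country per month, then take a running total so the
    # curve shows cumulative releases over time (the "race" view).
    monthly = (
        releases
        .groupby(["country", pd.Grouper(key="release_date", freq="MS")])
        .size()
        .unstack("country", fill_value=0)
        .cumsum()
    )
    print(monthly)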

Act 2

The Players

Who's competing? Who's winning? Zooming from nations to laboratories to individual models, and the philosophical divide between open and closed.

Days at the Top

How long each model held the #1 GPQA spot before being dethroned. Watch the bars get shorter as the race accelerates.

Dominance duration shrinks as competition intensifies. The faster things move, the harder any lead is to maintain.

This compression of tenure at the top follows a power law: early advantages compound, but so do the efforts to overcome them. When multiple well-funded teams pursue the same objective, breakthroughs arrive from unexpected directions.
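
A back-of-the-envelope version of this calculation, assuming a simple chronological list of (model, release date, GPQA score) records rather than the site's real data:

    from datetime import date

    # Hypothetical (model, release date, GPQA score) records, sorted by date.
    results = [
        ("model-a", date(2024, 1, 10), 59.1),
        ("model-b", date(2024, 4, 2), 63.7),
        ("model-c", date(2024, 6, 20), 67.2),
        ("model-d", date(2024, 8, 1), 71.5),
    ]

    # A model "holds the top spot" from the day it sets a new best score
    # until the day a later model beats that score.
    leaders = []          # (model, start_date) for each new record holder
    best = float("-inf")
    for model, released, score in results:
        if score > best:
            leaders.append((model, released))
            best = score

    for (model, start), (_, end) in zip(leaders, leaders[1:]):
        print(model, (end - start).days, "days at #1")
    print(leaders[-1][0], "still at #1")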

Open vs Closed Releases

What proportion of new models are open source vs proprietary? Track the philosophical divide quarter by quarter.

Open systems eventually catch closed ones. The lag shrinks as knowledge diffuses and techniques become reproducible.

Proprietary advantages erode when the underlying science is understood. Once a capability is proven possible, independent teams can reverse-engineer the approach. The question shifts from 'if' to 'how soon.'
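
A minimal sketch of the quarter-by-quarter share computation, assuming a hypothetical release table with a date and an is_open flag:

    import pandas as pd

    # Hypothetical release table (not the site's actual data).
    models = pd.DataFrame({
        "release_date": pd.to_datetime([
            "2024-01-15", "2024-02-20", "2024-04-05",
            "2024-04-30", "2024-07-12", "2024-08-01",
        ]),
        "is_open": [True, False, True, True, False, True],
    })

    # Share of open-weight releases per calendar quarter.
    share = (
        models
        .groupby(models["release_date"].dt.to_period("Q"))["is_open"]
        .mean()            # mean of a boolean column = fraction that is open
        .rename("open_share")
    )
    print(share)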

Act 3

The Technology

How is AI actually improving? Understanding capabilities, engineering constraints, and the tradeoffs that define the frontier.

The Multimodal Shift

What percentage of new models are multimodal vs text-only? Vision is now table stakes.

Capabilities stack. Once a modality is solved, it becomes a baseline expectation rather than a differentiator.

Each new input type—vision, audio, video—follows the same adoption curve: initially rare, then common, then required. Systems that can reason across modalities unlock compound capabilities unavailable to single-mode approaches.

Lab Progress

Which labs improved the most? GPQA score change over the last 12 months, ranked by total gain.
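
One way to compute a ranking like this, sketched here with made-up labs and scores: take each lab's best GPQA score today and its best score before a twelve-month cutoff, then sort by the difference.

    import pandas as pd

    # Hypothetical per-model GPQA scores with lab and release date (illustrative only).
    scores = pd.DataFrame({
        "lab": ["LabA", "LabA", "LabB", "LabB", "LabC"],
        "release_date": pd.to_datetime([
            "2023-09-01", "2024-06-15", "2023-10-20", "2024-07-01", "2024-03-10",
        ]),
        "gpqa": [48.0, 67.5, 52.3, 61.0, 58.8],
    })

    cutoff = pd.Timestamp("2024-01-01")  # stands in for "12 months ago"

    # Best score each lab had before the cutoff vs its best score today.
    best_before = scores[scores["release_date"] < cutoff].groupby("lab")["gpqa"].max()
    best_now = scores.groupby("lab")["gpqa"].max()

    # Labs with no pre-cutoff model drop out of the comparison (NaN after subtraction).
    gain = (best_now - best_before).dropna().sort_values(ascending=False)
    print(gain)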

Pareto Frontiers

How performance trades off against cost, speed, and model size across different benchmarks.
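
As an illustration of how one of these frontiers (performance versus cost) can be extracted, here is a small sketch; the model names, prices, and scores are invented.

    # Hypothetical (model, price per 1M tokens in USD, benchmark score) tuples.
    models = [
        ("m1", 30.0, 72.0),
        ("m2", 10.0, 70.0),
        ("m3", 8.0, 74.0),
        ("m4", 1.0, 61.0),
        ("m5", 0.5, 55.0),
        ("m6", 2.0, 58.0),
    ]

    # A model is on the Pareto frontier if no other model is both cheaper
    # and higher scoring: sort by price, keep the score record-setters.
    frontier = []
    best_score = float("-inf")
    for name, price, score in sorted(models, key=lambda m: (m[1], -m[2])):
        if score > best_score:
            frontier.append((name, price, score))
            best_score = score

    print(frontier)  # cheapest-to-priciest models that improve on score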

Act 4

The Economics

What does intelligence cost? The practical implications of AI pricing, and why the economics matter more than ever.

Cost deflation in compute-intensive industries follows exponential curves, not linear ones.

When infrastructure scales and competition increases, prices don't just fall—they collapse. Each order-of-magnitude reduction unlocks new use cases that were previously economically impossible, expanding the total addressable market.

Capability vs Price

SWE-Bench SOTA capability (line) vs average model price (bars) over time. Same price, more capability.
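
A rough sketch of how the two series in a chart like this can be derived from a release log; the rows and column names below are placeholders, not real figures.

    import pandas as pd

    # Hypothetical rows: release date, SWE-Bench score, blended price per 1M tokens.
    df = pd.DataFrame({
        "release_date": pd.to_datetime([
            "2024-01-10", "2024-03-05", "2024-05-20", "2024-08-14", "2024-11-02",
        ]),
        "swe_bench": [28.0, 33.5, 41.2, 49.0, 55.3],
        "price_per_mtok": [15.0, 12.0, 6.0, 4.0, 2.5],
    }).sort_values("release_date")

    quarters = df["release_date"].dt.to_period("Q")

    # Line: best SWE-Bench score achieved so far (running state of the art).
    sota = df.groupby(quarters)["swe_bench"].max().cummax()

    # Bars: average price of models released in each quarter.
    avg_price = df.groupby(quarters)["price_per_mtok"].mean()

    print(pd.DataFrame({"swe_bench_sota": sota, "avg_price": avg_price}))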

Act 5

The Measurement Problem

Are we even measuring the right things? A meta-analysis of benchmarks, human preferences, and the gap between what we test and what we value.

Benchmark Saturation

Tracking which benchmarks are approaching saturation vs still challenging. Red zone indicates benchmarks that may no longer differentiate model capabilities.

Every measurement instrument has a ceiling. When systems exceed it, we learn about the test, not the system.

Benchmark saturation is a signal of progress, but also a limitation. A test that everyone passes reveals nothing about relative capability. The search for harder evaluations is itself a form of scientific inquiry.
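
One simple heuristic for flagging saturation, sketched with illustrative numbers and thresholds: treat a benchmark as saturating once the best reported score sits close to the ceiling and recent gains are marginal.

    # Hypothetical summaries: (name, ceiling, best score, gain over last 12 months).
    benchmarks = [
        ("MMLU", 100.0, 92.5, 1.2),
        ("HumanEval", 100.0, 96.0, 0.8),
        ("GPQA", 100.0, 76.0, 9.5),
        ("SWE-Bench", 100.0, 55.0, 14.0),
    ]

    def is_saturating(ceiling, best, recent_gain,
                      headroom_pct=10.0, gain_pct=2.0):
        # Flag benchmarks with little headroom left and little recent movement.
        headroom = 100.0 * (ceiling - best) / ceiling
        return headroom < headroom_pct and recent_gain < gain_pct

    for name, ceiling, best, gain in benchmarks:
        status = "red zone" if is_saturating(ceiling, best, gain) else "still informative"
        print(f"{name}: {status}")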

The Benchmark Genome

Correlation matrix revealing which benchmarks measure similar capabilities.
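
A minimal version of this computation with pandas, assuming a hypothetical models-by-benchmarks score table:

    import pandas as pd

    # Hypothetical score table: rows are models, columns are benchmarks.
    scores = pd.DataFrame(
        {
            "GPQA": [45.0, 52.0, 61.0, 67.0, 75.0],
            "MMLU": [78.0, 82.0, 85.0, 88.0, 90.0],
            "HumanEval": [60.0, 71.0, 68.0, 84.0, 92.0],
        },
        index=["model-1", "model-2", "model-3", "model-4", "model-5"],
    )

    # Pairwise Pearson correlation: benchmarks that move together likely
    # measure overlapping capabilities.
    corr = scores.corr()
    print(corr.round(2))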

Do Humans Agree With Benchmarks?

Comparing human arena ratings with benchmark scores. R² reveals how well benchmarks predict what users actually prefer.

Objective performance and subjective preference measure different things. Both matter.

Test scores capture what a system can do in controlled conditions. Human preference captures how it feels to use. The gap between them reveals the difference between capability and experience—a distinction that matters for real-world deployment.
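
A small sketch of the R² calculation, fitting a least-squares line of arena rating against benchmark score; the numbers are illustrative, not measured.

    import numpy as np

    # Hypothetical pairs: benchmark score vs human arena rating, per model.
    benchmark = np.array([55.0, 61.0, 66.0, 70.0, 74.0, 78.0])
    arena = np.array([1180.0, 1215.0, 1232.0, 1261.0, 1270.0, 1305.0])

    # Least-squares line arena ~ a * benchmark + b, then R² of the fit.
    a, b = np.polyfit(benchmark, arena, deg=1)
    predicted = a * benchmark + b

    ss_res = np.sum((arena - predicted) ** 2)
    ss_tot = np.sum((arena - arena.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot

    print(f"R^2 = {r_squared:.3f}")  # closer to 1 = benchmark predicts preference well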

The State of AI

The pace of change in AI statistics is hard to overstate. Models that topped benchmarks six months ago are now middle of the pack. New AI growth trends are showing up in reasoning depth, multimodal understanding, and raw efficiency.

Much of this AI industry growth comes from labs competing on every front. OpenAI, Anthropic, Google, and Meta keep raising the bar, while Mistral, DeepSeek, and Alibaba release open-weight models that perform surprisingly well. We track these shifts across 500+ models and 50+ benchmarks in our LLM statistics.

Key AI Statistics

The forces shaping how AI models improve

US vs China AI Race

US labs like OpenAI, Anthropic, and Google still lead most benchmarks. But Chinese labs (DeepSeek, Alibaba, ByteDance) are closing in fast, especially on reasoning and coding tasks.

Open vs Closed Source

The gap is shrinking. Llama, Mistral, and Qwen now match or beat GPT-4 on several benchmarks. You can run capable models locally that would have required API access a year ago.

Falling Inference Costs

Prices keep dropping. GPT-4-level performance cost $30/M tokens in 2023. Today you can get it for under $1/M. Competition and better infrastructure are driving 10-100x reductions each year.

Parameter Efficiency

Smaller models are catching up. A 7B model today can hit scores that took 70B+ parameters last year. This means you can run strong models on a laptop or deploy them affordably.

Understanding AI Benchmark Statistics

AI benchmark statistics give you a way to compare models on specific tasks. GPQA tests graduate-level science reasoning. HumanEval measures code generation. MMLU covers broad knowledge. Each benchmark tells you something different about AI performance data.

When you look at the LLM growth rate across these benchmarks, the improvement is clear. GPQA scores went from around 50% to 75%+ in just 18 months. That kind of language model growth will likely continue, though some benchmarks are starting to saturate.

500+ models tracked · 50+ benchmarks · Updated daily

Frequently Asked Questions

Common questions about AI statistics, growth trends, and industry data

What are the current AI growth trends?

A few things stand out right now. Reasoning models like OpenAI o1 and DeepSeek-R1 are trading speed for accuracy, and it works. Multimodal is becoming table stakes for frontier models. Costs have dropped so much that GPT-4-level performance now runs at 1/100th of what it cost two years ago. And open-source is catching up faster than most expected.

How do US and China compare in AI development?

US labs (OpenAI, Anthropic, Google, Meta) still lead on most benchmarks, but the gap is closing. Chinese labs like DeepSeek, Alibaba, and ByteDance have shipped models that compete on coding and reasoning. Our US vs China chart tracks how each country's best models perform over time.

Are open-source AI models catching up to proprietary ones?

Yes. Llama 3, Mistral, Qwen, and DeepSeek now match GPT-4 and Claude on many benchmarks. Open-weight releases typically lag proprietary models by 6-18 months, but that window keeps shrinking. Check our Open vs Closed chart to see the progression.

How fast are AI inference costs decreasing?

Fast. Roughly 10x per year for the same level of performance. GPT-4-level capabilities cost around $30 per million tokens in early 2023. Now you can get that for under $1. Our Price Trends chart shows how competition and infrastructure improvements keep pushing costs down.
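
As a rough sanity check on that rate, treating the elapsed time as about a year and a half (an assumption) and the prices as approximate: a drop from $30 to $1 per million tokens is roughly 30x overall, which compounds to close to 10x per year.

    # Implied annual price-reduction factor from two approximate price points.
    start_price = 30.0   # ~$ per 1M tokens, early 2023 (approximate)
    end_price = 1.0      # ~$ per 1M tokens, today (approximate)
    years = 1.5          # assumed elapsed time between the two points

    total_drop = start_price / end_price             # ~30x overall
    annual_factor = total_drop ** (1 / years)        # compound rate per year

    print(f"total drop: {total_drop:.0f}x, per year: {annual_factor:.1f}x")
    # -> total drop: 30x, per year: ~9.7x, consistent with "roughly 10x per year"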

What AI statistics does LLM Stats track?

We track benchmark scores across 50+ evals (GPQA, HumanEval, MMLU, and more), pricing from 20+ API providers, throughput and latency data, plus model specs like parameter counts and context windows. All of this covers 500+ models, updated daily.

Explore More

Dive deeper into AI statistics, benchmarks, and comparisons