AI Trends
AI statistics and LLM growth trends visualized. Track AI model performance, pricing evolution, and the race between nations and organizations in generative AI.
Act 1
The Landscape
What's happening in AI right now? A snapshot of the global race for artificial intelligence, from key metrics to the animated competition between nations.
Key Metrics
A snapshot of where AI stands today across performance, cost, and capability.
The AI Arms Race
Watch the global competition unfold as countries race to release AI models. Hit play to see cumulative releases animate through time.
When competition intensifies, innovation accelerates. This pattern repeats across every technological frontier.
Geographic clustering of research creates feedback loops: talent attracts capital, capital funds infrastructure, infrastructure enables breakthroughs. The distribution of model releases maps directly onto these concentrations of capability.
Act 2
The Players
Who's competing? Who's winning? Zooming from nations to laboratories to individual models, and the philosophical divide between open and closed.
Days at the Top
How long each model held the #1 GPQA spot before being dethroned. Watch the bars get shorter as the race accelerates.
Dominance duration shrinks as competition intensifies. The faster things move, the harder any lead is to maintain.
This compression of tenure at the top follows a power law: early advantages compound, but so do the efforts to overcome them. When multiple well-funded teams pursue the same objective, breakthroughs arrive from unexpected directions.
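For readers who want to reproduce this view from raw leaderboard data, the computation is a simple sweep over releases in date order: a model starts a reign whenever it sets a new best score, and the reign ends when the next record arrives. The sketch below uses made-up (model, release date, score) records, not our actual leaderboard.

```python
from datetime import date

# Hypothetical (model, release_date, GPQA score) records; not real leaderboard values.
releases = [
    ("model-a", date(2023, 3, 14), 0.36),
    ("model-b", date(2023, 12, 6), 0.47),
    ("model-c", date(2024, 5, 20), 0.51),
    ("model-d", date(2024, 9, 12), 0.60),
]

reigns = []                      # (model, days held at #1)
best_score = float("-inf")
leader, since = None, None

for model, released, score in sorted(releases, key=lambda r: r[1]):
    if score > best_score:       # a new state of the art dethrones the current leader
        if leader is not None:
            reigns.append((leader, (released - since).days))
        leader, since, best_score = model, released, score

reigns.append((leader, (date.today() - since).days))   # current leader, still counting

for model, days in reigns:
    print(f"{model}: {days} days at #1")
```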
Open vs Closed Releases
What proportion of new models are open source vs proprietary? Track the philosophical divide quarter by quarter.
Open systems eventually catch closed ones. The lag shrinks as knowledge diffuses and techniques become reproducible.
Proprietary advantages erode when the underlying science is understood. Once a capability is proven possible, independent teams can reverse-engineer the approach. The question shifts from 'if' to 'how soon.'
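The quarter-by-quarter split itself is straightforward to compute once each release is tagged as open-weight or proprietary. A minimal sketch, assuming a release table with a boolean is_open column; the column names and rows here are illustrative, not our actual schema.

```python
import pandas as pd

# Illustrative release records; in practice this would be the full model list.
df = pd.DataFrame({
    "released": pd.to_datetime(["2023-02-01", "2023-03-15", "2023-08-01", "2024-01-10"]),
    "is_open":  [False, True, True, False],
})

open_share = (
    df.assign(quarter=df["released"].dt.to_period("Q"))
      .groupby("quarter")["is_open"]
      .mean()                     # fraction of that quarter's releases that are open-weight
      .rename("open_share")
)
print(open_share)
```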
Act 3
The Technology
How is AI actually improving? Understanding capabilities, engineering constraints, and the tradeoffs that define the frontier.
The Multimodal Shift
What percentage of new models are multimodal vs text-only? Vision is now table stakes.
Capabilities stack. Once a modality is solved, it becomes a baseline expectation rather than a differentiator.
Each new input type—vision, audio, video—follows the same adoption curve: initially rare, then common, then required. Systems that can reason across modalities unlock compound capabilities unavailable to single-mode approaches.
Lab Progress
Which labs improved the most? GPQA score change over the last 12 months, ranked by total gain.
Pareto Frontiers
How performance trades off against cost, speed, and model size across different benchmarks.
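A Pareto frontier is simply the set of models that no other model beats on both axes at once: nothing else is cheaper (or equally cheap) and better at the same time. Below is a minimal sketch of the price-vs-score case, with placeholder numbers rather than real model data.

```python
def pareto_frontier(points):
    """Keep (price, score, name) points that no other point dominates:
    cheaper or equal AND scoring at least as well, strictly better in one."""
    frontier = []
    # Sort by ascending price, breaking ties by descending score.
    for price, score, name in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or score > frontier[-1][1]:   # must beat the best score seen so far
            frontier.append((price, score, name))
    return frontier

models = [  # hypothetical ($/M tokens, benchmark score, name) triples
    (30.0, 0.74, "model-a"),
    (1.0, 0.70, "model-b"),
    (0.3, 0.55, "model-c"),
    (5.0, 0.68, "model-d"),
]
for price, score, name in pareto_frontier(models):
    print(f"{name}: ${price}/M tokens, score {score:.2f}")
```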
Act 4
The Economics
What does intelligence cost? The practical implications of AI pricing, and why the economics matter more than ever.
Cost deflation in compute-intensive industries follows exponential curves, not linear ones.
When infrastructure scales and competition increases, prices don't just fall—they collapse. Each order-of-magnitude reduction unlocks new use cases that were previously economically impossible, expanding the total addressable market.
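To make that concrete, here is a back-of-envelope comparison using the per-million-token prices cited further down this page (roughly $30/M for GPT-4-level output in 2023 versus under $1/M today); the workload size is a hypothetical round number.

```python
tokens_per_month = 1_000_000_000     # a hypothetical 1B tokens of monthly traffic

for label, price_per_million in [("2023, GPT-4-level pricing", 30.00),
                                 ("today, comparable capability", 1.00)]:
    monthly_cost = tokens_per_month / 1_000_000 * price_per_million
    print(f"{label}: ${monthly_cost:,.0f}/month")

# $30,000/month vs $1,000/month: a 30x drop means the same budget buys 30x the
# volume, or a product that could never absorb the old bill becomes viable.
```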
Capability vs Price
SWE-Bench SOTA capability (line) vs average model price (bars) over time. Same price, more capability.
Act 5
The Measurement Problem
Are we even measuring the right things? A meta-analysis of benchmarks, human preferences, and the gap between what we test and what we value.
Benchmark Saturation
Tracking which benchmarks are approaching saturation vs still challenging. Red zone indicates benchmarks that may no longer differentiate model capabilities.
Every measurement instrument has a ceiling. When systems exceed it, we learn about the test, not the system.
Benchmark saturation is a signal of progress, but also a limitation. A test that everyone passes reveals nothing about relative capability. The search for harder evaluations is itself a form of scientific inquiry.
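One simple way to operationalize that red zone is to flag any benchmark whose best reported score sits within a few points of its ceiling. The threshold and scores below are illustrative, not the actual rule or data behind this chart.

```python
CEILING = 100.0
RED_ZONE_MARGIN = 5.0        # within 5 points of the ceiling counts as saturating

top_scores = {"MMLU": 92.3, "HumanEval": 96.0, "GPQA": 78.1}   # made-up values

for benchmark, best in top_scores.items():
    saturating = CEILING - best <= RED_ZONE_MARGIN
    print(f"{benchmark}: top score {best:.1f} -> "
          f"{'saturating' if saturating else 'still differentiating'}")
```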
The Benchmark Genome
Correlation matrix revealing which benchmarks measure similar capabilities.
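Building such a matrix is a one-liner once scores are arranged with one row per model and one column per benchmark; highly correlated columns are likely measuring overlapping capabilities. A sketch with placeholder scores:

```python
import pandas as pd

scores = pd.DataFrame(
    {   # placeholder scores: one column per benchmark, one row per model
        "GPQA":      [0.48, 0.55, 0.61, 0.72],
        "HumanEval": [0.70, 0.78, 0.85, 0.90],
        "MMLU":      [0.79, 0.83, 0.86, 0.88],
    },
    index=["model-a", "model-b", "model-c", "model-d"],
)

print(scores.corr())   # pairwise Pearson correlations between benchmarks
```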
Do Humans Agree With Benchmarks?
Comparing human arena ratings with benchmark scores. R² reveals how well benchmarks predict what users actually prefer.
Objective performance and subjective preference measure different things. Both matter.
Test scores capture what a system can do in controlled conditions. Human preference captures how it feels to use. The gap between them reveals the difference between capability and experience—a distinction that matters for real-world deployment.
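The R² figure itself comes from an ordinary least-squares fit of arena ratings against a benchmark score. A minimal sketch with invented ratings and scores, purely to show the computation:

```python
import numpy as np

benchmark = np.array([0.52, 0.61, 0.68, 0.74, 0.80])   # e.g. GPQA accuracy
arena     = np.array([1120, 1180, 1210, 1260, 1275])   # e.g. arena rating

slope, intercept = np.polyfit(benchmark, arena, 1)     # degree-1 least-squares fit
predicted = slope * benchmark + intercept

ss_res = np.sum((arena - predicted) ** 2)
ss_tot = np.sum((arena - arena.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"R^2 = {r_squared:.2f}")   # closer to 1 means the benchmark tracks preference closely
```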
The State of AI in
The pace of change in AI statistics is hard to overstate. Models that topped benchmarks six months ago are now middle of the pack. New AI growth trends are showing up in reasoning depth, multimodal understanding, and raw efficiency.
Much of this AI industry growth comes from labs competing on every front. OpenAI, Anthropic, Google, and Meta keep raising the bar, while Mistral, DeepSeek, and Alibaba release open-weight models that perform surprisingly well. We track these shifts across 500+ models and 50+ benchmarks in our LLM statistics.
Key AI Statistics
The forces shaping how AI models improve
US vs China AI Race
US labs like OpenAI, Anthropic, and Google still lead most benchmarks. But Chinese labs (DeepSeek, Alibaba, ByteDance) are closing in fast, especially on reasoning and coding tasks.
Open vs Closed Source
The gap is shrinking. Llama, Mistral, and Qwen now match or beat GPT-4 on several benchmarks. You can run capable models locally that would have required API access a year ago.
Falling Inference Costs
Prices keep dropping. GPT-4-level performance cost $30/M tokens in 2023. Today you can get it for under $1/M. Competition and better infrastructure are driving 10-100x reductions each year.
Parameter Efficiency
Smaller models are catching up. A 7B model today can hit scores that took 70B+ parameters last year. This means you can run strong models on a laptop or deploy them affordably.
Understanding AI Benchmark Statistics
AI benchmark statistics give you a way to compare models on specific tasks. GPQA tests graduate-level science reasoning. HumanEval measures code generation. MMLU covers broad knowledge. Each benchmark tells you something different about AI performance data.
When you look at the LLM growth rate across these benchmarks, the improvement is clear. GPQA scores went from around 50% to 75%+ in just 18 months. That kind of language model growth will likely continue, though some benchmarks are starting to saturate.
Frequently Asked Questions
Common questions about AI statistics, growth trends, and industry data
What are the current AI growth trends?
How do US and China compare in AI development?
Are open-source AI models catching up to proprietary ones?
How fast are AI inference costs decreasing?
What AI statistics does LLM Stats track?