LLM News Today

Your daily source for LLM and AI news: breaking announcements, new model releases, benchmark results, and the latest headlines from the AI industry.

LLM Update News

From AI labs, research & community

New LLM Releases This Week

LLM Leaderboard

Latest large language model updates with benchmark scores

No releases this week

Browse all models

Open Source LLM News

Open LLM Leaderboard

Latest open source LLM releases and updates today

No OSS releases this week

Explore open-source

Open Source LLMs

Open-source models have transformed the AI landscape, offering full control over infrastructure. Models like Llama 3, Mistral, and Qwen now rival proprietary alternatives on many benchmarks while providing flexibility to fine-tune, self-host, and customize for specific domains.

Key considerations include licensing terms (Apache 2.0, MIT, or custom licenses), parameter count (affecting inference costs), quantization support for efficient deployment, and the community ecosystem of fine-tuned variants and tooling.
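
To make the deployment side concrete, here is a minimal sketch of loading an open-weight model with 4-bit quantization. It assumes the Hugging Face transformers, accelerate, and bitsandbytes libraries and a CUDA GPU; the model ID is only an example, and any open-weight checkpoint with a compatible license would work.

```python
# Minimal sketch: load an open-weight model with 4-bit quantization.
# Assumes transformers, accelerate, and bitsandbytes are installed and a
# CUDA GPU is available; the model ID is an example, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example open-weight model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights cut memory ~4x vs fp16
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",  # let accelerate place layers on available devices
)

prompt = "List three benefits of self-hosting an open-source LLM."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```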

LLM Benchmark Leaderboards

Find the best AI model for your use case across coding, math, reasoning, and more

Understanding the LLM Landscape

The large language model ecosystem has evolved dramatically, with over 500 models now available across commercial APIs and open-source releases. From OpenAI's GPT-4 series to Anthropic's Claude, Google's Gemini, and Meta's Llama family, developers now have unprecedented choice when selecting an AI model.

Benchmark evaluations like GPQA (graduate-level reasoning), HumanEval (code generation), and MMLU (multitask understanding) provide standardized ways to compare capabilities. However, real-world performance depends on your specific use case—a model that excels at coding may not be ideal for creative writing or domain-specific reasoning.
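
For code benchmarks such as HumanEval, the headline number is usually pass@k: the estimated probability that at least one of k sampled completions passes the unit tests. The sketch below implements the standard unbiased estimator from the HumanEval paper; the sample counts are made-up numbers for illustration only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations (of which c pass the unit tests) is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative numbers only: three problems, 20 samples generated per problem,
# with 14, 3, and 0 samples passing the tests respectively.
results = [(20, 14), (20, 3), (20, 0)]
for k in (1, 5, 10):
    score = sum(pass_at_k(n, c, k) for n, c in results) / len(results)
    print(f"pass@{k}: {score:.3f}")
```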

50+ benchmarks · 500+ models · Updated hourly

LLM Benchmarks & Resources

Comprehensive large language model evaluation, leaderboards, and research insights

Top LLM Benchmarks

All Benchmarks

GPQA

general · reasoning

A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are designed to be "Google-proof" and extremely difficult; even PhD-level domain experts reach only about 65% accuracy.

159 models

MMLU

general · language

Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains.

93 models

MMLU-Pro

general · language

A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.

87 models

AIME 2025

math · reasoning

All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.

78 models

MATH

math · reasoning

The MATH dataset contains 12,500 challenging competition mathematics problems drawn from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes a full step-by-step solution, and the set spans five difficulty levels (1-5) and seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.

67 models

HumanEval

code · reasoning

A benchmark that measures functional correctness of programs synthesized from docstrings, consisting of 164 original programming problems that assess language comprehension, algorithms, and simple mathematics.

63 models

LiveCodeBench

code · general

LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.

62 models

MMMU

general · healthcare

MMMU (Massive Multi-discipline Multimodal Understanding) is a benchmark designed to evaluate multimodal models on college-level subject knowledge and deliberate reasoning. It contains 11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering, spanning 30 subjects and 183 subfields.

55 models

AI Arenas

Live model battles across chat, coding, image, video, and audio modalities

Evaluating AI Models

A practical guide to choosing the right LLM

Define your use case

Identify your primary task—code generation (HumanEval, SWE-bench), mathematical reasoning (MATH, GSM8K), or general knowledge (MMLU). Different benchmarks measure different capabilities.

Consider cost vs. performance

API pricing ranges from roughly $0.15/M tokens for lightweight models to $60+/M for frontier models. Use our comparison tool to find the best price-to-performance ratio for your workload.

Evaluate latency & throughput

Smaller models like GPT-4o-mini or Claude 3.5 Haiku offer faster responses. Reasoning models (o1, DeepSeek-R1) trade latency for accuracy on complex tasks.

Test with your own data

Benchmarks provide signals, but real performance depends on your prompts. Create an evaluation set from actual use cases. Our AI Arena enables side-by-side comparison.
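
A minimal sketch of that workflow is below, assuming nothing beyond the Python standard library: the eval cases, per-token prices, and the call_model stub are all illustrative placeholders, not real data or a real provider SDK.

```python
# Tiny home-grown eval: run the same prompts through candidate models,
# score with a crude substring match, and estimate input-token cost.
# Everything here is a placeholder: swap call_model for your real API or
# local inference client, and the prices/cases for your own numbers.

EVAL_SET = [
    {"prompt": "Extract the total from: 'Total due: $418.20'. Number only.", "expected": "418.20"},
    {"prompt": "What is 17 * 24? Answer with the number only.", "expected": "408"},
]

PRICE_PER_M_INPUT_TOKENS = {"model-small": 0.15, "model-frontier": 15.00}  # USD, illustrative


def call_model(model: str, prompt: str) -> str:
    """Stand-in for a real model call; returns canned text so the sketch runs."""
    return "408" if "17 * 24" in prompt else "418.20"


def evaluate(model: str) -> None:
    correct, input_tokens = 0, 0
    for case in EVAL_SET:
        answer = call_model(model, case["prompt"])
        correct += case["expected"] in answer      # crude exact-match scoring
        input_tokens += len(case["prompt"]) // 4   # rough heuristic: ~4 chars per token
    cost = input_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS[model]
    print(f"{model}: {correct}/{len(EVAL_SET)} correct, ~${cost:.6f} input cost")


for name in PRICE_PER_M_INPUT_TOKENS:
    evaluate(name)
```

In practice the substring check would be replaced with task-appropriate scoring, such as unit tests for code or a rubric for open-ended answers, but the shape of the loop stays the same.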

Frequently Asked Questions

Common questions about LLM news, updates, and benchmarks

What are the latest LLM updates today?

LLM Stats aggregates the latest large language model updates from major AI labs including OpenAI, Anthropic, Google, Meta, and others. Our news feed is updated hourly with new model releases, benchmark results, and AI research announcements. Check the LLM Update News section above for today's headlines.

Where can I find open source LLM news today?

Our Open Source LLM News section tracks new open-weight language model releases, including models under Apache 2.0, MIT, and other permissive licenses. We monitor releases from organizations like Meta (Llama), Mistral, and the wider open-source AI community. For a complete ranking, visit our Open LLM Leaderboard.

How do I compare LLM benchmark scores?

LLM Stats provides comprehensive benchmark comparisons across popular evaluations like GPQA, MMLU, HumanEval, and more. Visit our LLM Leaderboard to compare models side-by-side, or check the LLM Benchmark News section for the latest evaluation results and performance leaders.

What new large language models were released this week?

Our New LLM Releases This Week section shows the most recent large language model releases with their benchmark performance scores. This includes both proprietary models from companies like OpenAI and Anthropic, as well as open source releases. For historical data, check our New Models page.

Explore LLM Stats

Dive deeper into large language model data, comparisons, and analysis