
Claude Opus 4.8 Release, Benchmarks And More
Claude Opus 4.8 scores 88.6% on SWE-bench Verified, 74.6% on Terminal-Bench 2.1, 1890 Elo on GDPval-AA, with parallel-subagent workflows and a 2.5x fast mode. Same $5/$25 pricing.
Your daily source for LLM news, open source LLM updates, and large language model news. Breaking announcements, new AI model releases, LLM benchmark news, and the latest updates from the AI industry.
Read by people at OpenAI, Anthropic, Google, Meta — and 400,000+ more.
Recent papers from arXiv in AI, NLP, and Machine Learning
No new papers today
arXiv updates on weekdays
Free head-to-head playgrounds across image, video, website, game and chat modalities.
LLM evaluation news and benchmark results. Find the best AI model for coding, math, reasoning, and more
Stay informed with large language model news today. The LLM ecosystem has evolved dramatically, with over 500 models now available across commercial APIs and open source LLM releases. From OpenAI's GPT-4 series to Anthropic's Claude, Google's Gemini, and Meta's Llama family, developers tracking AI model updates have unprecedented choice when selecting a model.
Our LLM benchmark news covers evaluations like GPQA (graduate-level reasoning), HumanEval (code generation), and MMLU (multitask understanding). LLM evaluation news helps you compare capabilities, though real-world performance depends on your specific use case.
LLM research updates, large language model evaluation news, leaderboards, and AI model insights
A challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Questions are Google-proof and extremely difficult, with PhD experts reaching 65% accuracy.
A more robust and challenging multi-task language understanding benchmark that extends MMLU by expanding multiple-choice options from 4 to 10, eliminating trivial questions, and focusing on reasoning-intensive tasks. Features over 12,000 curated questions across 14 domains and causes a 16-33% accuracy drop compared to original MMLU.
All 30 problems from the 2025 American Invitational Mathematics Examination (AIME I and AIME II), testing olympiad-level mathematical reasoning with integer answers from 000-999. Used as an AI benchmark to evaluate large language models' ability to solve complex mathematical problems requiring multi-step logical deductions and structured symbolic reasoning.
Massive Multitask Language Understanding benchmark testing knowledge across 57 diverse subjects including STEM, humanities, social sciences, and professional domains
A verified subset of 500 software engineering problems from real GitHub issues, validated by human annotators for evaluating language models' ability to resolve real-world coding issues by generating patches for Python codebases.
Humanity's Last Exam (HLE) is a multi-modal academic benchmark with 2,500 questions across mathematics, humanities, and natural sciences, designed to test LLM capabilities at the frontier of human knowledge with unambiguous, verifiable solutions
LiveCodeBench is a holistic and contamination-free evaluation benchmark for large language models for code. It continuously collects new problems from programming contests (LeetCode, AtCoder, CodeForces) and evaluates four different scenarios: code generation, self-repair, code execution, and test output prediction. Problems are annotated with release dates to enable evaluation on unseen problems released after a model's training cutoff.
MATH dataset contains 12,500 challenging competition mathematics problems from AMC 10, AMC 12, AIME, and other mathematics competitions. Each problem includes full step-by-step solutions and spans multiple difficulty levels (1-5) across seven mathematical subjects including Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus.
Live model battles across chat, coding, image, video, and audio modalities

Claude Opus 4.8 scores 88.6% on SWE-bench Verified, 74.6% on Terminal-Bench 2.1, 1890 Elo on GDPval-AA, with parallel-subagent workflows and a 2.5x fast mode. Same $5/$25 pricing.

Three years of research, 50 frontier models, one stubborn finding: where you put a piece of information in the prompt changes the answer more than what is in it. The geometry, the evidence, and what to do about it.

Gemini 3.5 Flash is GA today. Frontier-level intelligence at 4x the speed of comparable models. $1.50/$9 per 1M tokens, 1M context, 76.2% Terminal-Bench 2.1, beats Gemini 3.1 Pro on coding and agents.
A practical guide to choosing the right LLM
Identify your primary task—code generation (HumanEval, SWE-bench), mathematical reasoning (MATH, GSM8K), or general knowledge (MMLU). Different benchmarks measure different capabilities.
API pricing ranges from $0.15/M tokens for lightweight models to $60+/M for frontier models. Use our comparison tool to find the best ratio.
Smaller models like GPT-4o-mini or Claude 3.5 Haiku offer faster responses. Reasoning models (o1, DeepSeek-R1) trade latency for accuracy on complex tasks.
Benchmarks provide signals, but real performance depends on your prompts. Create an evaluation set from actual use cases. Our AI Arena enables side-by-side comparison.
Common questions about LLM news today, open source LLM updates, and AI model releases
Large language model news, open source LLM updates, AI model comparisons, and benchmark analysis
Compare 500+ models across benchmarks. Real-time rankings updated daily.
Apache, MIT & permissive licenses
Side-by-side analysis
HumanEval, SWE-bench & more
MATH, GSM8K benchmarks
GPQA, MMLU, HumanEval, MATH, and 50+ more evaluations
Pricing, latency & throughput
Discussions & insights