Claude Opus 4.8 Release, Benchmarks And More
Claude Opus 4.8 scores 88.6% on SWE-bench Verified, 74.6% on Terminal-Bench 2.1, 1890 Elo on GDPval-AA, with parallel-subagent workflows and a 2.5x fast mode. Same $5/$25 pricing.

Key Numbers
Opus 4.8 · May 28, 2026
Speed & Price
same model · two modes
Standard pricing is unchanged from Opus 4.7. The optional fast mode runs at 2.5× the speed for double the per-token rate, and it is three times cheaper than fast mode on previous Claude models.
Anthropic released Claude Opus 4.8 on May 28, 2026. It is a direct upgrade to Opus 4.7 at the same price ($5 / $25 per million input / output tokens), and Anthropic positions it as its most capable general-access model at release. The benchmark gains are real but modest. The more interesting changes are in how you run it.
The headline number: 88.6% on SWE-bench Verified, up from 87.6% on Opus 4.7. Around it sit 74.6% on Terminal-Bench 2.1, 93.6% on GPQA Diamond, and a leading 1890 Elo on GDPval-AA. But the release is defined by four operational shifts: parallel-subagent dynamic workflows in Claude Code, mid-task system messages on the Messages API, an optional 2.5x fast mode, and measurable honesty improvements in the alignment assessment.
At a Glance
- Release date: May 28, 2026. Generally available.
- Model ID:
claude-opus-4-8on the Claude API. - Pricing: $5 per 1M input tokens, $25 per 1M output tokens. Same as Opus 4.7.
- Fast mode: ~2.5x speed at $10 / $50 per 1M tokens (optional).
- Context window: 1M input tokens / 128K output tokens.
- Modalities: Text + vision input, text output.
- Effort: Defaults to high; xhigh and max available for harder problems.
- Claude Code: Dynamic workflows with parallel subagents for codebase-scale migrations.
- Deployment: Claude.ai, Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry.
What's New in 4.8
None of these are architectural overhauls. Together they push Opus toward longer-horizon, lower-supervision agentic work.
Dynamic workflows in Claude Code
The marquee feature. Opus 4.8 can spin up parallel subagents that each plan, execute, and verify part of a task, coordinated by an orchestrator that merges their results. Where a single agent loop processes a large refactor sequentially, dynamic workflows split it across agents working at once.
Dynamic Workflows
one task · many agents
plans, fans out,
merges results
Anthropic frames the target use case as codebase-scale migrations: the kind of multi-file, multi-hour change where the bottleneck is throughput, not raw reasoning. The same effort control now extends to claude.ai and Cowork.
Mid-task system messages
A quieter but consequential developer change: the Messages API now accepts system entries inside the messages array, not just the top-level system parameter. Harnesses can update instructions partway through a task without breaking the prompt cache. For long agentic runs this means you can steer the model mid-flight, then keep paying cached-input rates on everything that came before.
Fast mode
Opus 4.8 introduces an optional fast mode that runs at roughly 2.5x the standard speed for double the per-token rate ($10 / $50 per million tokens). Notably it is three times cheaper than fast mode on previous Claude models, which makes interactive, latency-sensitive use of a frontier Opus model far more practical.
Higher default effort
Opus 4.8 defaults to high effort, with xhigh and max available for the hardest problems. The practical implication is the same as every recent Opus release: budget for more output tokens at the top effort levels, and measure token-per-task on your own traffic rather than assuming the aggregate holds.
Benchmarks
All scores are self-reported by Anthropic in the launch announcement and system card. Two benchmarks changed versions at 4.8 (Terminal-Bench 2.0 to 2.1, Finance Agent v1 to v2), so those are shown as standalone scores rather than deltas.
4.8 vs 4.7
Agentic coding
| Benchmark | Opus 4.8 | Opus 4.7 | Delta |
|---|---|---|---|
| SWE-bench Verified | 88.6% | 87.6% | +1.0 |
| SWE-bench Pro | 69.2% | 64.3% | +4.9 |
| SWE-bench Multilingual | 84.4% | — | — |
| Terminal-Bench 2.1 | 74.6% | 69.4% (2.0) | n/a |
The +4.9 point jump on SWE-bench Pro is the real coding signal: SWE-bench Verified is approaching saturation, so the harder, less-saturated set is where headroom remains. Terminal-Bench moved to version 2.1, so its 74.6% is not directly comparable to 4.7's 2.0 score.
Reasoning & knowledge
| Benchmark | Opus 4.8 | Opus 4.7 | Delta |
|---|---|---|---|
| GPQA Diamond | 93.6% | 94.2% | -0.6 |
| HLE (with tools) | 57.9% | 54.7% | +3.2 |
| HLE (without tools) | 49.8% | 46.9% | +2.9 |
| USAMO 2026 | 96.7% | — | — |
| GDPval-AA Elo | 1890 | — | — |
GPQA Diamond is flat within noise (both models sit above 93%), which is what saturation looks like. The clearer gains are on Humanity's Last Exam (+3.2 with tools) and the knowledge-work GDPval-AA evaluation from Artificial Analysis, where Opus 4.8 leads at 1890 Elo.
Agents: browse, tools, computer use
| Benchmark | Opus 4.8 | Opus 4.7 | Delta |
|---|---|---|---|
| BrowseComp (single-agent) | 84.3% | 79.3% | +5.0 |
| BrowseComp (multi-agent) | 88.5% | — | — |
| MCP-Atlas | 82.2% | 77.3% | +4.9 |
| OSWorld-Verified | 83.4% | 78.0% | +5.4 * |
| ScreenSpot-Pro | 87.9% | — | — |
| DeepSearchQA | 93.1% | — | — |
| Toolathlon | 59.9% | — | — |
Agentic browsing and tool use are where Opus 4.8 moves most: +5.0 on BrowseComp single-agent (rising to 88.5% with a multi-agent orchestrator) and +4.9 on MCP-Atlas. The OSWorld-Verified gain (*) comes partly from an updated harness (zoom-tool fix, 128K max tokens per turn), so read it as a methodology + model improvement rather than a clean apples-to-apples delta.
Long context
| Benchmark | 1M subset | 256K subset |
|---|---|---|
| GraphWalks BFS | 68.1% | 85.9% |
| GraphWalks Parents | 83.3% | 99.3% |
The 256K results are strong; the 1M-token subset shows the usual degradation at the edge of the window. As always, treat the advertised 1M context as a ceiling, not a working budget.
Alignment & Honesty
The most distinctive part of this release is not a capability number. Anthropic's alignment assessment reports that Opus 4.8 is measurably more honest about its own work, which matters precisely because the rest of the release pushes the model to run longer with less supervision.
Alignment · Honesty
fewer unflagged flaws in self-written code
vs Opus 4.7
fewer dishonest agentic code summaries
vs Sonnet 4.6
Two results stand out. Opus 4.8 lets flaws in its own code pass unremarked roughly four times less often than Opus 4.7, and produces dishonest summaries of agentic coding work about seventeen times less often than Claude Sonnet 4.6. Anthropic also reports broadly improved adherence to Claude's constitution. These are reductions in the rate of dishonest behavior, not eliminations, but for unattended multi-agent runs the direction is the one that counts.
Pricing & Availability
| Detail | Value |
|---|---|
| Input price | $5.00 / 1M tokens |
| Output price | $25.00 / 1M tokens |
| Fast mode input | $10.00 / 1M tokens (~2.5x speed) |
| Fast mode output | $50.00 / 1M tokens (~2.5x speed) |
| Max input context | 1M tokens |
| Max output | 128K tokens |
| Platforms | Claude API, Amazon Bedrock, Vertex AI, Microsoft Foundry |
| Model ID | claude-opus-4-8 |
Standard pricing matches Opus 4.7. The pitch is the same as every recent Opus: same cost per token, more capability per token. Fast mode is the new lever, trading double the rate for 2.5x throughput, and it lands three times cheaper than the previous generation's fast tier. See Anthropic's pricing page for rate limits and batch/caching discounts.
Migrating from 4.7
Opus 4.8 is a drop-in swap on the API surface. Three things are worth checking before you flip the model flag at scale.
Vertex availability
At launch the model id resolves cleanly on the Claude API and Bedrock, but Google Cloud Vertex AI may lag by a short window before the publisher model is exposed. If you route through Vertex, confirm claude-opus-4-8 resolves there before cutting traffic over, and keep the direct Anthropic route as the primary until it does.
Mid-task system messages
If your harness re-sends a full system prompt on every turn to inject new instructions, you can now move those updates into in-array system messages and stop invalidating the prompt cache. This is opt-in: existing code keeps working unchanged.
Effort and token budgets
With high effort as the default and parallel subagents in the mix, output token consumption can rise on agentic workloads. Re-measure token-per-task on real traffic, and lean on fast mode where latency matters more than the marginal cost.
Outlook
Opus 4.8 is an incremental model release wrapped around a more interesting platform release. The benchmark deltas over 4.7 are small, and a couple (GPQA, CharXiv-R) are flat or slightly down, which is what you expect when the headline suites are saturating. The story is the operational surface: parallel-subagent workflows for codebase-scale work, prompt-cache-safe mid-task steering, a genuinely cheaper fast mode, and an alignment assessment that puts numbers on honesty.
That combination, more autonomy plus more honesty about its own output, is the through-line. The benchmarks tell you Opus 4.8 is a little smarter. The workflow and alignment changes tell you Anthropic wants you to hand it bigger jobs and check its work less. For the full announcement and per-benchmark methodology, see Anthropic's launch post and system card.
Questions
Frequently Asked Questions
- Anthropic released Claude Opus 4.8 on May 28, 2026. It is available across Claude products, the Claude API (
claude-opus-4-8), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry. - Claude Opus 4.8 pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.7. An optional fast mode runs at 2.5x speed for $10 / $50 per million tokens, three times cheaper than fast mode on previous Claude models.
- Opus 4.8 supports a 1 million token input context window with up to 128K output tokens, matching Opus 4.7.
- Opus 4.8 improves on most of the comparable suite: 88.6% vs 87.6% on SWE-bench Verified, 69.2% vs 64.3% on SWE-bench Pro, 82.2% vs 77.3% on MCP-Atlas, and 84.3% vs 79.3% on BrowseComp (single-agent). GPQA Diamond is effectively flat (93.6% vs 94.2%), and CharXiv-R dips slightly. The headline additions are operational, not benchmark deltas.
- Dynamic workflows let Claude Code spawn parallel subagents that each plan, execute, and verify a slice of a task and report back to an orchestrator. They are built for codebase-scale migrations that a single linear agent loop would grind through slowly.
- Fast mode serves Opus 4.8 at roughly 2.5x the standard speed for double the per-token rate ($10 / $50 per million tokens). It is three times cheaper than fast mode on previous Claude models, aimed at latency-sensitive interactive use.
Continue Reading
