Model Release · Technical Deep Dive

Claude Opus 4.7: Benchmarks, Pricing, Context & What's New

Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$25 pricing.

Jonathan Chavez
Co-Founder @ LLM Stats
·11 min read

Key Numbers

Opus 4.7 · Apr 16, 2026

  • SWE-bench Verified: 87.6%
  • GPQA Diamond: 94.2%
  • Finance Agent (SOTA): 64.4%
  • Terminal-Bench 2.0: 69.4%
  • OSWorld-Verified: 78.0%
  • Context window: 1M tokens

Effort Levels

low → medium → high → xhigh (new) → max

Opus 4.7 adds xhigh between high and max, giving finer control over reasoning depth without the full latency of max.

Anthropic released Claude Opus 4.7 on April 16, 2026. It's a direct upgrade to Opus 4.6 at the same price ($5 / $25 per million input / output tokens), with meaningful gains on the hardest software engineering tasks, a new xhigh effort level, 3.3x higher-resolution vision, and better file-system memory across multi-session agent work.

The headline number: 87.6% on SWE-bench Verified, up from 80.8% on Opus 4.6. Other notable results include 94.2% on GPQA Diamond, 69.4% on Terminal-Bench 2.0, and a state-of-the-art 64.4% on Finance Agent. Opus 4.7 sits below Claude Mythos Preview in raw capability, but it's the first broadly released model carrying safeguards learned from the Project Glasswing deployment.


At a Glance

  • Release date: April 16, 2026. Generally available.
  • Model ID: claude-opus-4-7 on the Claude API.
  • Pricing: $5 per 1M input tokens, $25 per 1M output tokens. Same as Opus 4.6.
  • Context window: 1M input tokens / 128K output tokens.
  • Modalities: Text + vision input (images up to 2,576px long edge, ~3.75 MP).
  • Effort levels: low / medium / high / xhigh (new) / max.
  • Deployment: Claude.ai, Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry.
  • Tokenizer: Updated. The same text maps to roughly 1.0–1.35x as many tokens as on Opus 4.6.
  • Cybersecurity safeguards: Automated safeguards block prohibited/high-risk cyber uses; verified program for legit security work.

What's New in 4.7

Four changes define the Opus 4.7 upgrade. None are headline architectural overhauls. Together they shift the model toward long-horizon agentic reliability.

Self-verification before reporting

Anthropic's framing is specific: Opus 4.7 "devises ways to verify its own outputs before reporting back." In practice this means the model proactively writes tests, runs sanity checks, and inspects its own output rather than declaring a task complete and handing it back. Vercel reports 4.7 "does proofs on systems code before starting work" — new behavior not seen in prior Claude models.

Agent Loop

Opus 4.6: Plan → Execute → Report
Opus 4.7: Plan → Execute → Verify (new) → Report

Opus 4.7 verifies its own outputs before reporting back. On long-running agentic work, this cuts error rates by double digits on tasks where 4.6 would otherwise report confidently incorrect results.

Literal instruction following

Opus 4.7 follows instructions more literally than any previous Claude model. Anthropic explicitly flags this as a migration concern: prompts written for earlier models that relied on loose interpretation may now produce unexpected results because Opus 4.7 takes the wording at face value. Re-tuning prompts and harnesses is recommended.

File-system memory for multi-session work

Opus 4.7 is better at reading, writing, and reusing notes on a persistent file system across sessions. For agents working over days rather than minutes, this removes the need to re-establish context at the start of every run.

Higher-resolution vision

Images up to 2,576 pixels on the long edge (~3.75 megapixels) — more than 3.3x the resolution of prior Claude models. This is a model-level change applied automatically: images sent to the API are simply processed at higher fidelity. XBOW reports a 98.5% visual-acuity benchmark (vs 54.5% for Opus 4.6), enabling autonomous pen-testing workflows that weren't viable before.


Benchmarks

All scores are self-reported by Anthropic in the launch announcement. SWE-bench results include memorization-screen adjustments; Anthropic states Opus 4.7's margin over 4.6 holds when flagged items are excluded.

| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| MCP-Atlas | 77.3 | 62.7 | +14.6 |
| CharXiv-R (with tools) | 91.0 | 77.4 | +13.6 |
| SWE-bench Pro | 64.3 | 53.4 | +10.9 |
| SWE-bench Verified | 87.6 | 80.8 | +6.8 |
| OSWorld-Verified | 78.0 | 72.7 | +5.3 |
| Terminal-Bench 2.0 | 69.4 | 65.4 | +4.0 |
| Finance Agent (SOTA) | 64.4 | 60.7 | +3.7 |
| GPQA Diamond | 94.2 | 91.3 | +2.9 |
| HLE (with tools) | 54.7 | 53.1 | +1.6 |
| BrowseComp | 79.3 | 84.0 | -4.7 |

Self-reported by Anthropic. Opus 4.6 SWE-bench scores reflect memorization-screen adjustments; the margin holds. Scores on a 0–100 scale.

Agentic coding

| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | +6.8 |
| SWE-bench Pro | 64.3% | 53.4% | +10.9 |
| Terminal-Bench 2.0 | 69.4% | 65.4% | +4.0 |

The jump on SWE-bench Pro (+10.9 points) is larger than on SWE-bench Verified, suggesting the improvement is concentrated on harder, less-saturated problems. Anthropic's partner testimonials back this: Replit reports same-quality output at lower cost, Rakuten measured 3x more production tasks resolved, and Cursor reports 70% on CursorBench vs 58% for Opus 4.6.

Reasoning & knowledge

| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| GPQA Diamond | 94.2% | 91.3% | +2.9 |
| HLE (with tools) | 54.7% | 53.1% | +1.6 |
| HLE (without tools) | 46.9% | 40.0% | +6.9 |
| MMMLU | 91.5% | 91.1% | +0.4 |

Agents: browse, tools, computer use

| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| BrowseComp | 79.3% | 84.0% | -4.7 |
| MCP-Atlas | 77.3% | 62.7% | +14.6 |
| OSWorld-Verified | 78.0% | 72.7% | +5.3 |
| Finance Agent (SOTA) | 64.4% | 60.7% | +3.7 |
| CyberGym | 73.1% | 73.8% | -0.7 |

The +14.6 point jump on MCP-Atlas is the largest single improvement in the agentic suite and lines up with the model's literal-instruction-following behavior. BrowseComp dropped 4.7 points (the headline regression); Opus 4.6 scored 84.0% under a multi-agent harness at max effort, so the comparison is sensitive to harness choice. CyberGym is effectively flat — by design, since Anthropic states they "experimented with efforts to differentially reduce" cyber capabilities during training.

Vision

| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| CharXiv-R (with tools) | 91.0% | 77.4% | +13.6 |
| CharXiv-R (without tools) | 82.1% | 68.7% | +13.4 |

Visual reasoning is the largest domain jump: roughly +13 points on CharXiv-R both with and without tools. Paired with the resolution upgrade below, this reshapes what computer-use and document-analysis agents can attempt.


Vision & Multimodal

The vision bump is one of the more tangible changes. Prior Claude models capped input images at roughly 1,568 pixels on the long edge (~1.15 megapixels). Opus 4.7 accepts images up to 2,576 pixels on the long edge, at ~3.75 megapixels. That's 3.3x more pixel area available to the model for a single image.

Max image resolution

  • Prior Claude models: 1,568 px long edge (~1.15 MP)
  • Opus 4.7: 2,576 px long edge (~3.75 MP), 3.3x more pixel area

Higher resolution unlocks dense screenshot reading, complex diagram extraction, and pixel-perfect visual references without operator-side pre-cropping. Applied automatically through the vision API.

Because this is a model-level change rather than an API parameter, images users already send to Claude are processed at higher fidelity automatically. Three practical consequences:

  • Computer-use agents can read dense screenshots without the operator pre-cropping or zooming. XBOW reports this closes the biggest gap between Claude and a human pen-tester looking at the same UI.
  • Data extraction from complex diagrams improves sharply. CharXiv-R without tools rose from 68.7% to 82.1% — the no-tools result isolates the vision capability itself rather than the image-cropping tool.
  • Higher-resolution images consume more tokens. Users who don't need the extra detail can downsample before sending.
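For callers who want to downsample, the arithmetic is simple: scale both dimensions so the long edge fits a target cap. A minimal sketch (the `fit_long_edge` helper is illustrative, not part of any SDK; 1,568 px is the prior-model cap cited above, 2,576 px the new one):

```python
def fit_long_edge(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the long edge is at most max_edge.

    1568 px was the prior Claude long-edge cap; pass max_edge=2576 to
    use the full Opus 4.7 budget. Returns the original size if the
    image already fits.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

# A 4000x3000 screenshot downsampled to the prior 1568 px cap:
print(fit_long_edge(4000, 3000))  # -> (1568, 1176)
```

Apply the resulting dimensions with whatever image library the pipeline already uses before encoding and sending.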

Effort Levels & Task Budgets

The new xhigh level

Effort levels control how much the model thinks before acting. Opus 4.6 exposed low / medium / high / max. Opus 4.7 inserts xhigh between high and max, giving developers a middle ground for reasoning-heavy work without committing to the full latency of max effort.

Anthropic recommends starting with high or xhigh for coding and agentic use cases. In Claude Code, the default effort level has been raised to xhigh for all plans.
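As a sketch of how effort selection might look in a request payload (the announcement does not name the exact API field, so the `effort` key and `build_request` helper here are assumptions for illustration only):

```python
# Hypothetical request builder. The "effort" field name is an
# assumption; consult the official API reference for the real parameter.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def build_request(prompt: str, effort: str = "xhigh") -> dict:
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "model": "claude-opus-4-7",
        "effort": effort,  # new in 4.7: xhigh sits between high and max
        "messages": [{"role": "user", "content": prompt}],
    }

print(build_request("Refactor the auth module")["effort"])  # -> xhigh
```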

Task budgets (public beta)

New on the Claude Platform: task budgets, a way to cap and prioritize token spend across long-running jobs. Paired with xhigh, budgets let developers say "think hard on this, but don't burn more than N tokens finishing it." Useful for multi-agent workflows where letting one branch spiral is expensive.
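The platform feature is server-side, but the idea can be sketched as a client-side guard: track cumulative spend per job and refuse work past the cap. Everything below (class name, fields) is illustrative, not the actual task-budgets API:

```python
# Sketch of a client-side token-budget guard for a long-running job.
class TaskBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record token spend; raise before the cap would be exceeded."""
        if self.spent + tokens > self.max_tokens:
            raise RuntimeError(
                f"budget exceeded: {self.spent + tokens} > {self.max_tokens}"
            )
        self.spent += tokens

budget = TaskBudget(max_tokens=50_000)
budget.charge(30_000)  # first agent turn
budget.charge(15_000)  # second agent turn
print(budget.spent)    # -> 45000
```

A spiraling branch then fails fast at its next `charge` call instead of silently consuming the whole budget.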

Claude Code additions

  • /ultrareview: dedicated review session that reads through diffs and flags bugs or design issues a careful human reviewer would catch. Three free runs for Pro and Max.
  • Auto mode for Max users: permissions preset where Claude makes decisions on the user's behalf — longer runs, fewer interruptions, less risk than fully skipping permissions.

Pricing & Availability

| Detail | Value |
|---|---|
| Input price | $5.00 / 1M tokens (≤200K prompts) |
| Output price | $25.00 / 1M tokens (≤200K prompts) |
| Long-prompt input (>200K) | $10.00 / 1M tokens |
| Long-prompt output (>200K) | $37.50 / 1M tokens |
| Max input context | 1M tokens |
| Max output | 128K tokens |
| Platforms | Claude API, Amazon Bedrock, Vertex AI, Microsoft Foundry |
| Model ID | claude-opus-4-7 |

Prices match Opus 4.6. The pitch is straightforward: same cost per token, higher capability per token, and low-effort 4.7 matches medium-effort 4.6 according to Hex's early testing. For teams running Opus at volume, the effective price per completed task drops meaningfully even though the per-token rate is unchanged.

See Anthropic's pricing page for rate-limit details and batch/caching discounts.
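A per-request cost estimate follows directly from the rates above. One assumption to flag: this sketch bills the entire request at the long-prompt rate once input exceeds 200K tokens, which the table implies but does not spell out:

```python
def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one Opus 4.7 API call from published rates.

    Assumes the whole request is billed at the long-prompt rate once
    input exceeds 200K tokens (not spelled out in the rate table).
    """
    if input_tokens > 200_000:
        in_rate, out_rate = 10.00, 37.50   # $/1M tokens, >200K prompts
    else:
        in_rate, out_rate = 5.00, 25.00    # $/1M tokens, <=200K prompts
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 100K in / 4K out stays under the long-prompt threshold:
print(request_cost(100_000, 4_000))  # -> 0.6
```

Batch and caching discounts would reduce these figures further.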


Migrating from Opus 4.6

Opus 4.7 is a drop-in upgrade on the API surface, but three changes are worth planning for: two affect token usage, and one affects how prompts are interpreted.

Updated tokenizer

Opus 4.7 ships with an updated tokenizer. The same input text can map to roughly 1.0–1.35x as many tokens as on Opus 4.6, depending on content type. Code and technical text tend to land near the lower end of that range; heavily structured or multilingual content near the upper end. Budget and cost forecasting built against 4.6 should be re-measured on real traffic.
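For a rough first pass before measuring real traffic, the 1.0–1.35x range can be turned into a bracket on existing token counts (the helper name is illustrative; the multipliers are from the migration notes above):

```python
def retokenized_range(old_tokens: int, low: float = 1.0, high: float = 1.35) -> tuple[int, int]:
    """Bracket how many tokens 4.6-era text may become under the 4.7 tokenizer.

    The 1.0-1.35x range is the published guidance; actual ratios depend
    on content type, so always re-measure on real traffic.
    """
    return round(old_tokens * low), round(old_tokens * high)

# A prompt that was 120K tokens under Opus 4.6 could grow to 162K:
print(retokenized_range(120_000))  # -> (120000, 162000)
```

Note that the upper bound can push a previously sub-200K prompt over the long-prompt pricing threshold.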

More thinking at higher effort

Opus 4.7 thinks more at higher effort levels, particularly on later turns of agentic runs. Output token counts can rise as a result. Anthropic's own internal coding evaluation shows token usage per completed task is improved at every effort level, because accuracy rises faster than token spend — but that's the net aggregate, not a guarantee for every workload.

Prompt re-tuning

Because 4.7 follows instructions more literally, prompts written for earlier Claude models may need revision. The most common failure mode: bullet lists of suggestions that prior models treated as "optional hints" are now treated as hard requirements. Audit system prompts before flipping the model flag at scale.

Anthropic's migration guide includes concrete tuning advice and token-usage comparisons per effort level.


Cybersecurity Safeguards

Opus 4.7 is the first broadly released model to ship with the Project Glasswing safeguard stack. During training, Anthropic explicitly experimented with reducing offensive cyber capabilities relative to raw model scale, and the release includes automated systems that detect and block requests indicating prohibited or high-risk cybersecurity uses.

On CyberGym, Opus 4.7 scores 73.1% — effectively flat against Opus 4.6's revised 73.8%. This flat line is a policy choice, not a capability ceiling: Mythos Preview scores 83.1% on the same benchmark but remains restricted to vetted Glasswing partners.

Security researchers working on legitimate offensive-security tasks (vulnerability research, penetration testing, red-teaming) can apply to the new Cyber Verification Program to access Opus 4.7 without triggering the default refusal behavior.


Outlook

Opus 4.7 is an incremental release by Anthropic's own framing — a direct upgrade to 4.6 rather than a new model tier. The version number signals it: 4.6 to 4.7, not 4.6 to 5.0. Mythos Preview remains the ceiling on Anthropic's frontier capability, and this release deliberately keeps some distance from it.

What makes 4.7 interesting is less the benchmark deltas and more the operational shape: literal instruction following, self-verification, higher-resolution vision, and file-system memory. These are the specific behaviors that separate a model you have to babysit from one you can hand a multi-hour task and walk away from. Devin, Factory, Ramp, and Notion's partner testimonials all land on the same point: fewer tool errors, less step-by-step guidance, longer autonomous runs.

For the full announcement and per-benchmark methodology, see Anthropic's launch post and system card.


Frequently Asked Questions

  • Anthropic released Claude Opus 4.7 on April 16, 2026. It is generally available across Claude products, the Claude API (claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
  • Claude Opus 4.7 pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6. Note that Opus 4.7 uses an updated tokenizer, so the same input can map to roughly 1.0–1.35x as many tokens as on Opus 4.6.
  • Opus 4.7 supports a 1 million token input context window with up to 128K output tokens, matching Opus 4.6. Prompts above 200K tokens are charged at a premium rate on the Claude API.
  • Opus 4.7 beats Opus 4.6 across the reported benchmark suite: 87.6% vs 80.8% on SWE-bench Verified, 69.4% vs 65.4% on Terminal-Bench 2.0, 94.2% vs 91.3% on GPQA Diamond, and 64.4% vs 60.7% on Finance Agent (state-of-the-art at release). Early-access testers report low-effort 4.7 matches medium-effort 4.6.
  • xhigh is a new effort level introduced in Opus 4.7, sitting between high and max. It gives developers finer control over the reasoning-vs-latency tradeoff. Claude Code now defaults to xhigh for all plans.
  • Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), more than 3.3x the resolution of prior Claude models. This is a model-level change applied automatically through the vision API.
