Claude Opus 4.7: Benchmarks, Pricing, Context & What's New
Claude Opus 4.7 scores 87.6% on SWE-bench Verified, 94.2% on GPQA, 1M token context, 3.3x higher-resolution vision, new xhigh effort level. $5/$25 pricing.

Anthropic released Claude Opus 4.7 on April 16, 2026. It's a direct upgrade to Opus 4.6 at the same price ($5 / $25 per million input / output tokens), with meaningful gains on the hardest software engineering tasks, a new xhigh effort level, 3.3x higher-resolution vision, and better file-system memory across multi-session agent work.
The headline number: 87.6% on SWE-bench Verified, up from 80.8% on Opus 4.6. Other notable results include 94.2% on GPQA Diamond, 69.4% on Terminal-Bench 2.0, and a state-of-the-art 64.4% on Finance Agent. Opus 4.7 sits below Claude Mythos Preview in raw capability, but it's the first broadly released model carrying safeguards learned from the Project Glasswing deployment.
At a Glance
- Release date: April 16, 2026. Generally available.
- Model ID: claude-opus-4-7 on the Claude API.
- Pricing: $5 per 1M input tokens, $25 per 1M output tokens. Same as Opus 4.6.
- Context window: 1M input tokens / 128K output tokens.
- Modalities: Text + vision input (images up to 2,576px long edge, ~3.75 MP).
- Effort levels: low / medium / high / xhigh (new) / max.
- Deployment: Claude.ai, Claude API, Amazon Bedrock, Google Vertex AI, Microsoft Foundry.
- Tokenizer: Updated. Same text maps to ~1.0–1.35x more tokens than Opus 4.6.
- Cybersecurity safeguards: Automated safeguards block prohibited/high-risk cyber uses; verified program for legit security work.
What's New in 4.7
Four changes define the Opus 4.7 upgrade. None is a headline architectural overhaul; together they shift the model toward long-horizon agentic reliability.
Self-verification before reporting
Anthropic's framing is specific: Opus 4.7 "devises ways to verify its own outputs before reporting back." In practice this means the model proactively writes tests, runs sanity checks, and inspects its own output rather than declaring a task complete and handing it back. Vercel reports 4.7 "does proofs on systems code before starting work" — new behavior not seen in prior Claude models.
Literal instruction following
Opus 4.7 follows instructions more literally than any previous Claude model. Anthropic explicitly flags this as a migration concern: prompts written for earlier models that relied on loose interpretation may now produce unexpected results because Opus 4.7 takes the wording at face value. Re-tuning prompts and harnesses is recommended.
File-system memory for multi-session work
Opus 4.7 is better at reading, writing, and reusing notes on a persistent file system across sessions. For agents working over days rather than minutes, this removes the need to re-establish context at the start of every run.
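The pattern itself is simple to sketch. Everything below is illustrative: the file layout and helper names are not an Anthropic API, just the shape of notes an agent harness might persist between sessions.

```python
from pathlib import Path

# Illustrative layout; real harnesses choose their own workspace paths.
NOTES = Path("agent_workspace/NOTES.md")

def load_notes() -> str:
    """Read persisted notes at session start (empty string on first run)."""
    return NOTES.read_text() if NOTES.exists() else ""

def append_note(entry: str) -> None:
    """Record an observation that later sessions can pick up for free."""
    NOTES.parent.mkdir(parents=True, exist_ok=True)
    with NOTES.open("a") as f:
        f.write(f"- {entry}\n")

# Session 1 records a finding; session 2 starts with it already on disk,
# so the agent skips re-discovering the build setup.
append_note("migrations live in db/migrations; run via `make migrate`")
print(load_notes())
```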
Higher-resolution vision
Images up to 2,576 pixels on the long edge (~3.75 megapixels) — more than 3.3x the resolution of prior Claude models. This is a model-level change applied automatically: images sent to the API are simply processed at higher fidelity. XBOW reports a 98.5% visual-acuity benchmark (vs 54.5% for Opus 4.6), enabling autonomous pen-testing workflows that weren't viable before.
Benchmarks
All scores are self-reported by Anthropic in the launch announcement. SWE-bench results include memorization-screen adjustments; Anthropic states Opus 4.7's margin over 4.6 holds when flagged items are excluded.
Agentic coding
| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| SWE-bench Verified | 87.6% | 80.8% | +6.8 |
| SWE-bench Pro | 64.3% | 53.4% | +10.9 |
| Terminal-Bench 2.0 | 69.4% | 65.4% | +4.0 |
The jump on SWE-bench Pro (+10.9 points) is larger than on SWE-bench Verified, suggesting the improvement is concentrated on harder, less-saturated problems. Anthropic's partner testimonials back this: Replit reports same-quality output at lower cost, Rakuten measured 3x more production tasks resolved, and Cursor reports 70% on CursorBench vs 58% for Opus 4.6.
Reasoning & knowledge
| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| GPQA Diamond | 94.2% | 91.3% | +2.9 |
| HLE (with tools) | 54.7% | 53.1% | +1.6 |
| HLE (without tools) | 46.9% | 40.0% | +6.9 |
| MMMLU | 91.5% | 91.1% | +0.4 |
Agents: browse, tools, computer use
| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| BrowseComp | 79.3% | 84.0% | -4.7 |
| MCP-Atlas | 77.3% | 62.7% | +14.6 |
| OSWorld-Verified | 78.0% | 72.7% | +5.3 |
| Finance Agent (SOTA at release) | 64.4% | 60.7% | +3.7 |
| CyberGym | 73.1% | 73.8% | -0.7 |
The +14.6 point jump on MCP-Atlas is the largest single improvement in the agentic suite and lines up with the model's literal-instruction-following behavior. BrowseComp dropped 4.7 points (the headline regression); Opus 4.6 scored 84.0% under a multi-agent harness at max effort, so the comparison is sensitive to harness choice. CyberGym is effectively flat — by design, since Anthropic states they "experimented with efforts to differentially reduce" cyber capabilities during training.
Vision
| Benchmark | Opus 4.7 | Opus 4.6 | Delta |
|---|---|---|---|
| CharXiv-R (with tools) | 91.0% | 77.4% | +13.6 |
| CharXiv-R (without tools) | 82.1% | 68.7% | +13.4 |
Visual reasoning is the largest domain jump: roughly +13 points on CharXiv-R both with and without tools. Paired with the resolution upgrade below, this reshapes what computer-use and document-analysis agents can attempt.
Vision & Multimodal
The vision bump is one of the more tangible changes. Prior Claude models capped input images at roughly 1,568 pixels on the long edge (~1.15 megapixels). Opus 4.7 accepts images up to 2,576 pixels on the long edge, at ~3.75 megapixels. That's 3.3x more pixel area available to the model for a single image.
Because this is a model-level change rather than an API parameter, images users already send to Claude are processed at higher fidelity automatically. Three practical consequences:
- Computer-use agents can read dense screenshots without the operator pre-cropping or zooming. XBOW reports this closes the biggest gap between Claude and a human pen-tester looking at the same UI.
- Data extraction from complex diagrams improves sharply. CharXiv-R without tools rose from 68.7% to 82.1% — the no-tools result isolates the vision capability itself rather than the image-cropping tool.
- Higher-resolution images consume more tokens. Users who don't need the extra detail can downsample before sending.
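For the downsampling case, a small helper can compute a size whose long edge stays at the prior-generation cap before you hand the image to your image library's resize. The 1,568px default comes from the figure cited above; treating it as a token-saving threshold is an assumption to verify against your own usage data.

```python
def target_size(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Return a size whose long edge is at most max_edge, preserving aspect ratio.

    1568px is the prior-generation cap mentioned above; feed the result
    to your image library's resize call (e.g. Pillow's Image.resize).
    """
    longest = max(width, height)
    if longest <= max_edge:
        return (width, height)  # already small enough; never upscale
    scale = max_edge / longest
    return (round(width * scale), round(height * scale))

# A full-resolution 2,576 x 1,400 screenshot maps to:
print(target_size(2576, 1400))  # (1568, 852)
```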
Effort Levels & Task Budgets
The new xhigh level
Effort levels control how much the model thinks before acting. Opus 4.6 exposed low / medium / high / max. Opus 4.7 inserts xhigh between high and max, giving developers a middle ground for reasoning-heavy work without committing to the full latency of max effort.
Anthropic recommends starting with high or xhigh for coding and agentic use cases. In Claude Code, the default effort level has been raised to xhigh for all plans.
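As a sketch, a request at xhigh effort might be assembled like this. The `effort` field name and placement are assumptions based on the description above, not a confirmed API parameter; check Anthropic's API reference before relying on it.

```python
def build_request(prompt: str, effort: str = "xhigh") -> dict:
    """Assemble a chat-request body; the `effort` field name is assumed."""
    levels = ("low", "medium", "high", "xhigh", "max")
    if effort not in levels:
        raise ValueError(f"effort must be one of {levels}")
    return {
        "model": "claude-opus-4-7",
        "max_tokens": 8192,
        "effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the retry logic in worker.py")
print(req["model"], req["effort"])  # claude-opus-4-7 xhigh
```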
Task budgets (public beta)
New on the Claude Platform: task budgets, a way to cap and prioritize token spend across long-running jobs. Paired with xhigh, budgets let developers say "think hard on this, but don't burn more than N tokens finishing it." Useful for multi-agent workflows where letting one branch spiral is expensive.
Claude Code additions
- /ultrareview: a dedicated review session that reads through diffs and flags bugs or design issues a careful human reviewer would catch. Three free runs for Pro and Max.
- Auto mode for Max users: a permissions preset where Claude makes decisions on the user's behalf, allowing longer runs and fewer interruptions with less risk than fully skipping permissions.
Pricing & Availability
| Detail | Value |
|---|---|
| Input price | $5.00 / 1M tokens (≤200K prompts) |
| Output price | $25.00 / 1M tokens (≤200K prompts) |
| Long-prompt input (>200K) | $10.00 / 1M tokens |
| Long-prompt output (>200K) | $37.50 / 1M tokens |
| Max input context | 1M tokens |
| Max output | 128K tokens |
| Platforms | Claude API, Amazon Bedrock, Vertex AI, Microsoft Foundry |
| Model ID | claude-opus-4-7 |
Prices match Opus 4.6. The pitch is straightforward: same cost per token, higher capability per token, and low-effort 4.7 matches medium-effort 4.6 according to Hex's early testing. For teams running Opus at volume, the effective price per completed task drops meaningfully even though the per-token rate is unchanged.
See Anthropic's pricing page for rate-limit details and batch/caching discounts.
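The per-request arithmetic from the table is easy to encode. Below is a minimal estimator, assuming the long-prompt rates apply to both input and output whenever the prompt exceeds 200K tokens (the table implies this, but confirm against Anthropic's pricing page):

```python
def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost at Opus 4.7 list pricing.

    Standard: $5 / 1M input, $25 / 1M output (prompts up to 200K tokens).
    Long prompt (>200K input): $10 / 1M input, $37.50 / 1M output.
    """
    if input_tokens > 200_000:
        return input_tokens / 1e6 * 10.00 + output_tokens / 1e6 * 37.50
    return input_tokens / 1e6 * 5.00 + output_tokens / 1e6 * 25.00

print(estimate_cost_usd(100_000, 4_000))   # 0.6
print(estimate_cost_usd(300_000, 10_000))  # 3.375
```

Batch and caching discounts, where applicable, would apply on top of these list rates.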
Migrating from Opus 4.6
Opus 4.7 is a drop-in upgrade on the API surface, but three changes are worth planning for: two affect token usage, and one affects how prompts are interpreted.
Updated tokenizer
Opus 4.7 ships with an updated tokenizer. The same input text can map to ~1.0–1.35x more tokens than Opus 4.6, depending on content type. Code and technical text tend to land near the lower end of that range; heavily structured or multilingual content near the upper end. Budget and cost forecasting built against 4.6 should be re-measured on real traffic.
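A worst-case projection is a one-liner worth pinning down before the re-measurement lands. The helper below assumes you start from measured Opus 4.6 token counts and the 1.0–1.35x range quoted above; replace the default ratio with one measured on your own traffic.

```python
def projected_47_tokens(opus46_tokens: int, ratio: float = 1.35) -> int:
    """Project Opus 4.7 input tokens from a measured Opus 4.6 count.

    Defaults to the worst case (1.35x) for budgeting; the documented
    range is 1.0-1.35x depending on content type.
    """
    if not 1.0 <= ratio <= 1.35:
        raise ValueError("ratio outside the documented 1.0-1.35x range")
    return round(opus46_tokens * ratio)

# Worst-case projection for a prompt that measured 120K tokens on 4.6:
print(projected_47_tokens(120_000))  # 162000
```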
More thinking at higher effort
Opus 4.7 thinks more at higher effort levels, particularly on later turns of agentic runs. Output token counts can rise as a result. Anthropic's own internal coding evaluation shows token usage per completed task is improved at every effort level, because accuracy rises faster than token spend — but that's the net aggregate, not a guarantee for every workload.
Prompt re-tuning
Because 4.7 follows instructions more literally, prompts written for earlier Claude models may need revision. The most common failure mode: bullet lists of suggestions that prior models treated as "optional hints" are now treated as hard requirements. Audit system prompts before flipping the model flag at scale.
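A typical audit finding looks like this (illustrative prompt fragments, not from Anthropic's guide):

```
Before (prior models treated these as optional hints):
- Consider adding tests for changed code
- Try to keep functions under 50 lines

After (explicit about what is required versus optional):
You MUST add tests for any function you change.
You MAY split functions over 50 lines if behavior is unchanged.
```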
Anthropic's migration guide includes concrete tuning advice and token-usage comparisons per effort level.
Cybersecurity Safeguards
Opus 4.7 is the first broadly released model to ship with the Project Glasswing safeguard stack. During training, Anthropic explicitly experimented with reducing offensive cyber capabilities relative to raw model scale, and the release includes automated systems that detect and block requests indicating prohibited or high-risk cybersecurity uses.
On CyberGym, Opus 4.7 scores 73.1% — effectively flat against Opus 4.6's revised 73.8%. This flat line is a policy choice, not a capability ceiling: Mythos Preview scores 83.1% on the same benchmark but remains restricted to vetted Glasswing partners.
Security researchers working on legitimate offensive-security tasks (vulnerability research, penetration testing, red-teaming) can apply to the new Cyber Verification Program to access Opus 4.7 without triggering the default refusal behavior.
Outlook
Opus 4.7 is an incremental release by Anthropic's own framing — a direct upgrade to 4.6 rather than a new model tier. The version number signals it: 4.6 to 4.7, not 4.6 to 5.0. Mythos Preview remains the ceiling on Anthropic's frontier capability, and this release deliberately keeps some distance from it.
What makes 4.7 interesting is less the benchmark deltas and more the operational shape: literal instruction following, self-verification, higher-resolution vision, and file-system memory. These are the specific behaviors that separate a model you have to babysit from one you can hand a multi-hour task and walk away from. Devin, Factory, Ramp, and Notion's partner testimonials all land on the same point: fewer tool errors, less step-by-step guidance, longer autonomous runs.
For the full announcement and per-benchmark methodology, see Anthropic's launch post and system card.
Frequently Asked Questions
- Anthropic released Claude Opus 4.7 on April 16, 2026. It is generally available across Claude products, the Claude API (claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
- Claude Opus 4.7 pricing is $5 per million input tokens and $25 per million output tokens, unchanged from Opus 4.6. Note that Opus 4.7 uses an updated tokenizer, so the same input can map to roughly 1.0–1.35x more tokens than Opus 4.6.
- Opus 4.7 supports a 1 million token input context window with up to 128K output tokens, matching Opus 4.6. Prompts above 200K tokens are charged at a premium rate on the Claude API.
- Opus 4.7 beats Opus 4.6 across the reported benchmark suite: 87.6% vs 80.8% on SWE-bench Verified, 69.4% vs 65.4% on Terminal-Bench 2.0, 94.2% vs 91.3% on GPQA Diamond, and 64.4% vs 60.7% on Finance Agent (state-of-the-art at release). Early-access testers report low-effort 4.7 matches medium-effort 4.6.
- xhigh is a new effort level introduced in Opus 4.7, sitting between high and max. It gives developers finer control over the reasoning-vs-latency tradeoff. Claude Code now defaults to xhigh for all plans.
- Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), more than 3.3x the resolution of prior Claude models. This is a model-level change applied automatically through the vision API.