Claude Opus 4.7 vs Opus 4.6
Head-to-head comparison of Claude Opus 4.7 vs Opus 4.6: benchmark deltas, pricing, effort levels, vision, tokenizer, and a migration checklist. Opus 4.7 wins 12 of 14 reported benchmarks at the same $5/$25 price.

The Spectrum
Opus 4.6 → Opus 4.7: same price, and every benchmark moves.
Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. It's a same-tier upgrade — not a new model family — and the pitch is specific: same price, higher capability per token, longer autonomous runs. This post is the direct head-to-head. Every benchmark, every price, every behavior change, and a migration checklist you can use as a reference when flipping the model flag in production.
The Verdict
Opus 4.7 is a drop-in replacement for Opus 4.6. It beats 4.6 on 12 of 14 reported benchmarks, adds a new xhigh effort level, sees images at 3.3× higher resolution, follows instructions more literally, and introduces self-verification on long-running agentic work. The per-token price is unchanged. Low-effort 4.7 matches medium-effort 4.6 on quality, so the effective cost per completed task drops even though the $5 / $25 rate is identical.
Two caveats worth knowing before migrating. First, the updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, which affects budgeting. Second, because 4.7 takes prompts more literally, prompts tuned for 4.6's looser interpretation can misfire. Both are quick to audit but worth a pass.
Side-by-Side at a Glance
Nothing on the commercial surface changed: same context window, same pricing tiers, same platforms. The differences are operational — new effort level, updated tokenizer, higher-resolution vision, and new behaviors around self-verification and instruction following.
| Spec | Opus 4.6 | Opus 4.7 |
|---|---|---|
| Release date | Feb 5, 2026 | Apr 16, 2026 |
| Model ID | claude-opus-4-6 | claude-opus-4-7 |
| Input / output price (≤200K) | $5 / $25 per 1M | $5 / $25 per 1M |
| Input / output price (>200K) | $10 / $37.50 per 1M | $10 / $37.50 per 1M |
| Context window | 1M input / 128K output | 1M input / 128K output |
| Modalities | Text + image (~1.15 MP) | Text + image (~3.75 MP) |
| Effort levels | low / medium / high / max | low / medium / high / xhigh / max |
| Tokenizer | Original | Updated (1.0–1.35× more tokens) |
| Self-verification loop | No | Yes (Plan → Execute → Verify → Report) |
| File-system memory | Standard | Improved multi-session reuse |
| Cybersecurity safeguards | Standard | Project Glasswing safeguards |
Benchmark Deltas
Every benchmark below is self-reported by Anthropic in the Opus 4.7 launch announcement. The chart sorts by delta so the largest moves appear first. Positive bars (in Anthropic orange) are gains; negative bars (in rose) are regressions.
Every Benchmark
14 benchmarks · 12 gains · 2 regressions. Lines going up are wins; two come back down.
The pattern across the deltas is that gains concentrate on the hardest, least-saturated problems: SWE-bench Pro (+10.9pp) jumps more than SWE-bench Verified (+6.8pp), HLE without tools (+6.9pp) jumps more than HLE with tools (+1.6pp), and MCP-Atlas (+14.6pp) — the agentic tool-use benchmark — takes the single largest jump of the release.
Coding
| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
|---|---|---|---|
| SWE-bench Verified | 80.8% | 87.6% | +6.8 |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 |
| Terminal-Bench 2.0 | 65.4% | 69.4% | +4.0 |
Partner numbers reinforce the benchmark story. Replit reports same-quality output at lower cost, Rakuten measured 3× more production tasks resolved, and Cursor reports 70% on CursorBench vs 58% for Opus 4.6.
Reasoning & Knowledge
| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
|---|---|---|---|
| GPQA Diamond | 91.3% | 94.2% | +2.9 |
| HLE (with tools) | 53.1% | 54.7% | +1.6 |
| HLE (without tools) | 40.0% | 46.9% | +6.9 |
| MMMLU | 91.1% | 91.5% | +0.4 |
Agents
| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
|---|---|---|---|
| MCP-Atlas | 62.7% | 77.3% | +14.6 |
| OSWorld-Verified | 72.7% | 78.0% | +5.3 |
| Finance Agent SOTA | 60.7% | 64.4% | +3.7 |
| BrowseComp | 84.0% | 79.3% | −4.7 |
| CyberGym | 73.8% | 73.1% | −0.7 |
BrowseComp is the only real regression. Opus 4.6's 84.0% was measured under a multi-agent harness at max effort; the comparison is sensitive to harness choice, but the delta is honest enough to flag. CyberGym is effectively flat by design — Anthropic states it "experimented with efforts to differentially reduce" offensive cyber capabilities during training.
Vision
| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
|---|---|---|---|
| CharXiv-R (with tools) | 77.4% | 91.0% | +13.6 |
| CharXiv-R (without tools) | 68.7% | 82.1% | +13.4 |
The no-tools delta is the one that matters: +13.4pp isolates the vision capability itself rather than the image-cropping tool. XBOW similarly reports a 98.5% visual-acuity score on Opus 4.7 (vs 54.5% on 4.6) — large enough to unblock autonomous pen-testing workflows that weren't viable on 4.6.
Win / Loss Scorecard
Grouped by capability domain. Cybersecurity is called out separately because it's a deliberate non-improvement rather than a regression.
By Capability
Four domains. Every domain's average moves forward; the one notable regression lives in agents.
Coding
Plan, implement, ship.
- +10.9 SWE-bench Pro
- +6.8 SWE-bench Verified
- +4.0 Terminal-Bench 2.0
Avg delta +7.2pp · 3 of 3 gains
Reasoning
The raw capability floor.
- +6.9 HLE (no tools)
- +2.9 GPQA Diamond
- +1.6 HLE (with tools)
- +0.4 MMMLU
Avg delta +3.0pp · 4 of 4 gains
Agents
Long-running autonomous work.
- +14.6 MCP-Atlas
- +5.3 OSWorld-Verified
- +3.7 Finance Agent
- −4.7 BrowseComp
Avg delta +4.7pp · 3 of 4 gains
Vision
Higher-fidelity perception.
- +13.6 CharXiv-R (tools)
- +13.4 CharXiv-R (no tools)
Avg delta +13.5pp · 2 of 2 gains
What Actually Changed
Benchmarks are the score; behaviors are the story. Four operational changes make Opus 4.7 feel different in practice, even on prompts where the benchmark delta is small.
Self-verification before reporting
Anthropic's framing: Opus 4.7 "devises ways to verify its own outputs before reporting back." In practice this means the model writes tests, runs sanity checks, and inspects its own output before declaring a task complete. Vercel reports 4.7 "does proofs on systems code before starting work" — behavior not seen on 4.6. On long agentic runs, this is the single change most users report feeling first: fewer confident-but-wrong reports back.
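The same Plan → Execute → Verify → Report shape can also be enforced on the client side of an agent harness. A minimal sketch, with a toy task standing in for model calls — every name here is illustrative, not part of any Anthropic API:

```python
from dataclasses import dataclass

@dataclass
class Report:
    result: list
    verified: bool
    attempts: int

def run_with_verification(task, execute, verify, max_attempts=3):
    """Execute a task, sanity-check the output, and only report once it
    passes verification (or attempts run out)."""
    for attempt in range(1, max_attempts + 1):
        result = execute(task)        # Execute: produce a candidate output
        if verify(task, result):      # Verify: check before declaring done
            return Report(result, True, attempt)
    return Report(result, False, max_attempts)  # Report: flag as unverified

# Toy task: "sort these numbers" with a deliberately flaky executor
# whose first attempt is wrong, forcing the verify step to catch it.
calls = {"n": 0}
def flaky_sort(nums):
    calls["n"] += 1
    return nums if calls["n"] == 1 else sorted(nums)

report = run_with_verification(
    [3, 1, 2],
    execute=flaky_sort,
    verify=lambda task, out: out == sorted(task),
)
print(report)  # verified on the second attempt
```

The point of the sketch is the control flow: nothing reaches the "report" stage until a verification predicate passes, which is the behavioral change users notice on long runs.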
Literal instruction following
Opus 4.7 follows instructions more literally than 4.6 or any prior Claude model. Anthropic flags this explicitly as a migration concern: prompts that depended on loose interpretation may now produce unexpected results because 4.7 takes the wording at face value. The most common failure mode is bullet lists of "suggestions" that 4.6 treated as optional hints being read as hard requirements on 4.7. Audit system prompts before rollout.
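A first pass of that audit can be automated. A sketch that flags hedge phrases a 4.6-tuned prompt may contain — the phrase list is an illustrative starting point, not an exhaustive or official one:

```python
import re

# Phrases that tend to read as optional hints on 4.6 but may be
# treated as hard requirements by a more literal model.
HEDGE_PATTERNS = [
    r"\bconsider\b",
    r"\byou might\b",
    r"\bif possible\b",
    r"\bsuggestions?\b",
]

def audit_prompt(prompt: str) -> list[str]:
    """Return hedge phrases found in a system prompt, for human review."""
    hits = []
    for pattern in HEDGE_PATTERNS:
        for match in re.finditer(pattern, prompt, flags=re.IGNORECASE):
            hits.append(match.group(0))
    return hits

system_prompt = (
    "You are a code reviewer. Consider adding type hints. "
    "Suggestions: you might also flag TODO comments."
)
print(audit_prompt(system_prompt))  # → ['Consider', 'you might', 'Suggestions']
```

Each hit is a candidate for rewording: either promote it to an explicit requirement or cut it, so the prompt says exactly what you mean.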
Higher-resolution vision
Images up to 2,576 pixels on the long edge (~3.75 MP), versus ~1,568 px (~1.15 MP) on 4.6: 3.3× more pixel area per image, applied automatically through the vision API. Two practical consequences: computer-use agents can read dense screenshots without operator-side pre-cropping, and data extraction from complex diagrams improves sharply (see the +13.4pp CharXiv-R no-tools delta).
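Workloads that don't need the extra detail can hold token spend at 4.6 levels by downsampling client-side. A sketch of the dimension math, clamping an image back to the ~1,568 px long edge of 4.6 (pure arithmetic; feed the result to whatever resizing library you already use):

```python
def clamp_long_edge(width: int, height: int, max_edge: int = 1568) -> tuple[int, int]:
    """Scale (width, height) down so the longer edge is at most max_edge,
    preserving aspect ratio. Dimensions already within the limit pass through."""
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale)

print(clamp_long_edge(2576, 1932))  # → (1568, 1176): 4.7-native back to 4.6 scale
print(clamp_long_edge(800, 600))    # → (800, 600): already small, unchanged
```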
File-system memory for multi-session work
Opus 4.7 is better at reading, writing, and reusing notes on a persistent file system across sessions. For agents that work over days rather than minutes — think a long-running engineering task that spans multiple model turns across multiple sessions — this removes the need to re-establish context at the start of every run.
New xhigh effort level
xhigh is a new tier between high and max. It gives developers finer control over the reasoning-vs-latency tradeoff: more thinking than high, without the full cost of max. Claude Code raised its default effort to xhigh for all plans on the 4.7 rollout.
Cost: Same Price, Fewer Tokens
Per-token pricing is identical. The cost story lives at the task level: Hex's early-access testing found low-effort Opus 4.7 matches medium-effort Opus 4.6 on quality, so the same completed work uses meaningfully fewer tokens at a cheaper effort tier. Anthropic's internal coding eval similarly reports token usage per completed task improved at every effort level — accuracy rises faster than token spend.
The Cost Story
Per completed task: the price per token didn't change; the tokens per task did.
Hex's early-access testing found low-effort Opus 4.7 matches the quality of medium-effort Opus 4.6: the same finished work, at a cheaper effort tier, at an identical $5 / $25 per-token rate. Anthropic's own coding eval reports the same pattern across every effort level: accuracy rises faster than token spend.
- Per-token price: $5 / $25, identical to Opus 4.6. Long-prompt rates ($10 / $37.50 above 200K) are also unchanged.
- Tokenizer shift: 1.0–1.35×. The same input text maps to more tokens on 4.7; the factor is content-dependent, so re-measure budgets.
- Thinking at higher effort: output token counts rise at high / xhigh / max on reasoning-heavy turns.
There are two offsetting pressures to plan for. The updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, so static token budgets built on 4.6 need re-measuring. And at higher effort levels, 4.7 thinks more than 4.6 did — output token counts can rise on reasoning-heavy turns even when the task itself is the same. On aggregate, these effects are smaller than the quality-per-effort-tier gain, but any individual workload should be benchmarked before flipping the flag at scale.
| Cost lever | Effect on spend vs 4.6 |
|---|---|
| Per-token price (≤200K) | Identical ($5 / $25 per 1M) |
| Tokens per task at matched quality | Lower (low-effort 4.7 ≈ medium-effort 4.6) |
| Tokens from updated tokenizer | +0–35% (content-dependent) |
| Tokens at matched effort level | Higher at high / xhigh / max (more thinking) |
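The net effect of these levers is simple arithmetic. A sketch with hypothetical per-task token counts — the 1.2× input inflation and the token figures below are illustrative assumptions, not measured values; only the $5 / $25 rates come from the table above:

```python
RATE_IN, RATE_OUT = 5.00, 25.00  # $ per 1M tokens, <=200K prompts (both models)

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one completed task at the shared Opus rate."""
    return (input_tokens * RATE_IN + output_tokens * RATE_OUT) / 1_000_000

# Hypothetical task at matched quality: medium-effort 4.6 vs low-effort 4.7.
cost_46 = task_cost(input_tokens=40_000, output_tokens=8_000)
# Assume 1.2x tokenizer inflation on input, while the lower effort tier
# roughly halves thinking output (both numbers are made up for illustration).
cost_47 = task_cost(input_tokens=int(40_000 * 1.2), output_tokens=4_000)

print(f"4.6 medium: ${cost_46:.2f}  4.7 low: ${cost_47:.2f}")  # $0.40 vs $0.34
```

Under those assumptions the tokenizer inflation costs $0.04 on input but the cheaper effort tier saves $0.10 on output, which is the shape of the "cost per completed task drops" claim. Plug in your own measured counts before trusting the direction for a specific workload.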
Migration: 4.6 → 4.7
Opus 4.7 is API-compatible with 4.6. The surface change is the model ID (claude-opus-4-6 → claude-opus-4-7). Everything else is an evaluation exercise, not a code change.
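Since the API surface is unchanged, the code diff can be as small as one string. A hedged sketch of a request-payload builder — the `effort` field name is an assumption for illustration; check Anthropic's API reference for the exact parameter your SDK version exposes:

```python
def build_request(prompt: str, *, upgraded: bool = True, effort: str = "high") -> dict:
    """Assemble a Messages-API-style payload. Only the model ID changes
    between 4.6 and 4.7; the 'effort' key is an assumed/illustrative name."""
    return {
        "model": "claude-opus-4-7" if upgraded else "claude-opus-4-6",
        "max_tokens": 4096,
        "effort": effort,  # assumed field; 4.7 adds the 'xhigh' tier
        "messages": [{"role": "user", "content": prompt}],
    }

old = build_request("Refactor this module.", upgraded=False)
new = build_request("Refactor this module.", effort="xhigh")
print(old["model"], "->", new["model"])  # claude-opus-4-6 -> claude-opus-4-7
```

Gating the swap behind a flag like `upgraded` makes the A/B eval pass below trivial to wire up: same prompt, same parameters, two model IDs.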
Checklist
- Audit system prompts for literal-instruction risk. Hunt for phrases like "consider," "you might," and bullet lists of "suggestions." 4.7 reads these closer to hard requirements than 4.6 did.
- Re-measure token budgets on real traffic. The tokenizer change can shift the same text by up to 35%. Static cost forecasts built on 4.6 will drift; dynamic budgets that meter on-the-fly are unaffected.
- Pick an effort level intentionally. For coding and agentic use cases, start with high or xhigh. For simple classification or summarization, low on 4.7 typically matches medium on 4.6 at lower cost.
- Turn on task budgets if you run agents. Paired with xhigh, budgets let you say "think hard on this, but don't burn more than N tokens finishing it." Prevents runaway multi-agent branches.
- Downsample images only when needed. Opus 4.7 processes higher-res images automatically, which costs more tokens per image. Workloads that don't need the extra detail should downsample client-side before sending.
- Run an eval pass. Especially for BrowseComp-style browsing under a specific multi-agent harness, where 4.7 can regress. A quick A/B on a held-out set catches these before production.
- Apply to the Cyber Verification Program if relevant. Legitimate offensive-security work (vulnerability research, pen-testing, red-teaming) on 4.7 requires the new program to avoid default refusals. Not needed for defensive or general-purpose workloads.
Anthropic's official migration guide includes concrete prompt-tuning advice and token-usage comparisons per effort level.
When to Upgrade, When to Stay
| Your workload | Recommendation |
|---|---|
| Agentic coding (Claude Code, Cursor, Devin-style) | Upgrade. Largest gains: SWE-bench Pro +10.9pp, MCP-Atlas +14.6pp, and self-verification cuts confident-but-wrong completion reports. |
| Document analysis, charts, dense screenshots | Upgrade. CharXiv-R +13.4pp without tools; 3.3× vision resolution unblocks previously-cropped inputs. |
| Computer-use / browser agents | Upgrade with eval pass. OSWorld +5.3pp is strong, but BrowseComp −4.7pp flags harness-sensitive regressions. A/B on your real flows first. |
| Reasoning, knowledge, math | Upgrade. GPQA +2.9pp, HLE no-tools +6.9pp. Marginal on MMMLU (+0.4pp) but never worse. |
| Long-horizon autonomous workflows | Upgrade. File-system memory + self-verification are the biggest operational wins; partners report "longer autonomous runs" as the top behavioral change. |
| Offensive-security research | Apply to Cyber Verification Program. Default 4.7 will refuse more here; the verified program exists for legitimate pen-testing use. |
| Fixed token-budget pipelines | Re-measure first. Tokenizer change can shift spend up to 35%. Either re-cap budgets or switch to dynamic metering before upgrading. |
| Prompts that rely on loose interpretation | Audit first. 4.7 reads instructions more literally. Re-tune or stay on 4.6 until prompts are reviewed. |
For the full announcement, see Anthropic's Opus 4.7 launch post and system card. For a deeper breakdown of the release itself, see our Opus 4.7 launch overview.
Questions
Frequently Asked Questions
- Is Opus 4.7 better than Opus 4.6? Yes. Opus 4.7 wins on 12 of 14 reported benchmarks, with the largest jumps on MCP-Atlas (+14.6pp), CharXiv-R (+13.6pp), and SWE-bench Pro (+10.9pp). The two losses are BrowseComp (−4.7pp, sensitive to harness choice) and CyberGym (−0.7pp, intentionally flat). It also adds a new xhigh effort level, 3.3× higher-resolution vision, and self-verification on long agentic tasks.
- Does Opus 4.7 cost more? Per-token pricing is identical: $5 per 1M input tokens and $25 per 1M output tokens ($10 / $37.50 for prompts above 200K tokens). Opus 4.7 uses an updated tokenizer that can map the same text to 1.0–1.35× more tokens, and it thinks more at higher effort levels. On the other side of the ledger, early-access testing shows low-effort 4.7 matches medium-effort 4.6 quality, so cost per completed task typically drops.
- Is Opus 4.7 a drop-in replacement? For most workloads, yes, but run an eval pass first. Three things to watch: (1) prompts written for looser instruction-following can misfire because 4.7 follows instructions more literally; (2) token budgets built on 4.6 need re-measuring because of the tokenizer change; (3) if you depend on BrowseComp-style browsing under a specific harness, test before flipping the flag at scale.
- What effort levels does Opus 4.7 support? Opus 4.6 exposed low / medium / high / max. Opus 4.7 adds xhigh between high and max, giving a middle ground for reasoning-heavy work without the full latency of max. Claude Code now defaults to xhigh for all plans.
- What changed in the tokenizer? Opus 4.7 ships with an updated tokenizer. The same input text maps to roughly 1.0–1.35× more tokens than on Opus 4.6. Code and technical text land near the lower end of that range; heavily structured or multilingual content near the upper end. Cost forecasts built against 4.6 should be re-measured on real traffic before flipping the model flag in production.
- Did vision resolution improve? Yes. Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), versus ~1,568 pixels (~1.15 MP) on prior Claude models: 3.3× more pixel area. This is a model-level change applied automatically through the vision API, so images sent to Claude are processed at higher fidelity without any code change.
- When should you stay on Opus 4.6? Rare cases: workloads that rely on 4.6's tokenizer for fixed budgeting, apps tuned to BrowseComp-style browsing under a specific multi-agent harness, or pipelines whose prompts lean on loose instruction interpretation. For everything else, 4.7 is the better default at the same per-token price.