Comparison · Technical Deep Dive

Claude Opus 4.7 vs Opus 4.6

Head-to-head comparison of Claude Opus 4.7 vs Opus 4.6: benchmark deltas, pricing, effort levels, vision, tokenizer, and a migration checklist. Opus 4.7 wins 12 of 14 reported benchmarks at the same $5/$25 price.

Jonathan Chavez, Co-Founder @ LLM Stats · 10 min read

[Chart: "The Spectrum." Opus 4.6 vs Opus 4.7 across 14 benchmarks: same price, every benchmark moves. Individual scores appear in the tables below. Self-reported, Anthropic launch announcement.]

Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. It's a same-tier upgrade — not a new model family — and the pitch is specific: same price, higher capability per token, longer autonomous runs. This post is the direct head-to-head. Every benchmark, every price, every behavior change, and a migration checklist you can use as a reference when flipping the model flag in production.


The Verdict

Opus 4.7 is a drop-in upgrade to Opus 4.6. It beats 4.6 on 12 of 14 reported benchmarks, adds a new xhigh effort level, sees images at 3.3× higher resolution, follows instructions more literally, and introduces self-verification on long-running agentic work. The per-token price is unchanged. Low-effort 4.7 matches medium-effort 4.6 on quality, so the effective cost per completed task drops even though the $5 / $25 rate is identical.

Two caveats worth knowing before migrating. First, the updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, which affects budgeting. Second, because 4.7 takes prompts more literally, prompts tuned for 4.6's looser interpretation can misfire. Both are quick to audit but worth a pass.
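As a sketch of the budgeting math (the 1.0–1.35× range is from Anthropic's announcement; the helper names and the worst-case default here are my own):

```python
# Illustrative only: re-derive a static token budget for Opus 4.7 given
# the 1.0-1.35x tokenizer shift. The multiplier is content-dependent,
# so measure it on your own traffic rather than trusting the worst case.

def budget_for_47(budget_46: int, worst_case_multiplier: float = 1.35) -> int:
    """Scale a 4.6-era token budget to survive the 4.7 tokenizer."""
    return int(budget_46 * worst_case_multiplier)

def measured_multiplier(tokens_46: int, tokens_47: int) -> float:
    """Empirical multiplier from re-tokenizing the same corpus on both models."""
    return tokens_47 / tokens_46

# A 100K-token budget sized for 4.6 needs up to 135K to be safe on 4.7.
print(budget_for_47(100_000))  # 135000
```

Re-tokenizing a sample of real traffic and feeding the counts to `measured_multiplier` gives a tighter bound than the worst case.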


Side-by-Side at a Glance

Nothing on the commercial surface changed: same context window, same pricing tiers, same platforms. The differences are operational — new effort level, updated tokenizer, higher-resolution vision, and new behaviors around self-verification and instruction following.

| Spec | Opus 4.6 | Opus 4.7 |
| --- | --- | --- |
| Release date | Feb 5, 2026 | Apr 16, 2026 |
| Model ID | claude-opus-4-6 | claude-opus-4-7 |
| Input / output price (≤200K) | $5 / $25 per 1M | $5 / $25 per 1M |
| Input / output price (>200K) | $10 / $37.50 per 1M | $10 / $37.50 per 1M |
| Context window | 1M input / 128K output | 1M input / 128K output |
| Modalities | Text + image (~1.15 MP) | Text + image (~3.75 MP) |
| Effort levels | low / medium / high / max | low / medium / high / xhigh / max |
| Tokenizer | Original | Updated (1.0–1.35× more tokens) |
| Self-verification loop | No | Yes (Plan → Execute → Verify → Report) |
| File-system memory | Standard | Improved multi-session reuse |
| Cybersecurity safeguards | Standard | Project Glasswing safeguards |

Benchmark Deltas

Every benchmark below is self-reported by Anthropic in the Opus 4.7 launch announcement, sorted by delta so the largest moves appear first.

[Chart: all 14 benchmarks, Opus 4.6 to Opus 4.7. 12 gains, 2 regressions, average delta +5.6pp. Self-reported, Anthropic launch announcement; scores on a 0–100 scale, axis clipped to 35–100. Per-benchmark scores appear in the category tables below.]

The story on the benchmark page is that gains concentrate on the hardest, least-saturated problems: SWE-bench Pro (+10.9pp) jumps more than SWE-bench Verified (+6.8pp), HLE without tools (+6.9pp) jumps more than HLE with tools (+1.6pp), and MCP-Atlas (+14.6pp) — the agentic tool-use benchmark — takes the single largest jump of the release.

Coding

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| SWE-bench Verified | 80.8% | 87.6% | +6.8 |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 |
| Terminal-Bench 2.0 | 65.4% | 69.4% | +4.0 |

Partner numbers reinforce the benchmark story. Replit reports same-quality output at lower cost, Rakuten measured 3× more production tasks resolved, and Cursor reports 70% on CursorBench vs 58% for Opus 4.6.

Reasoning & Knowledge

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| GPQA Diamond | 91.3% | 94.2% | +2.9 |
| HLE (with tools) | 53.1% | 54.7% | +1.6 |
| HLE (without tools) | 40.0% | 46.9% | +6.9 |
| MMMLU | 91.1% | 91.5% | +0.4 |

Agents

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| MCP-Atlas | 62.7% | 77.3% | +14.6 |
| OSWorld-Verified | 72.7% | 78.0% | +5.3 |
| Finance Agent SOTA | 60.7% | 64.4% | +3.7 |
| BrowseComp | 84.0% | 79.3% | −4.7 |
| CyberGym | 73.8% | 73.1% | −0.7 |

BrowseComp is the only real regression. Opus 4.6's 84.0% was measured under a multi-agent harness at max effort; the comparison is sensitive to harness choice, but the delta is honest enough to flag. CyberGym is effectively flat by design — Anthropic states it "experimented with efforts to differentially reduce" offensive cyber capabilities during training.

Vision

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| CharXiv-R (with tools) | 77.4% | 91.0% | +13.6 |
| CharXiv-R (without tools) | 68.7% | 82.1% | +13.4 |

The no-tools delta is the one that matters: +13.4pp isolates the vision capability itself rather than the image-cropping tool. XBOW similarly reports a 98.5% visual-acuity score on Opus 4.7 (vs 54.5% on 4.6) — large enough to unblock autonomous pen-testing workflows that weren't viable on 4.6.


Win / Loss Scorecard

Grouped by capability domain. Cybersecurity is called out separately because it's a deliberate non-improvement rather than a regression.

Four domains, each moving forward; the one regression lives in agents.

Coding: plan, implement, ship.
  • +10.9 SWE-bench Pro
  • +6.8 SWE-bench Verified
  • +4.0 Terminal-Bench 2.0

Reasoning: the raw capability floor.
  • +6.9 HLE (no tools)
  • +2.9 GPQA Diamond
  • +1.6 HLE (with tools)
  • +0.4 MMMLU

Agents: long-running autonomous work.
  • +14.6 MCP-Atlas
  • +5.3 OSWorld-Verified
  • +3.7 Finance Agent
  • −4.7 BrowseComp

Vision: higher-fidelity perception.
  • +13.6 CharXiv-R (tools)
  • +13.4 CharXiv-R (no tools)

Cybersecurity (CyberGym −0.7) is intentionally flat and excluded from the category tally.

What Actually Changed

Benchmarks are the score; behaviors are the story. Four operational changes make Opus 4.7 feel different in practice, even on prompts where the benchmark delta is small.

Self-verification before reporting

Anthropic's framing: Opus 4.7 "devises ways to verify its own outputs before reporting back." In practice this means the model writes tests, runs sanity checks, and inspects its own output before declaring a task complete. Vercel reports 4.7 "does proofs on systems code before starting work" — behavior not seen on 4.6. On long agentic runs, this is the single change most users report feeling first: fewer confident-but-wrong reports back.
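The loop itself runs inside the model, but the pattern is easy to picture as an application-level harness. A minimal sketch, with every name hypothetical:

```python
# Sketch of the Plan -> Execute -> Verify -> Report pattern described in
# the post, expressed as an application-level harness. Opus 4.7 runs an
# equivalent loop internally; nothing here is an Anthropic API.
from typing import Callable

def run_with_verification(
    plan: Callable[[str], list[str]],
    execute: Callable[[str], str],
    verify: Callable[[str], bool],
    task: str,
    max_retries: int = 2,
) -> str:
    """Execute each planned step, re-attempting any step that fails its check."""
    results = []
    for step in plan(task):
        for _ in range(max_retries + 1):
            output = execute(step)
            if verify(output):          # e.g. run tests, sanity-check the diff
                results.append(output)
                break
        else:
            raise RuntimeError(f"step failed verification: {step}")
    return "\n".join(results)           # the "report" ships only verified work
```

The point of the pattern is the `else` branch: unverified work never reaches the report, it either retries or fails loudly.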

Literal instruction following

Opus 4.7 follows instructions more literally than 4.6 or any prior Claude model. Anthropic flags this explicitly as a migration concern: prompts that depended on loose interpretation may now produce unexpected results because 4.7 takes the wording at face value. The most common failure mode is bullet lists of "suggestions" that 4.6 treated as optional hints being read as hard requirements on 4.7. Audit system prompts before rollout.
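A quick way to run that audit is a phrase scan over your system prompts. The phrase list below is illustrative, seeded with the examples above; extend it with your own prompt idioms:

```python
# Flag phrases that Opus 4.6 treated as optional hints but Opus 4.7 may
# read as hard requirements. Patterns are illustrative, not exhaustive.
import re

SOFT_PHRASES = [r"\bconsider\b", r"\byou might\b", r"\bsuggestions?\b",
                r"\bif possible\b", r"\bideally\b"]

def audit_prompt(prompt: str) -> list[str]:
    """Return every soft-phrase pattern found in a system prompt."""
    return [p for p in SOFT_PHRASES
            if re.search(p, prompt, flags=re.IGNORECASE)]

print(audit_prompt("Consider adding tests. Suggestions: keep diffs small."))
```

Any non-empty result is a prompt to re-read with 4.7's literal interpretation in mind.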

Higher-resolution vision

Images up to 2,576 pixels on the long edge (~3.75 MP) — versus ~1,568 px (~1.15 MP) on 4.6. 3.3× more pixel area per image, applied automatically through the vision API. Two practical consequences: computer-use agents can read dense screenshots without operator-side pre-cropping, and data extraction from complex diagrams jumps sharply (see the +13.4pp CharXiv-R no-tools delta).
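The 3.3× figure follows directly from the stated megapixel numbers:

```python
# Check the 3.3x claim from the megapixel figures in the launch post
# (long-edge limits: ~1,568 px on 4.6 vs 2,576 px on 4.7).
mp_46 = 1.15   # ~megapixels per image, Opus 4.6
mp_47 = 3.75   # ~megapixels per image, Opus 4.7

area_ratio = mp_47 / mp_46
print(f"{area_ratio:.1f}x more pixel area")  # 3.3x more pixel area
```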

File-system memory for multi-session work

Opus 4.7 is better at reading, writing, and reusing notes on a persistent file system across sessions. For agents that work over days rather than minutes — think a long-running engineering task that spans multiple model turns across multiple sessions — this removes the need to re-establish context at the start of every run.
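A minimal sketch of that pattern from the application side; the path, file format, and note structure are illustrative, not an Anthropic API:

```python
# File-system-notes pattern: an agent persists a scratchpad between
# sessions instead of re-deriving context at the start of every run.
import json
from pathlib import Path

def load_notes(path: Path) -> dict:
    """Previous sessions' notes, or a fresh scratchpad on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"decisions": [], "open_questions": []}

def save_notes(path: Path, notes: dict) -> None:
    path.write_text(json.dumps(notes, indent=2))

# Session 1 records a decision; session 2 starts with it already loaded.
notes_file = Path("agent_notes.json")
notes = load_notes(notes_file)
notes["decisions"].append("adopted claude-opus-4-7 for the refactor task")
save_notes(notes_file, notes)
```

On a multi-day task, the scratchpad replaces the expensive "re-read the repo and reconstruct state" step at the top of each session.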

New xhigh effort level

xhigh is a new tier between high and max. It gives developers finer control over the reasoning-vs-latency tradeoff: more thinking than high, without the full cost of max. Claude Code raised its default effort to xhigh for all plans on the 4.7 rollout.
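One way to operationalize tier choice is a simple task-type mapping. The tier names match the post; how a tier is passed to the API is not shown here, and the mapping itself is an assumption to adapt:

```python
# Hedged sketch: pick an effort tier per task type, following the post's
# guidance (low ~ 4.6's medium for simple tasks; xhigh for reasoning-
# heavy agentic work). The table below is illustrative, not prescriptive.
EFFORT_BY_TASK = {
    "classification": "low",
    "summarization": "low",
    "coding": "high",
    "agentic": "xhigh",
    "research": "max",
}

def pick_effort(task_type: str) -> str:
    return EFFORT_BY_TASK.get(task_type, "medium")  # safe middle default

print(pick_effort("agentic"))  # xhigh
```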


Cost: Same Price, Fewer Tokens

Per-token pricing is identical. The cost story lives at the task level: Hex's early-access testing found low-effort Opus 4.7 matches medium-effort Opus 4.6 on quality, so the same completed work uses meaningfully fewer tokens at a cheaper effort tier. Anthropic's internal coding eval similarly reports token usage per completed task improved at every effort level — accuracy rises faster than token spend.

[Chart: relative cost for the same finished task, medium-effort Opus 4.6 vs low-effort Opus 4.7. Per-token price identical; tokenizer shift 1.0–1.35×; output tokens rise at high / xhigh / max. Illustrative of Hex's early-access testing; your workload may vary.]

There are two offsetting pressures to plan for. The updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, so static token budgets built on 4.6 need re-measuring. And at higher effort levels, 4.7 thinks more than 4.6 did — output token counts can rise on reasoning-heavy turns even when the task itself is the same. On aggregate, these effects are smaller than the quality-per-effort-tier gain, but any individual workload should be benchmarked before flipping the flag at scale.

| Cost lever | Effect on spend vs 4.6 |
| --- | --- |
| Per-token price (≤200K) | Identical ($5 / $25 per 1M) |
| Tokens per task at matched quality | Lower (low-effort 4.7 ≈ medium-effort 4.6) |
| Tokens from updated tokenizer | +0–35% (content-dependent) |
| Tokens at matched effort level | Higher at high / xhigh / max (more thinking) |
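A back-of-envelope model combining these levers (every token count and multiplier below is an illustrative assumption, not a measurement):

```python
# Cost-per-task comparison under assumed numbers: 4.6 at medium effort
# vs 4.7 at low effort. Swap in tokens-per-task and a tokenizer
# multiplier measured on your own traffic before trusting the result.
PRICE_IN, PRICE_OUT = 5.00, 25.00   # $ per 1M tokens, both models

def task_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

# Hypothetical task on 4.6 at medium effort.
cost_46 = task_cost(tokens_in=20_000, tokens_out=4_000)
# Same task on 4.7 at low effort: input re-tokenizes ~1.2x larger
# (assumed), output shrinks ~40% at the cheaper tier (assumed).
cost_47 = task_cost(tokens_in=24_000, tokens_out=2_400)

print(f"4.6 medium: ${cost_46:.3f}   4.7 low: ${cost_47:.3f}")
```

Under these assumptions the tokenizer penalty on input is more than offset by the cheaper effort tier's smaller output, which is the shape of Hex's reported result.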

Migration: 4.6 → 4.7

Opus 4.7 is API-compatible with 4.6. The surface change is the model ID (claude-opus-4-6 → claude-opus-4-7). Everything else is an evaluation exercise, not a code change.
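One common way to flip that flag safely is a deterministic percentage rollout keyed on a stable ID, so the same caller always hits the same model while evals run. The model IDs are from this post; the rollout helper is a generic pattern, not an Anthropic API:

```python
# Deterministic percentage rollout on a stable key (user or session id):
# the same key always lands in the same bucket, so a caller never
# flip-flops between models mid-rollout.
import hashlib

OLD, NEW = "claude-opus-4-6", "claude-opus-4-7"

def pick_model(stable_key: str, rollout_pct: int) -> str:
    """Route rollout_pct% of keys to the new model, deterministically."""
    bucket = int(hashlib.sha256(stable_key.encode()).hexdigest(), 16) % 100
    return NEW if bucket < rollout_pct else OLD

print(pick_model("user-1234", rollout_pct=10))
```

Ramp `rollout_pct` from 10 to 100 as the eval pass and cost re-measurement come back clean.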

Checklist

  • Audit system prompts for literal-instruction risk. Hunt for phrases like "consider," "you might," and bullet lists of "suggestions." 4.7 reads these closer to hard requirements than 4.6 did.
  • Re-measure token budgets on real traffic. The tokenizer change can shift the same text by up to 35%. Static cost forecasts built on 4.6 will drift; dynamic budgets that meter on-the-fly are unaffected.
  • Pick an effort level intentionally. For coding and agentic use cases, start with high or xhigh. For simple classification or summarization, low on 4.7 typically matches medium on 4.6 at lower cost.
  • Turn on task budgets if you run agents. Paired with xhigh, budgets let you say "think hard on this, but don't burn more than N tokens finishing it." Prevents runaway multi-agent branches.
  • Downsample images only when needed. Opus 4.7 processes higher-res images automatically, which costs more tokens per image. Workloads that don't need the extra detail should downsample client-side before sending.
  • Run an eval pass. Especially for BrowseComp-style browsing under a specific multi-agent harness, where 4.7 can regress. A quick A/B on a held-out set catches these before production.
  • Apply to the Cyber Verification Program if relevant. Legitimate offensive-security work (vulnerability research, pen-testing, red-teaming) on 4.7 requires the new program to avoid default refusals. Not needed for defensive or general-purpose workloads.
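For the eval pass, a paired A/B over a held-out set is enough to catch harness-sensitive regressions. A minimal sketch with the grader left as a stub:

```python
# Paired A/B eval: score both models on the same held-out cases and
# tally wins / losses / ties. The scoring functions are stubs; plug in
# your own grader (exact match, LLM judge, test-suite pass rate, etc.).
from typing import Callable

def ab_eval(score_old: Callable[[str], float],
            score_new: Callable[[str], float],
            held_out: list[str]) -> dict[str, int]:
    tally = {"win": 0, "loss": 0, "tie": 0}
    for case in held_out:
        old_s, new_s = score_old(case), score_new(case)
        if new_s > old_s:
            tally["win"] += 1
        elif new_s < old_s:
            tally["loss"] += 1
        else:
            tally["tie"] += 1
    return tally
```

A loss count concentrated in one task family (say, browsing) is exactly the BrowseComp-style signal to catch before flipping the flag at scale.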

Anthropic's official migration guide includes concrete prompt-tuning advice and token-usage comparisons per effort level.


When to Upgrade, When to Stay

| Your workload | Recommendation |
| --- | --- |
| Agentic coding (Claude Code, Cursor, Devin-style) | Upgrade. Largest gains: SWE-bench Pro +10.9pp, MCP-Atlas +14.6pp; self-verification cuts confident-but-wrong reports. |
| Document analysis, charts, dense screenshots | Upgrade. CharXiv-R +13.4pp without tools; 3.3× vision resolution unblocks previously cropped inputs. |
| Computer-use / browser agents | Upgrade with an eval pass. OSWorld +5.3pp is strong, but BrowseComp −4.7pp flags harness-sensitive regressions; A/B on your real flows first. |
| Reasoning, knowledge, math | Upgrade. GPQA +2.9pp, HLE no-tools +6.9pp. Marginal on MMMLU (+0.4pp) but never worse. |
| Long-horizon autonomous workflows | Upgrade. File-system memory and self-verification are the biggest operational wins; partners report "longer autonomous runs" as the top behavioral change. |
| Offensive-security research | Apply to the Cyber Verification Program. Default 4.7 will refuse more here; the verified program exists for legitimate pen-testing use. |
| Fixed token-budget pipelines | Re-measure first. The tokenizer change can shift spend up to 35%; either re-cap budgets or switch to dynamic metering before upgrading. |
| Prompts that rely on loose interpretation | Audit first. 4.7 reads instructions more literally; re-tune or stay on 4.6 until prompts are reviewed. |

For the full announcement, see Anthropic's Opus 4.7 launch post and system card. For a deeper breakdown of the release itself, see our Opus 4.7 launch overview.


Frequently Asked Questions

  • Is Opus 4.7 better than Opus 4.6? Yes. Opus 4.7 wins on 12 of 14 reported benchmarks, with the largest jumps on MCP-Atlas (+14.6pp), CharXiv-R (+13.6pp), and SWE-bench Pro (+10.9pp). The two losses are BrowseComp (−4.7pp, sensitive to harness choice) and CyberGym (−0.7pp, intentionally flat). It also adds a new xhigh effort level, 3.3× higher-resolution vision, and self-verification on long agentic tasks.
  • Did pricing change? No. Per-token pricing is identical: $5 per 1M input tokens and $25 per 1M output tokens (or $10 / $37.50 above 200K-token prompts). Opus 4.7 uses an updated tokenizer that can map the same text to 1.0–1.35× more tokens, and it thinks more at higher effort levels. On the other side of the ledger, early-access testing shows low-effort 4.7 matches medium-effort 4.6 quality, so cost per completed task typically drops.
  • Is Opus 4.7 a drop-in replacement for 4.6? For most workloads, yes — but run an eval pass first. Three things to watch: (1) prompts written for looser instruction-following can misfire because 4.7 follows instructions more literally; (2) token budgets built on 4.6 need re-measuring because of the tokenizer change; (3) if you depend on BrowseComp-style browsing under a specific harness, test before flipping the flag at scale.
  • What effort levels does Opus 4.7 support? Opus 4.6 exposed low / medium / high / max. Opus 4.7 adds xhigh between high and max, giving a middle ground for reasoning-heavy work without the full latency of max. Claude Code now defaults to xhigh for all plans.
  • Did the tokenizer change? Yes. Opus 4.7 ships with an updated tokenizer: the same input text maps to roughly 1.0–1.35× more tokens than on Opus 4.6. Code and technical text land near the lower end of that range; heavily structured or multilingual content near the upper end. Cost forecasts built against 4.6 should be re-measured on real traffic before flipping the model flag in production.
  • Does Opus 4.7 handle higher-resolution images? Yes. Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), versus ~1,568 pixels (~1.15 MP) on prior Claude models — 3.3× more pixel area. This is a model-level change applied automatically through the vision API, so images sent to Claude are processed at higher fidelity without any code change.
  • When should I stay on Opus 4.6? Rare cases: workloads that rely on 4.6's tokenizer for fixed budgeting, apps tuned to BrowseComp-style browsing under a specific multi-agent harness, or pipelines whose prompts lean on loose instruction interpretation. For everything else, 4.7 is the better default at the same per-token price.
