Comparison · Technical Deep Dive

Claude Opus 4.7 vs Opus 4.6

Head-to-head comparison of Claude Opus 4.7 vs Opus 4.6: benchmark deltas, pricing, effort levels, vision, tokenizer, and a migration checklist. Opus 4.7 wins 12 of 14 reported benchmarks at the same $5/$25 price.

Jonathan Chavez, Co-Founder @ LLM Stats · 10 min read

[Chart: "The Spectrum." Opus 4.6 vs Opus 4.7 across 14 benchmarks: same price, every benchmark moves. Individual scores appear in the tables below. Self-reported, Anthropic launch announcement.]

Anthropic released Claude Opus 4.7 on April 16, 2026, two months after Opus 4.6. It's a same-tier upgrade — not a new model family — and the pitch is specific: same price, higher capability per token, longer autonomous runs. This post is the direct head-to-head. Every benchmark, every price, every behavior change, and a migration checklist you can use as a reference when flipping the model flag in production.


The Verdict

Opus 4.7 is a drop-in upgrade to Opus 4.6. It beats 4.6 on 12 of 14 reported benchmarks, adds a new xhigh effort level, sees images at 3.3× higher resolution, follows instructions more literally, and introduces self-verification on long-running agentic work. The per-token price is unchanged. Low-effort 4.7 matches medium-effort 4.6 on quality, so the effective cost per completed task drops even though the $5 / $25 rate is identical.

Two caveats worth knowing before migrating. First, the updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, which affects budgeting. Second, because 4.7 takes prompts more literally, prompts tuned for 4.6's looser interpretation can misfire. Both are quick to audit but worth a pass.
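As a sketch of the budgeting math (the 1.0–1.35× range is from Anthropic's announcement; the helper names and the worst-case default here are my own):

```python
# Illustrative only: re-derive a static token budget for Opus 4.7 given
# the 1.0-1.35x tokenizer shift. The multiplier is content-dependent,
# so measure it on your own traffic rather than trusting the worst case.

def budget_for_47(budget_46: int, worst_case_multiplier: float = 1.35) -> int:
    """Scale a 4.6-era token budget to survive the 4.7 tokenizer."""
    return int(budget_46 * worst_case_multiplier)

def measured_multiplier(tokens_46: int, tokens_47: int) -> float:
    """Empirical multiplier from re-tokenizing the same corpus on both models."""
    return tokens_47 / tokens_46

# A 100K-token budget sized for 4.6 needs up to 135K to be safe on 4.7.
print(budget_for_47(100_000))  # 135000
```

Re-tokenizing a sample of real traffic and feeding the counts to `measured_multiplier` gives a tighter bound than the worst case.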


Side-by-Side at a Glance

Nothing on the commercial surface changed: same context window, same pricing tiers, same platforms. The differences are operational — new effort level, updated tokenizer, higher-resolution vision, and new behaviors around self-verification and instruction following.

| Spec | Opus 4.6 | Opus 4.7 |
| --- | --- | --- |
| Release date | Feb 5, 2026 | Apr 16, 2026 |
| Model ID | claude-opus-4-6 | claude-opus-4-7 |
| Input / output price (≤200K) | $5 / $25 per 1M | $5 / $25 per 1M |
| Input / output price (>200K) | $10 / $37.50 per 1M | $10 / $37.50 per 1M |
| Context window | 1M input / 128K output | 1M input / 128K output |
| Modalities | Text + image (~1.15 MP) | Text + image (~3.75 MP) |
| Effort levels | low / medium / high / max | low / medium / high / xhigh / max |
| Tokenizer | Original | Updated (1.0–1.35× more tokens) |
| Self-verification loop | No | Yes (Plan → Execute → Verify → Report) |
| File-system memory | Standard | Improved multi-session reuse |
| Cybersecurity safeguards | Standard | Project Glasswing safeguards |

Benchmark Deltas

Every benchmark below is self-reported by Anthropic in the Opus 4.7 launch announcement, sorted by delta so the largest moves appear first.

[Chart: all 14 benchmarks, Opus 4.6 to Opus 4.7. 12 gains, 2 regressions, average delta +5.6pp. Self-reported, Anthropic launch announcement; scores on a 0–100 scale, axis clipped to 35–100. Per-benchmark scores appear in the category tables below.]

The story on the benchmark page is that gains concentrate on the hardest, least-saturated problems: SWE-bench Pro (+10.9pp) jumps more than SWE-bench Verified (+6.8pp), HLE without tools (+6.9pp) jumps more than HLE with tools (+1.6pp), and MCP-Atlas (+14.6pp) — the agentic tool-use benchmark — takes the single largest jump of the release.

Coding

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| SWE-bench Verified | 80.8% | 87.6% | +6.8 |
| SWE-bench Pro | 53.4% | 64.3% | +10.9 |
| Terminal-Bench 2.0 | 65.4% | 69.4% | +4.0 |

Partner numbers reinforce the benchmark story. Replit reports same-quality output at lower cost, Rakuten measured 3× more production tasks resolved, and Cursor reports 70% on CursorBench vs 58% for Opus 4.6.

Reasoning & Knowledge

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| GPQA Diamond | 91.3% | 94.2% | +2.9 |
| HLE (with tools) | 53.1% | 54.7% | +1.6 |
| HLE (without tools) | 40.0% | 46.9% | +6.9 |
| MMMLU | 91.1% | 91.5% | +0.4 |

Agents

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| MCP-Atlas | 62.7% | 77.3% | +14.6 |
| OSWorld-Verified | 72.7% | 78.0% | +5.3 |
| Finance Agent SOTA | 60.7% | 64.4% | +3.7 |
| BrowseComp | 84.0% | 79.3% | −4.7 |
| CyberGym | 73.8% | 73.1% | −0.7 |

BrowseComp is the only real regression. Opus 4.6's 84.0% was measured under a multi-agent harness at max effort; the comparison is sensitive to harness choice, but the delta is honest enough to flag. CyberGym is effectively flat by design — Anthropic states it "experimented with efforts to differentially reduce" offensive cyber capabilities during training.

Vision

| Benchmark | Opus 4.6 | Opus 4.7 | Delta |
| --- | --- | --- | --- |
| CharXiv-R (with tools) | 77.4% | 91.0% | +13.6 |
| CharXiv-R (without tools) | 68.7% | 82.1% | +13.4 |

The no-tools delta is the one that matters: +13.4pp isolates the vision capability itself rather than the image-cropping tool. XBOW similarly reports a 98.5% visual-acuity score on Opus 4.7 (vs 54.5% on 4.6) — large enough to unblock autonomous pen-testing workflows that weren't viable on 4.6.


Win / Loss Scorecard

Grouped by capability domain. Cybersecurity is called out separately because it's a deliberate non-improvement rather than a regression.

Four domains, each moving forward; the one regression lives in agents.

Coding: plan, implement, ship.
  • +10.9 SWE-bench Pro
  • +6.8 SWE-bench Verified
  • +4.0 Terminal-Bench 2.0

Reasoning: the raw capability floor.
  • +6.9 HLE (no tools)
  • +2.9 GPQA Diamond
  • +1.6 HLE (with tools)
  • +0.4 MMMLU

Agents: long-running autonomous work.
  • +14.6 MCP-Atlas
  • +5.3 OSWorld-Verified
  • +3.7 Finance Agent
  • −4.7 BrowseComp

Vision: higher-fidelity perception.
  • +13.6 CharXiv-R (tools)
  • +13.4 CharXiv-R (no tools)

Cybersecurity (CyberGym −0.7) is intentionally flat and excluded from the category tally.

What Actually Changed

Benchmarks are the score; behaviors are the story. Four operational changes make Opus 4.7 feel different in practice, even on prompts where the benchmark delta is small.

Self-verification before reporting

Anthropic's framing: Opus 4.7 "devises ways to verify its own outputs before reporting back." In practice this means the model writes tests, runs sanity checks, and inspects its own output before declaring a task complete. Vercel reports 4.7 "does proofs on systems code before starting work" — behavior not seen on 4.6. On long agentic runs, this is the single change most users report feeling first: fewer confident-but-wrong reports back.
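The loop itself runs inside the model, but the pattern is easy to picture as an application-level harness. A minimal sketch, with every name hypothetical:

```python
# Sketch of the Plan -> Execute -> Verify -> Report pattern described in
# the post, expressed as an application-level harness. Opus 4.7 runs an
# equivalent loop internally; nothing here is an Anthropic API.
from typing import Callable

def run_with_verification(
    plan: Callable[[str], list[str]],
    execute: Callable[[str], str],
    verify: Callable[[str], bool],
    task: str,
    max_retries: int = 2,
) -> str:
    """Execute each planned step, re-attempting any step that fails its check."""
    results = []
    for step in plan(task):
        for _ in range(max_retries + 1):
            output = execute(step)
            if verify(output):          # e.g. run tests, sanity-check the diff
                results.append(output)
                break
        else:
            raise RuntimeError(f"step failed verification: {step}")
    return "\n".join(results)           # the "report" ships only verified work
```

The point of the pattern is the `else` branch: unverified work never reaches the report, it either retries or fails loudly.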

Literal instruction following

Opus 4.7 follows instructions more literally than 4.6 or any prior Claude model. Anthropic flags this explicitly as a migration concern: prompts that depended on loose interpretation may now produce unexpected results because 4.7 takes the wording at face value. The most common failure mode is bullet lists of "suggestions" that 4.6 treated as optional hints being read as hard requirements on 4.7. Audit system prompts before rollout.
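A quick way to run that audit is a phrase scan over your system prompts. The phrase list below is illustrative, seeded with the examples above; extend it with your own prompt idioms:

```python
# Flag phrases that Opus 4.6 treated as optional hints but Opus 4.7 may
# read as hard requirements. Patterns are illustrative, not exhaustive.
import re

SOFT_PHRASES = [r"\bconsider\b", r"\byou might\b", r"\bsuggestions?\b",
                r"\bif possible\b", r"\bideally\b"]

def audit_prompt(prompt: str) -> list[str]:
    """Return every soft-phrase pattern found in a system prompt."""
    return [p for p in SOFT_PHRASES
            if re.search(p, prompt, flags=re.IGNORECASE)]

print(audit_prompt("Consider adding tests. Suggestions: keep diffs small."))
```

Any non-empty result is a prompt to re-read with 4.7's literal interpretation in mind.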

Higher-resolution vision

Images up to 2,576 pixels on the long edge (~3.75 MP) — versus ~1,568 px (~1.15 MP) on 4.6. 3.3× more pixel area per image, applied automatically through the vision API. Two practical consequences: computer-use agents can read dense screenshots without operator-side pre-cropping, and data extraction from complex diagrams jumps sharply (see the +13.4pp CharXiv-R no-tools delta).
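The 3.3× figure follows directly from the stated megapixel numbers:

```python
# Check the 3.3x claim from the megapixel figures in the launch post
# (long-edge limits: ~1,568 px on 4.6 vs 2,576 px on 4.7).
mp_46 = 1.15   # ~megapixels per image, Opus 4.6
mp_47 = 3.75   # ~megapixels per image, Opus 4.7

area_ratio = mp_47 / mp_46
print(f"{area_ratio:.1f}x more pixel area")  # 3.3x more pixel area
```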

File-system memory for multi-session work

Opus 4.7 is better at reading, writing, and reusing notes on a persistent file system across sessions. For agents that work over days rather than minutes — think a long-running engineering task that spans multiple model turns across multiple sessions — this removes the need to re-establish context at the start of every run.
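A minimal sketch of that pattern from the application side; the path, file format, and note structure are illustrative, not an Anthropic API:

```python
# File-system-notes pattern: an agent persists a scratchpad between
# sessions instead of re-deriving context at the start of every run.
import json
from pathlib import Path

def load_notes(path: Path) -> dict:
    """Previous sessions' notes, or a fresh scratchpad on the first run."""
    if path.exists():
        return json.loads(path.read_text())
    return {"decisions": [], "open_questions": []}

def save_notes(path: Path, notes: dict) -> None:
    path.write_text(json.dumps(notes, indent=2))

# Session 1 records a decision; session 2 starts with it already loaded.
notes_file = Path("agent_notes.json")
notes = load_notes(notes_file)
notes["decisions"].append("adopted claude-opus-4-7 for the refactor task")
save_notes(notes_file, notes)
```

On a multi-day task, the scratchpad replaces the expensive "re-read the repo and reconstruct state" step at the top of each session.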

New xhigh effort level

xhigh is a new tier between high and max. It gives developers finer control over the reasoning-vs-latency tradeoff: more thinking than high, without the full cost of max. Claude Code raised its default effort to xhigh for all plans on the 4.7 rollout.
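One way to operationalize tier choice is a simple task-type mapping. The tier names match the post; how a tier is passed to the API is not shown here, and the mapping itself is an assumption to adapt:

```python
# Hedged sketch: pick an effort tier per task type, following the post's
# guidance (low ~ 4.6's medium for simple tasks; xhigh for reasoning-
# heavy agentic work). The table below is illustrative, not prescriptive.
EFFORT_BY_TASK = {
    "classification": "low",
    "summarization": "low",
    "coding": "high",
    "agentic": "xhigh",
    "research": "max",
}

def pick_effort(task_type: str) -> str:
    return EFFORT_BY_TASK.get(task_type, "medium")  # safe middle default

print(pick_effort("agentic"))  # xhigh
```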


Cost: Same Price, Fewer Tokens

Per-token pricing is identical. The cost story lives at the task level: Hex's early-access testing found low-effort Opus 4.7 matches medium-effort Opus 4.6 on quality, so the same completed work uses meaningfully fewer tokens at a cheaper effort tier. Anthropic's internal coding eval similarly reports token usage per completed task improved at every effort level — accuracy rises faster than token spend.

[Chart: relative cost for the same finished task, medium-effort Opus 4.6 vs low-effort Opus 4.7. Per-token price identical; tokenizer shift 1.0–1.35×; output tokens rise at high / xhigh / max. Illustrative of Hex's early-access testing; your workload may vary.]

There are two offsetting pressures to plan for. The updated tokenizer can map the same text to 1.0–1.35× more tokens than 4.6, so static token budgets built on 4.6 need re-measuring. And at higher effort levels, 4.7 thinks more than 4.6 did — output token counts can rise on reasoning-heavy turns even when the task itself is the same. On aggregate, these effects are smaller than the quality-per-effort-tier gain, but any individual workload should be benchmarked before flipping the flag at scale.

| Cost lever | Effect on spend vs 4.6 |
| --- | --- |
| Per-token price (≤200K) | Identical ($5 / $25 per 1M) |
| Tokens per task at matched quality | Lower (low-effort 4.7 ≈ medium-effort 4.6) |
| Tokens from updated tokenizer | +0–35% (content-dependent) |
| Tokens at matched effort level | Higher at high / xhigh / max (more thinking) |
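A back-of-envelope model combining these levers (every token count and multiplier below is an illustrative assumption, not a measurement):

```python
# Cost-per-task comparison under assumed numbers: 4.6 at medium effort
# vs 4.7 at low effort. Swap in tokens-per-task and a tokenizer
# multiplier measured on your own traffic before trusting the result.
PRICE_IN, PRICE_OUT = 5.00, 25.00   # $ per 1M tokens, both models

def task_cost(tokens_in: int, tokens_out: int) -> float:
    return tokens_in / 1e6 * PRICE_IN + tokens_out / 1e6 * PRICE_OUT

# Hypothetical task on 4.6 at medium effort.
cost_46 = task_cost(tokens_in=20_000, tokens_out=4_000)
# Same task on 4.7 at low effort: input re-tokenizes ~1.2x larger
# (assumed), output shrinks ~40% at the cheaper tier (assumed).
cost_47 = task_cost(tokens_in=24_000, tokens_out=2_400)

print(f"4.6 medium: ${cost_46:.3f}   4.7 low: ${cost_47:.3f}")
```

Under these assumptions the tokenizer penalty on input is more than offset by the cheaper effort tier's smaller output, which is the shape of Hex's reported result.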

Migration: 4.6 → 4.7

Opus 4.7 is API-compatible with 4.6. The surface change is the model ID (claude-opus-4-6 → claude-opus-4-7). Everything else is an evaluation exercise, not a code change.
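One common way to flip that flag safely is a deterministic percentage rollout keyed on a stable ID, so the same caller always hits the same model while evals run. The model IDs are from this post; the rollout helper is a generic pattern, not an Anthropic API:

```python
# Deterministic percentage rollout on a stable key (user or session id):
# the same key always lands in the same bucket, so a caller never
# flip-flops between models mid-rollout.
import hashlib

OLD, NEW = "claude-opus-4-6", "claude-opus-4-7"

def pick_model(stable_key: str, rollout_pct: int) -> str:
    """Route rollout_pct% of keys to the new model, deterministically."""
    bucket = int(hashlib.sha256(stable_key.encode()).hexdigest(), 16) % 100
    return NEW if bucket < rollout_pct else OLD

print(pick_model("user-1234", rollout_pct=10))
```

Ramp `rollout_pct` from 10 to 100 as the eval pass and cost re-measurement come back clean.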

Checklist

  • Audit system prompts for literal-instruction risk. Hunt for phrases like "consider," "you might," and bullet lists of "suggestions." 4.7 reads these closer to hard requirements than 4.6 did.
  • Re-measure token budgets on real traffic. The tokenizer change can shift the same text by up to 35%. Static cost forecasts built on 4.6 will drift; dynamic budgets that meter on-the-fly are unaffected.
  • Pick an effort level intentionally. For coding and agentic use cases, start with high or xhigh. For simple classification or summarization, low on 4.7 typically matches medium on 4.6 at lower cost.
  • Turn on task budgets if you run agents. Paired with xhigh, budgets let you say "think hard on this, but don't burn more than N tokens finishing it." Prevents runaway multi-agent branches.
  • Downsample images only when needed. Opus 4.7 processes higher-res images automatically, which costs more tokens per image. Workloads that don't need the extra detail should downsample client-side before sending.
  • Run an eval pass. Especially for BrowseComp-style browsing under a specific multi-agent harness, where 4.7 can regress. A quick A/B on a held-out set catches these before production.
  • Apply to the Cyber Verification Program if relevant. Legitimate offensive-security work (vulnerability research, pen-testing, red-teaming) on 4.7 requires the new program to avoid default refusals. Not needed for defensive or general-purpose workloads.
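For the eval pass, a paired A/B over a held-out set is enough to catch harness-sensitive regressions. A minimal sketch with the grader left as a stub:

```python
# Paired A/B eval: score both models on the same held-out cases and
# tally wins / losses / ties. The scoring functions are stubs; plug in
# your own grader (exact match, LLM judge, test-suite pass rate, etc.).
from typing import Callable

def ab_eval(score_old: Callable[[str], float],
            score_new: Callable[[str], float],
            held_out: list[str]) -> dict[str, int]:
    tally = {"win": 0, "loss": 0, "tie": 0}
    for case in held_out:
        old_s, new_s = score_old(case), score_new(case)
        if new_s > old_s:
            tally["win"] += 1
        elif new_s < old_s:
            tally["loss"] += 1
        else:
            tally["tie"] += 1
    return tally
```

A loss count concentrated in one task family (say, browsing) is exactly the BrowseComp-style signal to catch before flipping the flag at scale.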

Anthropic's official migration guide includes concrete prompt-tuning advice and token-usage comparisons per effort level.


When to Upgrade, When to Stay

| Your workload | Recommendation |
| --- | --- |
| Agentic coding (Claude Code, Cursor, Devin-style) | Upgrade. Largest gains: SWE-bench Pro +10.9pp, MCP-Atlas +14.6pp; self-verification cuts confident-but-wrong reports. |
| Document analysis, charts, dense screenshots | Upgrade. CharXiv-R +13.4pp without tools; 3.3× vision resolution unblocks previously cropped inputs. |
| Computer-use / browser agents | Upgrade with an eval pass. OSWorld +5.3pp is strong, but BrowseComp −4.7pp flags harness-sensitive regressions; A/B on your real flows first. |
| Reasoning, knowledge, math | Upgrade. GPQA +2.9pp, HLE no-tools +6.9pp. Marginal on MMMLU (+0.4pp) but never worse. |
| Long-horizon autonomous workflows | Upgrade. File-system memory and self-verification are the biggest operational wins; partners report "longer autonomous runs" as the top behavioral change. |
| Offensive-security research | Apply to the Cyber Verification Program. Default 4.7 will refuse more here; the verified program exists for legitimate pen-testing use. |
| Fixed token-budget pipelines | Re-measure first. The tokenizer change can shift spend up to 35%; either re-cap budgets or switch to dynamic metering before upgrading. |
| Prompts that rely on loose interpretation | Audit first. 4.7 reads instructions more literally; re-tune or stay on 4.6 until prompts are reviewed. |

For the full announcement, see Anthropic's Opus 4.7 launch post and system card. For a deeper breakdown of the release itself, see our Opus 4.7 launch overview.


Frequently Asked Questions

  • Is Opus 4.7 better than Opus 4.6? Yes. Opus 4.7 wins on 12 of 14 reported benchmarks, with the largest jumps on MCP-Atlas (+14.6pp), CharXiv-R (+13.6pp), and SWE-bench Pro (+10.9pp). The two losses are BrowseComp (−4.7pp, sensitive to harness choice) and CyberGym (−0.7pp, intentionally flat). It also adds a new xhigh effort level, 3.3× higher-resolution vision, and self-verification on long agentic tasks.
  • Did pricing change? No. Per-token pricing is identical: $5 per 1M input tokens and $25 per 1M output tokens (or $10 / $37.50 above 200K-token prompts). Opus 4.7 uses an updated tokenizer that can map the same text to 1.0–1.35× more tokens, and it thinks more at higher effort levels. On the other side of the ledger, early-access testing shows low-effort 4.7 matches medium-effort 4.6 quality, so cost per completed task typically drops.
  • Is Opus 4.7 a drop-in replacement for 4.6? For most workloads, yes — but run an eval pass first. Three things to watch: (1) prompts written for looser instruction-following can misfire because 4.7 follows instructions more literally; (2) token budgets built on 4.6 need re-measuring because of the tokenizer change; (3) if you depend on BrowseComp-style browsing under a specific harness, test before flipping the flag at scale.
  • What effort levels does Opus 4.7 support? Opus 4.6 exposed low / medium / high / max. Opus 4.7 adds xhigh between high and max, giving a middle ground for reasoning-heavy work without the full latency of max. Claude Code now defaults to xhigh for all plans.
  • Did the tokenizer change? Yes. Opus 4.7 ships with an updated tokenizer: the same input text maps to roughly 1.0–1.35× more tokens than on Opus 4.6. Code and technical text land near the lower end of that range; heavily structured or multilingual content near the upper end. Cost forecasts built against 4.6 should be re-measured on real traffic before flipping the model flag in production.
  • Does Opus 4.7 handle higher-resolution images? Yes. Opus 4.7 accepts images up to 2,576 pixels on the long edge (~3.75 megapixels), versus ~1,568 pixels (~1.15 MP) on prior Claude models — 3.3× more pixel area. This is a model-level change applied automatically through the vision API, so images sent to Claude are processed at higher fidelity without any code change.
  • When should I stay on Opus 4.6? Rare cases: workloads that rely on 4.6's tokenizer for fixed budgeting, apps tuned to BrowseComp-style browsing under a specific multi-agent harness, or pipelines whose prompts lean on loose instruction interpretation. For everything else, 4.7 is the better default at the same per-token price.
