GPT-5.5 vs Claude Opus 4.7: Pricing, Speed, Benchmarks
I compared GPT-5.5 against Claude Opus 4.7 on every shared benchmark. Opus 4.7 leads on 6 of 10, GPT-5.5 on 4, with margins between 2 and 13 points. Pricing, time-to-first-token, throughput, and context-window behavior are all laid out below.

Shared Benchmarks
Ten shared benchmarks: Opus 4.7 is ahead on 6, GPT-5.5 on 4.
Within seven days, I had two new frontier models to compare against the workloads I run for LLM Stats: Claude Opus 4.7 shipped on April 16, 2026, and GPT-5.5 on April 23. Both land at the same input price. Both ship 1M-token context. Both pitch significantly better behavior on long-running agentic work. The benchmark numbers don't pick a winner the way I expected. They pick a workload. This is the head-to-head: every shared benchmark, the pricing where they actually diverge, the latency story for each, and a clear rule for which model I'd default to per use case. For the live, structured side-by-side with all benchmarks, pricing tiers, and provider details, see the Claude Opus 4.7 vs GPT-5.5 comparison page.
The Verdict
On the 10 benchmarks both providers report, Opus 4.7 leads on 6 and GPT-5.5 leads on 4. The leads cluster by category, not by overall quality: Opus 4.7 is ahead on the reasoning-heavy and review-grade tests (GPQA Diamond, HLE with and without tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1). GPT-5.5 is ahead on the long-running tool-use tests (Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, CyberGym). Margins are mostly between 2 and 13 percentage points, and every score is self-reported at each provider's high reasoning tier — comparable in shape, not in methodology.
Pricing diverges only on output. Both charge $5 per 1M input tokens on the standard tier. GPT-5.5 is $30 per 1M output; Opus 4.7 is $25 per 1M output, doubling above 200K-token prompts. Output dominates frontier-model spend, so per-token GPT-5.5 is roughly 20% more on output at matched effort. Token efficiency, retry rates, and Opus's long-prompt surcharge can flip the per-task cost in either direction depending on workload.
Latency profiles differ. In our serving data, Opus 4.7 streams its first token in ~0.5 seconds, against GPT-5.5's ~3-second baseline (inherited from GPT-5.4 per OpenAI's launch post). Per-token throughput is closer: ~42 tps for Opus, while GPT-5.5 reports the same per-token speed as 5.4 (~50 tps in our reference data). For interactive surfaces, the TTFT gap is the dominant variable. For long runs, GPT-5.5's lower token-per-task count tends to close the wall-clock gap.
Side-by-Side at a Glance
The commercial surface is closer than I'd expected for a cross-vendor comparison: same input price, same context window, same modalities, same standard tier batch discount. Where they diverge is reasoning effort controls, vision resolution, and the long-prompt surcharge.
| Spec | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Provider | OpenAI | Anthropic |
| Release date | Apr 23, 2026 | Apr 16, 2026 |
| Model ID | gpt-5.5 | claude-opus-4-7 |
| Input / output (≤200K) | $5 / $30 per 1M | $5 / $25 per 1M |
| Input / output (>200K) | $5 / $30 per 1M (flat) | $10 / $37.50 per 1M |
| Context window (input / output) | 1M / 128K | 1M / 128K |
| Modalities | Text + image, text out | Text + image (~3.75 MP), text out |
| Reasoning controls | xhigh effort tier | low / medium / high / xhigh / max |
| Batch / Flex tier | 0.5× standard | 0.5× standard |
| Self-verification on agents | Implicit (Codex tuning) | Explicit (Plan → Execute → Verify → Report) |
| Pro / max-effort variant | GPT-5.5 Pro ($30 / $180) | Opus 4.7 max effort tier |
| Available in our proxy | API not yet live | Yes |
Pricing: Same Input, Different Output
Both models charge $5 per 1M input tokens. Output is where they diverge. GPT-5.5 sits at $30 per 1M output, Opus 4.7 at $25. Output tokens dominate frontier-model spend, so GPT-5.5 is ~20% more on output at matched effort. The picture flips above 200K-token prompts: Opus 4.7 doubles to $10 / $37.50, while GPT-5.5 holds the standard rate flat.
The Per-Token Bill
Per 1M tokens, standard tier: same input price, 20% more on output for GPT-5.5, and diverging behavior above 200K-token prompts.
| Line item | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Input price (≤200K) | $5 | $5 |
| Output price (≤200K) | $30 | $25 |
| Above 200K tokens | Flat at standard rate | 2× ($10 / $37.50) |
What actually moves the bill is rarely the sticker. It's the tokens consumed per finished task. OpenAI claims GPT-5.5 uses noticeably fewer tokens to complete the same Codex tasks than 5.4, with fewer retries on ambiguous failures. Anthropic's pitch is parallel: low-effort Opus 4.7 matches medium-effort Opus 4.6 on quality, and the model self-verifies before reporting back, which cuts confident-but-wrong reruns. Both providers are selling token efficiency on top of per-token price.
Two pricing levers on each side are worth knowing:
- GPT-5.5 Batch / Flex runs at 0.5× standard. $2.50 / $15 per 1M, equal to GPT-5.4's standard rate, with the higher capability bundled in. Best for offline pipelines.
- GPT-5.5 Priority is 2.5× standard. $12.50 / $75 per 1M. Useful for time-critical interactive endpoints.
- Opus 4.7 prompt caching discounts cached prefixes. The Anthropic API caches repeated long system prompts at a reduced input rate, which is the lever that moves the bill most for workloads with a stable preamble across many requests.
- Opus 4.7 above 200K is 2× the price. If your prompts routinely cross 200K tokens, factor that into TCO. GPT-5.5 doesn't have an equivalent step.
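The per-task math is simple enough to sketch. A minimal example using the standard-tier list prices above; the token counts are hypothetical inputs, since real counts per finished task vary by workload and by each model's token efficiency:

```python
def request_cost(in_tokens: int, out_tokens: int, model: str) -> float:
    """Estimate USD cost of one request at the standard-tier list prices above."""
    m = 1_000_000
    if model == "gpt-5.5":
        # Flat $5 / $30 per 1M at any prompt length.
        return in_tokens / m * 5 + out_tokens / m * 30
    if model == "opus-4.7":
        # $5 / $25 per 1M, doubling to $10 / $37.50 above 200K-token prompts.
        if in_tokens > 200_000:
            return in_tokens / m * 10 + out_tokens / m * 37.50
        return in_tokens / m * 5 + out_tokens / m * 25
    raise ValueError(f"unknown model: {model}")

# 50K-token prompt, 8K-token answer: Opus is cheaper per token.
gpt_short = request_cost(50_000, 8_000, "gpt-5.5")     # 0.25 + 0.24 = $0.49
opus_short = request_cost(50_000, 8_000, "opus-4.7")   # 0.25 + 0.20 = $0.45
# A 400K-token prompt flips it: the surcharge outweighs the output discount.
gpt_long = request_cost(400_000, 8_000, "gpt-5.5")     # 2.00 + 0.24 = $2.24
opus_long = request_cost(400_000, 8_000, "opus-4.7")   # 4.00 + 0.30 = $4.30
```

Token efficiency then multiplies on top of this: if one model needs 30% fewer output tokens or one fewer retry to finish the task, that swamps the sticker gap.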
Latency, Speed, Throughput
The two models trade on different parts of the latency curve. In our serving data, Opus 4.7 streams its first token in roughly 0.5 seconds at ~42 tps. GPT-5.5 reports the same per-token latency as GPT-5.4, which sits around a 3 second TTFT and ~50 tps in our reference data. The headline tradeoff is TTFT vs total tokens: Opus starts streaming sooner; GPT-5.5 reports fewer tokens to finish a comparable task.
Latency Profiles
For one representative response, time to complete breaks down as: Opus 4.7 streams its first token in ~0.5s against GPT-5.5's ~3.0s baseline, a roughly 6× first-token advantage. Throughput is closer at ~42 vs ~50 tokens per second on the standard tier, and GPT-5.5 finishes Codex tasks in fewer tokens than 5.4.
| Latency profile | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Per-token latency | Matches GPT-5.4 (per OpenAI launch post) | Streaming throughput ~42 tps in our serving data |
| Time-to-first-token | ~3s baseline (GPT-5.4 reference) | ~0.5s in our serving data |
| Faster mode | Codex Fast: 1.5× tokens/s for 2.5× cost | Effort-tier control (low → max) |
| Hardware | NVIDIA GB200 + GB300 NVL72 | Anthropic-managed serving stack |
| Long-run wall-clock | Lower retry rate, fewer tokens per task | Self-verification cuts double-reports |
The way I'd translate this for a real product: if you're building an IDE assistant or a chat surface where users care about how fast the first word appears, Opus 4.7's sub-second TTFT wins. If you're running an autonomous coding agent that has to plan, execute tools, recover from errors, and report a complete result, GPT-5.5's token efficiency tends to win end-to-end wall-clock even at slower individual generation.
Both providers offer levers that change the speed/cost tradeoff: GPT-5.5 exposes Codex Fast mode at 1.5× tokens-per-second for 2.5× cost. Opus 4.7 exposes a five-level effort tier (low / medium / high / xhigh / max) where lower effort returns sooner with less reasoning, higher effort thinks longer and uses more tokens. Different surface areas, same underlying tradeoff.
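The TTFT-vs-token-count tradeoff reduces to simple arithmetic. A sketch using the ~0.5s/~3s TTFT and ~42/~50 tps figures above; the token counts are hypothetical and stand in for whatever your workload actually generates:

```python
def wall_clock(ttft_s: float, tps: float, tokens: int) -> float:
    """Seconds from request to last token: first-token wait plus streaming time."""
    return ttft_s + tokens / tps

# Short interactive reply: the TTFT gap dominates, and Opus finishes first.
opus_short = wall_clock(0.5, 42, 300)     # ~7.6s
gpt_short = wall_clock(3.0, 50, 300)      # ~9.0s

# Long agentic run where GPT-5.5 needs fewer tokens to finish: the gap flips.
opus_long = wall_clock(0.5, 42, 60_000)   # ~1,429s
gpt_long = wall_clock(3.0, 50, 48_000)    # ~963s
```

The crossover point depends entirely on how many tokens each model burns to finish your task, which is why the per-task token counts matter more than either headline number.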
Context Window: Both 1M, Different Behavior
Both models advertise a 1M-token input context window. The headline matches; the long-context behavior doesn't fully line up because the providers report on different evaluations.
| Surface | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Input context window | 1,000,000 tokens | 1,000,000 tokens |
| Output max | 128,000 tokens | 128,000 tokens |
| Long-context retrieval (Graphwalks BFS >128K) | 73.7% at 256K · 45.4% at 1M | Not reported |
| Long-context recall (MRCR v2 8-needle, 512K-1M) | 74.0% | Not reported |
| Pricing above 200K | Flat at standard rate | 2× standard ($10 / $37.50 per 1M) |
| App-side surface | Codex caps at 400K | Claude Code uses xhigh effort default |
If your workload routinely sits in the 256K-1M range, GPT-5.5 is the safer default for two reasons: published recall at the long end, and flat pricing past 200K. Opus 4.7 is fully capable at long context, but Anthropic doesn't publish directly comparable retrieval scores, so there's less ground truth before you run your own evaluation.
Benchmark Head-to-Head
Every score below is self-reported by the provider that ships the model, and every benchmark name links to its live leaderboard on LLM Stats. Sources: OpenAI's GPT-5.5 launch post and Anthropic's Opus 4.7 launch post. Methodologies aren't identical (different harnesses, different tool configurations), so treat magnitudes as directional rather than precise. The shape of the wins is consistent across runs.
Coding & agentic loops
| Benchmark | GPT-5.5 | Opus 4.7 | Lead |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 69.4% | GPT +13.3 |
| SWE-Bench Pro | 58.6% | 64.3% | Opus +5.7 |
| OSWorld-Verified | 78.7% | 78.0% | GPT +0.7 |
Terminal-Bench is the largest swing in either direction (+13.3pp for GPT-5.5), on a benchmark that scores unattended shell-driven tasks where the model has to plan, execute, recover from failed commands, and verify its own state. SWE-Bench Pro moves the other way (+5.7pp for Opus 4.7), on real-repo PR-style tasks where a single careful patch matters more than loop length. OSWorld lands inside a percentage point of a tie. Pick the benchmark closest to your actual deployment shape and the answer follows.
Reasoning & knowledge
| Benchmark | GPT-5.5 | Opus 4.7 | Lead |
|---|---|---|---|
| GPQA Diamond | 93.6% | 94.2% | Opus +0.6 |
| HLE (no tools) | 41.4% | 46.9% | Opus +5.5 |
| HLE (with tools) | 52.2% | 54.7% | Opus +2.5 |
The HLE no-tools margin (+5.5pp) is the most informative entry in the table because it isolates the model's reasoning from any tool-use scaffolding. GPQA Diamond is approaching the ceiling on both models, so the 0.6pp gap there is inside the noise of a single seed. On graduate-level single-question work where the answer matters more than the path the model took to get there, Opus 4.7 is the better default.
Web, browsing, agents
| Benchmark | GPT-5.5 | Opus 4.7 | Lead |
|---|---|---|---|
| BrowseComp | 84.4% | 79.3% | GPT +5.1 |
| MCP Atlas | 75.3% | 77.3% | Opus +2.0 |
| FinanceAgent v1.1 | 60.0% | 64.4% | Opus +4.4 |
| CyberGym | 81.8% | 73.1% | GPT +8.7 |
GPT-5.5 leads on BrowseComp (+5.1pp) and CyberGym (+8.7pp), the two benchmarks closest to OpenAI's framing of 5.5 as their strongest autonomous-loop model. Opus 4.7 leads on MCP Atlas (+2.0pp) and FinanceAgent v1.1 (+4.4pp), which align with Anthropic's emphasis on self-verification before reporting back on long-horizon tasks. The margins are small enough on three of these four that I would re-run either side's number on a held-out task set before treating any single benchmark as decisive.
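To keep the tally honest as scores get re-run, I keep the shared-benchmark table as data and compute the leads rather than counting by hand. A sketch with the self-reported numbers from the tables above:

```python
# benchmark: (GPT-5.5, Opus 4.7), self-reported at each provider's high tier
SCORES = {
    "Terminal-Bench 2.0": (82.7, 69.4),
    "SWE-Bench Pro": (58.6, 64.3),
    "OSWorld-Verified": (78.7, 78.0),
    "GPQA Diamond": (93.6, 94.2),
    "HLE (no tools)": (41.4, 46.9),
    "HLE (with tools)": (52.2, 54.7),
    "BrowseComp": (84.4, 79.3),
    "MCP Atlas": (75.3, 77.3),
    "FinanceAgent v1.1": (60.0, 64.4),
    "CyberGym": (81.8, 73.1),
}

gpt_leads = [b for b, (g, o) in SCORES.items() if g > o]
opus_leads = [b for b, (g, o) in SCORES.items() if o > g]
margins = {b: round(abs(g - o), 1) for b, (g, o) in SCORES.items()}
# len(gpt_leads) == 4, len(opus_leads) == 6;
# largest swing is Terminal-Bench 2.0 at 13.3pp.
```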
Vision: 3.75 MP vs Standard
Opus 4.7 reads images at up to 2,576 pixels on the long edge (~3.75 megapixels), roughly 3.3× the ~1,568 px (~1.15 MP) envelope of prior Claude models. The scores align: Opus 4.7 reports 91.0% on CharXiv-R with tools and 82.1% without. GPT-5.5 supports image input but holds the GPT-5.4 envelope, reporting 81.2% on MMMU Pro without tools and 83.2% with.
| Vision capability | GPT-5.5 | Claude Opus 4.7 |
|---|---|---|
| Max image resolution | GPT-5.4-class (~1.15 MP) | ~3.75 MP (2,576 px long edge) |
| Best chart-reading score | MMMU Pro 81.2% / 83.2% (with tools) | CharXiv-R 91.0% with tools, 82.1% without |
| Best for | Standard image inputs | Dense screenshots, diagrams, IDE captures |
For workloads built around dense visual input (computer-use agents reading full-resolution screenshots, financial analysis on charts, document extraction from scans), Opus 4.7 is the right default. For typical text-plus-image workloads, both models clear the bar.
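When feeding screenshots to either model, the practical step is downscaling client-side so you control the resampling rather than letting the API do it. A sketch of the fit math only, assuming the ~2,576 px and ~1,568 px long-edge caps discussed above (the exact caps and resize behavior are each provider's to document):

```python
def fit_to_long_edge(width: int, height: int, max_long_edge: int) -> tuple[int, int]:
    """Scale (width, height) down so the longer side is at most max_long_edge,
    preserving aspect ratio; returns the original size if it already fits."""
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)

# A 2560x1440 IDE capture fits a 2,576 px envelope untouched...
opus_size = fit_to_long_edge(2560, 1440, 2576)   # (2560, 1440)
# ...but a 1,568 px cap forces a ~1.6x downscale, which is where small
# editor text and chart labels get lost.
prior_size = fit_to_long_edge(2560, 1440, 1568)  # (1568, 882)
```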
Which Model for Which Workload
The rest of this comparison resolves into one decision per workload. Below is the matrix I actually use when picking which API to point a new product surface at.
The Decision Matrix
Six workloads · One pick each
Six workloads. Six default picks, by category.
Agentic coding loops (pick: GPT-5.5)
When the model has to drive a terminal, run tests, and recover from its own bad calls.
- Terminal-Bench 2.0 — 82.7% vs 69.4%
- OSWorld-Verified — 78.7% vs 78.0%
- Codex tuned to use fewer tokens per task
Real-world software engineering on a repo (pick: Claude Opus 4.7)
When the goal is a clean PR against a non-trivial codebase, not just a green test in a sandbox.
- SWE-Bench Pro — 64.3% vs 58.6%
- MCP Atlas — 77.3% vs 75.3%
- FinanceAgent v1.1 — 64.4% vs 60.0%
Hard reasoning, math, science (pick: Claude Opus 4.7)
When you need the right answer the first time, on a graduate-level question.
- GPQA Diamond — 94.2% vs 93.6%
- HLE no tools — 46.9% vs 41.4%
- HLE with tools — 54.7% vs 52.2%
Long-running web research and browsing (pick: GPT-5.5)
When the agent has to read pages, follow links, and synthesize across messy sources.
- BrowseComp — 84.4% vs 79.3%
- CyberGym — 81.8% vs 73.1%
- Matched per-token latency on long horizons
Dense screenshots, diagrams, charts (pick: Claude Opus 4.7)
When the input is a 1440p IDE capture, an architecture diagram, or a financial chart.
- CharXiv-R with tools — 91.0% vs MMMU Pro 83.2%
- Image input up to 2,576 px on the long edge (~3.75 MP)
- Per Anthropic, ~3.3× the resolution of prior Claude models
Cost-per-task at scale (pick: GPT-5.5)
When you bill thousands of completions a day and per-token math actually matters.
- Codex tuned to use fewer tokens per finished task
- Batch / Flex tier at 0.5× standard pricing
- Flat output pricing past 200K tokens
A few cross-cutting rules of thumb on top of the matrix:
- If the model is going to be reviewed by a human (legal briefs, scientific writeups, financial analysis), default to Opus 4.7. Better single-shot accuracy, better self-verification, better fit for review-grade output.
- If the model is going to drive a tool loop unattended (terminal automation, data pipelines, multi-step web research), default to GPT-5.5. Better Terminal-Bench, better BrowseComp, lower retry rate per task.
- If the workload is latency-sensitive on the first token (chat surfaces, IDE assistants), default to Opus 4.7. Sub-second TTFT tends to feel snappier even at slower sustained throughput.
- If the workload includes dense visual inputs (screenshots, diagrams, IDE captures, financial charts), default to Opus 4.7. The 3.3× resolution advantage is real and shows up in CharXiv-R.
- If the workload routinely exceeds 200K tokens, default to GPT-5.5. Flat output pricing past 200K vs Opus 4.7's 2× surcharge shifts the per-task economics.
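The rules of thumb above collapse into a small routing function. A sketch only; the workload flags are my own names for how you might tag requests, not anything either API exposes:

```python
def pick_model(
    prompt_tokens: int = 0,
    dense_visual_input: bool = False,
    unattended_tool_loop: bool = False,
    human_reviewed: bool = False,
    ttft_sensitive: bool = False,
) -> str:
    """Default-model routing per the cross-cutting rules above; earlier rules win."""
    if prompt_tokens > 200_000:
        return "gpt-5.5"    # flat pricing past 200K vs Opus 4.7's 2x surcharge
    if dense_visual_input:
        return "opus-4.7"   # ~3.75 MP envelope, CharXiv-R lead
    if unattended_tool_loop:
        return "gpt-5.5"    # Terminal-Bench / BrowseComp leads, fewer retries
    if human_reviewed or ttft_sensitive:
        return "opus-4.7"   # single-shot accuracy, sub-second TTFT
    return "opus-4.7"       # review-grade default when nothing else decides

pick_model(prompt_tokens=500_000)          # "gpt-5.5"
pick_model(unattended_tool_loop=True)      # "gpt-5.5"
pick_model(human_reviewed=True)            # "opus-4.7"
```

The rule ordering encodes a judgment call: the 200K pricing step comes first because it changes the bill on every request, while the quality rules only change outcomes on some.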
Other Frontier Options
GPT-5.5 and Opus 4.7 aren't the only frontier-tier choices in April 2026. Two close alternatives worth keeping in the picture, depending on how extreme your accuracy or budget constraint is:
| Model | Why it's in the conversation | Compare |
|---|---|---|
| GPT-5.5 Pro | $30 / $180 per 1M. Reach for it on hardest-question single-shot accuracy where the answer can't recover from a wrong first try. | llm-stats.com/models/gpt-5.5-pro |
| Claude Opus 4.6 | Same per-token price as 4.7, with mature serving stack, no tokenizer shift. The right default if you can't spend the eval cycles on a migration right now. | Opus 4.7 vs 4.6 |
| GPT-5.4 | Half the per-token price of GPT-5.5 ($2.50 / $15). The right default for high-volume saturated workloads (summarization, classification, extraction) where 5.4 already lands inside the capability envelope. | GPT-5.5 vs GPT-5.4 |
For full structured benchmark data, see the model pages on GPT-5.5 and Claude Opus 4.7, or go straight to the live Claude Opus 4.7 vs GPT-5.5 comparison on LLM Stats. Primary sources: the GPT-5.5 announcement, the Opus 4.7 announcement, and the OpenAI pricing page.
Questions
Frequently Asked Questions
Which model wins the benchmark head-to-head?
On the 10 benchmarks both providers report, Opus 4.7 leads on 6 (GPQA Diamond, HLE no tools, HLE with tools, SWE-Bench Pro, MCP Atlas, FinanceAgent v1.1) and GPT-5.5 leads on 4 (Terminal-Bench 2.0, BrowseComp, OSWorld-Verified, CyberGym). Opus 4.7's leads cluster on reasoning-heavy and review-grade tests; GPT-5.5's leads cluster on long-running tool-use and shell-driven tasks. The right one depends on the workload, not on a single overall ranking.
How does pricing compare?
Both list at $5 per 1M input tokens on the standard tier. GPT-5.5 is $30 per 1M output tokens, Opus 4.7 is $25 per 1M output. Above 200K tokens, Opus 4.7 doubles to $10 / $37.50, while GPT-5.5 holds a flat rate at the standard tier and offers Batch / Flex at 0.5× of standard, which Opus matches with its own batch tier.
Do both models have a 1M-token context window?
Yes. Both ship a 1,000,000-token input context window and 128K output tokens on the standard API. Long-context behavior differs in practice: GPT-5.5 self-reports 73.7% on Graphwalks BFS at 256K and 45.4% at 1M, while Opus 4.7 doesn't publish a directly comparable long-context score. If 256K-1M traffic is core to your workload, run your own retrieval evaluation before picking.
Which model is faster?
It depends which part of the latency curve you care about. In our serving data, Opus 4.7 has the lower time-to-first-token (~0.5s vs the ~3s GPT-5.4 baseline that GPT-5.5 inherits per OpenAI's launch post). Per-token throughput is closer (~42 vs ~50 tps). For interactive surfaces the TTFT gap dominates; for long autonomous runs, GPT-5.5's fewer-tokens-per-task profile tends to close the wall-clock gap.
Which model is better for coding?
It depends on the deployment shape. For unattended terminal and shell workflows, GPT-5.5 leads on Terminal-Bench 2.0 (82.7% vs 69.4%). For real-repo PR-style software engineering, Opus 4.7 leads on SWE-Bench Pro (64.3% vs 58.6%). I default to GPT-5.5 when the model is going to drive the loop end-to-end, and to Opus 4.7 when the output is a single careful patch a human is going to review.
Which model is better for vision and image input?
Opus 4.7. It accepts images up to 2,576 pixels on the long edge (~3.75 MP), roughly 3.3× the prior Claude resolution, and posts 91.0% on CharXiv-R with tools. GPT-5.5 supports image input but holds the GPT-5.4 resolution envelope. For dense screenshots, financial charts, or detailed diagrams, Opus 4.7 is the right default.
When is GPT-5.5 Pro worth it?
GPT-5.5 Pro is $30 / $180 per 1M tokens, six times base 5.5. Reach for it on hardest-question single-shot accuracy: legal review, financial analysis, scientific research where the next experiment depends on the answer. For most normal frontier workloads, base GPT-5.5 or Opus 4.7 is the better cost-quality point.