GPT-5.5 vs GPT-5.4: Pricing, Speed, Context, Benchmarks
I compared GPT-5.5 vs GPT-5.4 head-to-head: 2× the per-token price, same per-token latency in real-world serving, identical 1M-token context window, and improvements on 9 of 10 shared benchmarks. Where the upgrade pays for itself, and where 5.4 stays the better default.

Shared benchmarks, GPT-5.4 vs GPT-5.5: ten benchmarks reported by both models; GPT-5.5 improves on nine.
OpenAI released GPT-5.5 on April 23, 2026, seven weeks after GPT-5.4. I've been running both against the same Codex workloads I use to evaluate every new frontier release. The per-token price doubled. The per-token latency didn't. GPT-5.5 improves on 9 of the 10 benchmarks I can compare directly, with the largest gains on ARC-AGI-2, MCP Atlas, and Terminal-Bench 2.0. This post walks through every spec, every shared benchmark, the latency claim and what it actually means in practice, and the workload where I'd still default to 5.4. For the structured live side-by-side with full benchmark scores, pricing tiers, and provider details, see the GPT-5.4 vs GPT-5.5 comparison page.
The Verdict
The surface didn't change. Same 1M-token API context, same text and image input modalities, same Responses and Chat Completions endpoints, same Pro variant pattern. The per-token price doubled: $5 / $30 per 1M on standard, vs $2.50 / $15 for GPT-5.4. The case for paying it is two specific things. First, GPT-5.5 improves on 9 of the 10 shared benchmarks, with +11.7pp on ARC-AGI-2, +8.1pp on MCP Atlas, and +7.6pp on Terminal-Bench 2.0. Second, OpenAI says GPT-5.5 finishes the same Codex tasks with fewer tokens and at the same per-token latency as 5.4 in real-world serving. Net of token efficiency, my Codex bill on real engineering tasks moved nowhere near 2×.
Where I'd still default to 5.4: high-volume, latency-priced endpoints where 5.4 already sits inside the capability envelope — summarization, intent classification, structured extraction at scale, and anything close to the saturated benchmarks 5.5 is supposed to win on. On workloads that don't exercise the extra reasoning, the 2× sticker doesn't buy anything.
Side-by-Side at a Glance
Nothing on the surface changed. Same 1M-token API context, same image+text modalities, same Responses and Chat Completions APIs, same NVIDIA serving stack. The difference is intelligence-per-token and a fully reworked inference path that absorbs the larger model without a latency penalty.
| Spec | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Release date | Mar 5, 2026 | Apr 23, 2026 |
| Model ID | gpt-5.4 | gpt-5.5 |
| Standard input / output price | $2.50 / $15.00 per 1M | $5.00 / $30.00 per 1M |
| Batch & Flex pricing | 0.5× standard | 0.5× standard |
| Priority pricing | 2.5× standard | 2.5× standard |
| API context window | 1M input / 128K output | 1M input / 128K output |
| Codex context window | — | 400K |
| Modalities | Text + image in, text out | Text + image in, text out |
| Per-token latency (real-world serving) | Reference | Matches GPT-5.4 |
| Codex Fast mode | — | 1.5× tokens/s for 2.5× cost |
| Pro variant | GPT-5.4 Pro | GPT-5.5 Pro ($30 / $180 per 1M) |
| Serving hardware | NVIDIA GB200 NVL72 | NVIDIA GB200 + GB300 NVL72 |
Pricing: Sticker vs Bill
GPT-5.5 lists at exactly 2× the per-token price of GPT-5.4 on both sides of the meter: $5 input vs $2.50, $30 output vs $15. Output tokens dominate frontier-model spend, so the $15 → $30 step is the change that shows up first on a finance review.
Pricing at a glance:
- Per-token pricing (standard tier, per 1M tokens): $2.50 / $15 → $5 / $30.
- Sticker delta: 2×, on both input and output.
- Tokens per Codex task: fewer — 5.5 finishes with fewer tokens and fewer retries.
- Volume discounts: Batch and Flex at 0.5× standard; Priority at 2.5×.
The per-task cost moves less than the sticker suggests, for one specific reason: OpenAI reports that GPT-5.5 reaches higher-quality outputs with fewer tokens than 5.4 across all three of its reported coding evals, and that Codex is tuned specifically so 5.5 delivers better results in fewer tokens for most users. On the Codex workloads I run most often, 5.5 also aborts retry loops sooner on ambiguous failures where 5.4 would previously keep digging. Token-efficiency gains land on the same line of the bill as the per-token rate increase, so the total ends up below 2× on real workloads.
A few details worth pricing in before you compare line items:
- Batch and Flex run at half the standard rate. If you have any large-volume offline workloads (eval grading, content backfills, batch summarization), Batch lands GPT-5.5 at $2.50 / $15 per 1M, equal to GPT-5.4's standard rate.
- Priority is 2.5× standard. Useful for time-critical interactive endpoints where queueing dominates latency.
- Codex Fast mode is 1.5× tokens/s for 2.5× cost. Rarely wins on price; usually wins on user-perceived latency in IDE settings.
- GPT-5.5 Pro is $30 / $180 per 1M. A 6× jump from base 5.5, sold on hardest-question accuracy, not throughput. See the section below for when it pays for itself.
| Tier | Input ($/1M) | Output ($/1M) | vs Standard |
|---|---|---|---|
| GPT-5.5 Standard | $5.00 | $30.00 | 1.0× |
| GPT-5.5 Batch / Flex | $2.50 | $15.00 | 0.5× |
| GPT-5.5 Priority | $12.50 | $75.00 | 2.5× |
| GPT-5.5 Pro | $30.00 | $180.00 | 6× base |
| GPT-5.4 Standard (reference) | $2.50 | $15.00 | — |
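The tier arithmetic above is mechanical enough to script. A minimal sketch with the prices from the table; the `efficiency` figure (the fraction of 5.4's token count that 5.5 needs for the same task) is an assumed placeholder you'd measure on your own workload:

```python
# Per-1M-token (input, output) prices from the tier table above.
PRICES = {
    "gpt-5.4-standard": (2.50, 15.00),
    "gpt-5.5-standard": (5.00, 30.00),
    "gpt-5.5-batch":    (2.50, 15.00),   # 0.5x standard
    "gpt-5.5-priority": (12.50, 75.00),  # 2.5x standard
    "gpt-5.5-pro":      (30.00, 180.00),
}

def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at a given pricing tier."""
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a Codex-style task that takes 200K input / 20K output on 5.4.
cost_54 = task_cost("gpt-5.4-standard", 200_000, 20_000)

# If 5.5 finishes the same task with 30% fewer tokens (an assumed
# efficiency figure), the per-task bill comes out to 1.4x, not 2x:
efficiency = 0.70
cost_55 = task_cost("gpt-5.5-standard",
                    int(200_000 * efficiency), int(20_000 * efficiency))
print(f"5.4: ${cost_54:.3f}  5.5: ${cost_55:.3f}  ratio: {cost_55 / cost_54:.2f}x")
```

The break-even point is `efficiency = 0.5`: at exactly 2× the per-token price, 5.5 has to halve the token count for the bill to stay flat; anything between 0.5 and 1.0 lands the ratio between 1× and 2×.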
Latency, Speed, Throughput
OpenAI says GPT-5.5 generates tokens at the same per-token latency as GPT-5.4 in real-world serving, despite being a meaningfully larger and more capable model. That isn't the typical generational shape, where a smarter model trades off some throughput.
Latency at a glance:
- Per-token latency (real-world serving): 1.0× — same as GPT-5.4.
- Codex Fast mode: 1.5× tokens/s at 2.5× the per-token rate.
- Priority API: 2.5× the standard rate for priority processing.
The why, from OpenAI: serving GPT-5.5 at GPT-5.4 latency required treating inference as an integrated system rather than a stack of point optimizations. GPT-5.5 was co-designed for and served on NVIDIA GB200 and GB300 NVL72 systems, and the team used Codex itself, plus GPT-5.5, to find and implement key improvements in the serving stack. The model helped tune the infrastructure that serves it.
What this means in practice for the workloads I care about:
- Interactive coding loops feel the same. If your IDE integration was tuned around 5.4's per-token cadence, switching to 5.5 doesn't change perceived latency. The wins show up as fewer iterative turns to reach the right answer.
- Long-running agents finish sooner end-to-end. Same per-token speed, fewer tokens per task, fewer retries on ambiguous failures. Wall-clock to completion drops.
- High-throughput batch jobs see the same throughput per dollar shift. Batch pricing puts 5.5 at GPT-5.4's old standard rate, with the higher capability bundled in.
- Codex Fast mode is a real lever. 1.5× tokens-per-second for 2.5× cost. Not a deal in pure cost-per-token terms. A clean win for interactive flows where waiting is what bothers users.
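The bullets above reduce to simple arithmetic: at equal tokens/s, end-to-end time scales with tokens generated. A sketch with assumed numbers — the tokens/s rate and token counts below are illustrative placeholders, not published figures:

```python
def wall_clock_s(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds to generate a response, ignoring time-to-first-token."""
    return output_tokens / tokens_per_s

RATE = 100.0  # assumed tokens/s -- same for 5.4 and 5.5 per OpenAI's claim

# If an agent run emits 50K output tokens on 5.4 but 35K on 5.5
# (assumed token-efficiency gain), wall clock drops proportionally:
t_54 = wall_clock_s(50_000, RATE)             # 500 s
t_55 = wall_clock_s(35_000, RATE)             # 350 s
t_55_fast = wall_clock_s(35_000, RATE * 1.5)  # Fast mode: ~233 s at 2.5x cost
```

This is why "same per-token latency" still means shorter runs: the speedup comes entirely from the token count, and Fast mode stacks on top of it.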
Context Window: 1M vs 400K vs Codex
Both models support a 1M-token context window on the API. Codex tops out at 400K. Two surfaces, two limits, easy to confuse.
| Surface | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Responses + Chat Completions API | 1M input / 128K output | 1M input / 128K output |
| Codex (Plus/Pro/Business/Enterprise/Edu/Go) | — | 400K |
| ChatGPT | Per-tier limits | Per-tier limits (Pro variant available) |
Context behavior at the long end is where 5.5's self-reported numbers really separate from 5.4. On Graphwalks BFS at the >128K bucket, 5.4 dropped to 21.4%; 5.5 reports 73.7% at 256K and 45.4% at 1M. The MRCR v2 8-needle curve, plotted across every context bucket OpenAI publishes, tells the cleaner version of the same story.
Long-context recall (MRCR v2 8-needle, GPT-5.5, by context bucket):
| Context bucket | Recall | Notes |
|---|---|---|
| 4-8K (best) | 98.1% | Short-context bucket |
| 128-256K | 87.5% | Typical enterprise long-context |
| 512K-1M | 74.0% | Deepest published bucket |
If you're running real long-context work on the API (large codebase ingestion, multi-document research, long agent traces), this is where the upgrade earns its sticker. The curve is the difference between "1M context window" as a marketing line and "1M context window" as a working capability.
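One practical way to use the recall curve: treat the published buckets as a lookup table and decide per request whether to send the full context or fall back to retrieval/chunking. A sketch — the bucket boundaries are coarsened from the published figures above, and the 0.85 threshold is an assumption you'd tune to your own tolerance:

```python
# (upper context bound in tokens, reported MRCR v2 8-needle recall)
RECALL_BUCKETS = [
    (8_000, 0.981),
    (256_000, 0.875),
    (1_000_000, 0.740),
]

def expected_recall(context_tokens: int) -> float:
    """Look up the published recall for the bucket a request falls into."""
    for upper, recall in RECALL_BUCKETS:
        if context_tokens <= upper:
            return recall
    raise ValueError("beyond the 1M context window")

def should_chunk(context_tokens: int, min_recall: float = 0.85) -> bool:
    """Fall back to retrieval when expected recall drops below the threshold."""
    return expected_recall(context_tokens) < min_recall
```

With these numbers, a 100K-token request ships whole (`should_chunk(100_000)` is `False`) while a 600K-token request triggers the fallback — which is exactly the "marketing line vs working capability" distinction in code form.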
Benchmark Deltas
Every benchmark below is self-reported and linked to the live LLM Stats leaderboard for that test. Sources: the GPT-5.4 launch post, the GPT-5.5 launch post. Both are run at the "xhigh" reasoning effort tier where reported.
Agentic coding & computer use
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 75.1% | 82.7% | +7.6 |
| SWE-Bench Pro (Public) | 57.7% | 58.6% | +0.9 |
| OSWorld-Verified | 75.0% | 78.7% | +3.7 |
| BrowseComp | 82.7% | 84.4% | +1.7 |
| MCP Atlas | 67.2% | 75.3% | +8.1 |
| Toolathlon | 54.6% | 55.6% | +1.0 |
The Terminal-Bench 2.0 jump (+7.6pp) and MCP Atlas jump (+8.1pp) line up with what OpenAI claimed at launch about agentic coding and tool use. The SWE-Bench Pro delta is small (+0.9pp), but SWE-Bench Pro is the benchmark where most frontier models cluster within a percentage point or two of each other right now; the more useful comparison there is per-task tokens consumed, which OpenAI reports as lower for 5.5 across all three of its coding evals.
Reasoning, math, science
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| ARC-AGI-1 (Verified) | 93.7% | 95.0% | +1.3 |
| ARC-AGI-2 (Verified) | 73.3% | 85.0% | +11.7 |
| GPQA Diamond | 92.8% | 93.6% | +0.8 |
| FrontierMath (T1-3) | 47.6% | 51.7% | +4.1 |
| Humanity's Last Exam (no tools) | 39.8% | 41.4% | +1.6 |
| MMMU Pro (no tools) | 81.2% | 81.2% | ±0 |
ARC-AGI-2 moves the most: +11.7pp, on a benchmark designed to resist saturation. GPQA Diamond is approaching the ceiling, and MMMU Pro is flat (no-tools); both are workloads where 5.4 was already at the practical capability frontier. FrontierMath gains +4.1pp on tiers 1-3, and a new Tier 4 number (35.4% on 5.5) opens a category that didn't exist as a separate score on 5.4's scorecard.
Specialized agents & long-tail tasks
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| FinanceAgent v1.1 | 56.0% | 60.0% | +4.0 |
| Tau2-bench Telecom | 98.9% | 98.0% | −0.9 |
Tau2-bench Telecom is the only regression. It's also already at 98.9% on 5.4, which means there's ~1.1pp of headroom and 5.5 lands inside measurement noise of it. A nominal regression, not a real one.
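The "inside measurement noise" claim can be sanity-checked with a binomial standard error on the pass rate. The benchmark's task count isn't given here, so `n` below is an assumed placeholder; plug in the real count:

```python
import math

def binomial_se_pp(p: float, n: int) -> float:
    """Standard error of a pass rate p over n tasks, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)

# At a 98.9% pass rate over an assumed n=100 tasks, one standard error
# is ~1.0pp, so a -0.9pp move sits within one SE of the 5.4 score.
se = binomial_se_pp(0.989, 100)
print(f"SE ≈ {se:.2f}pp")
```

The same check is worth running on any sub-1pp delta in the tables above before treating it as a real capability difference.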
Benchmarks new on 5.5 (no GPT-5.4 number to compare)
| Benchmark | GPT-5.5 | Notes |
|---|---|---|
| GDPval (wins or ties) | 84.9% | OpenAI's economically-weighted task evaluation. |
| OfficeQA Pro | 54.1% | Knowledge-work tasks across business domains. |
| BixBench | 80.5% | Real-world bioinformatics data analysis. |
| GeneBench | 25.0% | Multi-stage genetics analysis under noisy data. |
| CyberGym | 81.8% | Cybersecurity vulnerability reproduction. |
What Actually Changed
The benchmark deltas explain the scoreboard. The behavioral changes explain why tasks I run regularly feel different on 5.5 even where the benchmark numbers are close.
Token efficiency on long-horizon work
OpenAI's framing: GPT-5.5 is "more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries." On the three coding evals OpenAI reports, GPT-5.5 improves on GPT-5.4's scores while using fewer tokens. In Codex specifically, OpenAI tuned the experience so 5.5 delivers better results with fewer tokens than 5.4 for most users. This is the single change I notice most often: agentic loops converge sooner and dig themselves in less when they're wrong.
Latency parity with a much larger model
Same per-token latency at much higher capability is how OpenAI is selling the release, and the engineering claim behind it is concrete: rebuilt inference stack, co-design with NVIDIA GB200 and GB300 NVL72 systems, Codex and 5.5 used internally to find and implement key improvements. The model helped improve the infrastructure that serves it.
Codex defaults and Fast mode
GPT-5.5 ships in Codex with a 400K context window across Plus, Pro, Business, Enterprise, Edu, and Go plans. Fast mode is a new lever: 1.5× tokens-per-second for 2.5× cost. The cost arithmetic only works in interactive scenarios where waiting cost > serving cost.
API surface unchanged, modalities unchanged
GPT-5.5 is a drop-in API replacement for 5.4. Same Responses and Chat Completions endpoints, same text and image input modalities, text output, same 1M input / 128K output ceiling. No code change beyond the model ID.
Where GPT-5.5 Pro Fits
GPT-5.5 Pro is for the prompts that can't afford to be wrong on the first try. Same context window. Same modalities. 6× the per-token price of base 5.5 ($30 input, $180 output per 1M). The pitch is more comprehensive, better-structured, more accurate responses on hard single-question work, with especially strong performance in business, legal, education, and data science.
When I'd reach for it: legal review on a long contract, financial analysis where the answer is going into a slide deck, scientific research where the next paper experiment depends on the right interpretation of a dataset. When I wouldn't: anything in a tight latency budget, anything in a tight cost budget, anything where the workflow can iterate and recover.
| Variant | Input ($/1M) | Output ($/1M) | Best for |
|---|---|---|---|
| GPT-5.5 | $5 | $30 | Default frontier workload, agentic coding, computer use |
| GPT-5.5 Pro | $30 | $180 | Hardest-question accuracy, research-grade analysis |
When to Upgrade, When to Stay
| Your workload | Recommendation |
|---|---|
| Agentic coding (Codex, Cursor, Devin-style) | Upgrade to 5.5. Terminal-Bench +7.6pp, MCP Atlas +8.1pp, fewer tokens per finished task. The largest sustained improvement across any single workload. |
| Computer-use / browser agents | Upgrade to 5.5. OSWorld +3.7pp and BrowseComp +1.7pp, with the bigger win sitting in fewer recovery loops on ambiguous pages. |
| Long-context / 256K-1M token workloads | Upgrade to 5.5. Graphwalks BFS jumps from 21.4% (5.4, >128K bucket) to 73.7% at 256K; MRCR v2 8-needle holds 74.0% at 512K-1M. This is where the upgrade pays for itself most clearly. |
| Scientific research / quantitative analysis | Upgrade to 5.5; consider 5.5 Pro on hardest tasks. FrontierMath +4.1pp, BixBench at 80.5%, HLE +1.6pp without tools. |
| High-volume summarization, classification, extraction | Stay on 5.4. 5.4 is already at or near saturation on these tasks. The 2× sticker buys nothing on workloads that aren't using the extra reasoning. |
| Customer-support style turn-by-turn (saturated benchmarks) | Stay on 5.4. Tau2-bench Telecom at 98.9% on 5.4 regresses to 98.0% on 5.5. Inside noise, but no upgrade case. |
| Hardest-question single-shot accuracy | Use 5.5 Pro. 6× sticker, but the right tool when the workflow can't recover from a wrong first answer. |
| Latency-critical interactive coding (IDE) | Upgrade to 5.5; turn on Fast mode in Codex. 1.5× tokens/s for 2.5× cost is worth it for IDE flows where wait time is user-visible. |
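The table above generalizes, but as noted in the FAQ, the honest answer for any specific codebase is an A/B run on a held-out task set. A minimal harness sketch — `run_task` is a placeholder for your own agent plumbing and grading, and the model IDs are the ones from this post:

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    output_tokens: int

def run_task(model: str, task: str) -> Result:
    """Placeholder: call your agent stack on one task and grade the result."""
    raise NotImplementedError

def ab_compare(tasks, models=("gpt-5.4", "gpt-5.5"), runner=run_task):
    """Pass rate and mean output tokens per model over a held-out task set."""
    report = {}
    for model in models:
        results = [runner(model, t) for t in tasks]
        report[model] = {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "mean_output_tokens": sum(r.output_tokens for r in results) / len(results),
        }
    return report
```

Tracking `mean_output_tokens` alongside pass rate is the point: multiply it by the per-token rates from the pricing table and you get the real cost ratio for your workload, not the sticker's 2×.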
Primary sources: OpenAI's GPT-5.5 launch post, the GPT-5.4 launch post, and the OpenAI pricing page. For full structured benchmark data, see the GPT-5.5 model page, the GPT-5.4 model page, and the live GPT-5.4 vs GPT-5.5 comparison on LLM Stats.
Frequently Asked Questions
- **Is GPT-5.5 better than GPT-5.4?** On the benchmarks both models report, yes. GPT-5.5 wins on 9 of 10 shared evaluations, with the largest jumps on ARC-AGI-2 (+11.7pp), MCP Atlas (+8.1pp), and Terminal-Bench 2.0 (+7.6pp). The single regression is Tau2-bench Telecom (-0.9pp) from a 98.9% saturation point on 5.4. The bigger story sits below the benchmarks: OpenAI says GPT-5.5 uses noticeably fewer tokens to finish the same Codex tasks, and matches GPT-5.4 per-token latency in real-world serving.
- **How much more does GPT-5.5 cost?** GPT-5.5 lists at $5 per 1M input tokens and $30 per 1M output tokens on the standard API. GPT-5.4 sits at $2.50 / $15 per 1M. That is exactly 2× the per-token price, on both sides of the meter. Batch and Flex run at half the standard rate, Priority at 2.5×. OpenAI also offers GPT-5.5 Pro at $30 / $180 per 1M, a 6× step up from base 5.5 for hardest-question accuracy.
- **Is GPT-5.5 slower than GPT-5.4?** No. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving, despite being a meaningfully larger and smarter model. The trick was co-design with NVIDIA GB200 / GB300 NVL72 systems and treating inference as one integrated system. In Codex, an opt-in Fast mode generates tokens 1.5× faster for 2.5× the cost.
- **What is GPT-5.5's context window?** On the API, GPT-5.5 supports a 1,000,000-token context window, identical to GPT-5.4, with up to 128K output tokens. In Codex, the surface limit is 400,000 tokens across Plus, Pro, Business, Enterprise, Edu, and Go plans.
- **Is the upgrade worth 2× the price?** For agentic coding, computer use, and long-running reasoning, yes — most workloads will see fewer retries and shorter end-to-end runs even at the higher sticker. For simple summarization, classification, or any high-volume latency-sensitive endpoint that already lands within GPT-5.4's capability envelope, GPT-5.4 stays the better default. The right move on a real codebase is an A/B run on a held-out set with the actual prompts and actual tool calls before flipping the flag.
- **What is GPT-5.5 Pro?** GPT-5.5 Pro is the higher-accuracy variant, priced at $30 per 1M input and $180 per 1M output. It uses more reasoning compute than base 5.5 and is targeted at the hardest single-shot questions in business, legal, education, and data science. Reach for it when accuracy matters more than throughput and the answer is going to a human or to a downstream pipeline that can't recover from a mistake.
- **Does GPT-5.5 accept image inputs?** Yes. GPT-5.5 accepts both text and image inputs through the same Responses and Chat Completions APIs that GPT-5.4 used, and outputs text. Audio and video remain out of scope for the base API surface.