GPT-5.5 vs GPT-5.4: Pricing, Speed, Context, Benchmarks
I compared GPT-5.5 vs GPT-5.4 head-to-head: 2× the per-token price, same per-token latency in real-world serving, identical 1M-token context window, and improvements on 9 of 10 shared benchmarks. Where the upgrade pays for itself, and where 5.4 stays the better default.

Shared benchmarks, GPT-5.4 vs GPT-5.5: ten benchmarks reported by both models; GPT-5.5 improves on nine.
OpenAI released GPT-5.5 on April 23, 2026, seven weeks after GPT-5.4. I've been running both against the same Codex workloads I use to evaluate every new frontier release. The per-token price doubled. The per-token latency didn't. GPT-5.5 improves on 9 of the 10 benchmarks I can compare directly, with the largest gains on ARC-AGI-2, MCP Atlas, and Terminal-Bench 2.0. This post walks through every spec, every shared benchmark, the latency claim and what it actually means in practice, and the workload where I'd still default to 5.4. For the structured live side-by-side with full benchmark scores, pricing tiers, and provider details, see the GPT-5.4 vs GPT-5.5 comparison page.
The Verdict
The surface didn't change. Same 1M-token API context, same text and image input modalities, same Responses and Chat Completions endpoints, same Pro variant pattern. The per-token price doubled: $5 / $30 per 1M on standard, vs $2.50 / $15 for GPT-5.4. The case for paying it is two specific things. First, GPT-5.5 improves on 9 of the 10 shared benchmarks, with +11.7pp on ARC-AGI-2, +8.1pp on MCP Atlas, and +7.6pp on Terminal-Bench 2.0. Second, OpenAI says GPT-5.5 finishes the same Codex tasks with fewer tokens and at the same per-token latency as 5.4 in real-world serving. Net of token efficiency, my Codex bill on real engineering tasks moved nowhere near 2×.
Where I'd still default to 5.4: high-volume, latency-priced endpoints where 5.4 already sits inside the capability envelope — summarization, intent classification, structured extraction at scale, and anything close to the saturated benchmarks 5.5 is supposed to win on. On workloads that don't exercise the extra reasoning, the 2× sticker doesn't buy anything.
Side-by-Side at a Glance
Nothing on the surface changed. Same 1M-token API context, same image+text modalities, same Responses and Chat Completions APIs, same NVIDIA serving stack. The difference is intelligence-per-token and a fully reworked inference path that absorbs the larger model without a latency penalty.
| Spec | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Release date | Mar 5, 2026 | Apr 23, 2026 |
| Model ID | gpt-5.4 | gpt-5.5 |
| Standard input / output price | $2.50 / $15.00 per 1M | $5.00 / $30.00 per 1M |
| Batch & Flex pricing | 0.5× standard | 0.5× standard |
| Priority pricing | 2.5× standard | 2.5× standard |
| API context window | 1M input / 128K output | 1M input / 128K output |
| Codex context window | — | 400K |
| Modalities | Text + image in, text out | Text + image in, text out |
| Per-token latency (real-world serving) | Reference | Matches GPT-5.4 |
| Codex Fast mode | — | 1.5× tokens/s for 2.5× cost |
| Pro variant | GPT-5.4 Pro | GPT-5.5 Pro ($30 / $180 per 1M) |
| Serving hardware | NVIDIA GB200 NVL72 | NVIDIA GB200 + GB300 NVL72 |
Pricing: Sticker vs Bill
GPT-5.5 lists at exactly 2× the per-token price of GPT-5.4 on both sides of the meter: $5 input vs $2.50, $30 output vs $15. Output tokens dominate frontier-model spend, so the $15 → $30 step is the change that shows up first on a finance review.
Pricing at a glance:
- Per-token pricing (standard tier, per 1M tokens): $2.50 / $15 → $5 / $30.
- Sticker delta: 2×, on both input and output.
- Tokens per Codex task: fewer — 5.5 finishes with fewer tokens and fewer retries.
- Volume discounts: Batch and Flex at 0.5× standard; Priority at 2.5×.
The per-task cost moves less than the sticker suggests, for one specific reason: OpenAI reports that GPT-5.5 reaches higher-quality outputs with fewer tokens than 5.4 across all three of its reported coding evals, and that Codex is tuned specifically so 5.5 delivers better results in fewer tokens for most users. On the Codex workloads I run most often, 5.5 also aborts retry loops sooner on ambiguous failures where 5.4 would previously keep digging. Token-efficiency gains land on the same line of the bill as the per-token rate increase, so the total ends up below 2× on real workloads.
A few details worth pricing in before you compare line items:
- Batch and Flex run at half the standard rate. If you have any large-volume offline workloads (eval grading, content backfills, batch summarization), Batch lands GPT-5.5 at $2.50 / $15 per 1M, equal to GPT-5.4's standard rate.
- Priority is 2.5× standard. Useful for time-critical interactive endpoints where queueing dominates latency.
- Codex Fast mode is 1.5× tokens/s for 2.5× cost. Rarely wins on price; usually wins on user-perceived latency in IDE settings.
- GPT-5.5 Pro is $30 / $180 per 1M. A 6× jump from base 5.5, sold on hardest-question accuracy, not throughput. See the section below for when it pays for itself.
| Tier | Input ($/1M) | Output ($/1M) | vs Standard |
|---|---|---|---|
| GPT-5.5 Standard | $5.00 | $30.00 | 1.0× |
| GPT-5.5 Batch / Flex | $2.50 | $15.00 | 0.5× |
| GPT-5.5 Priority | $12.50 | $75.00 | 2.5× |
| GPT-5.5 Pro | $30.00 | $180.00 | 6× base |
| GPT-5.4 Standard (reference) | $2.50 | $15.00 | — |
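The tier arithmetic above is mechanical enough to script. A minimal sketch with the prices from the table; the `efficiency` figure (the fraction of 5.4's token count that 5.5 needs for the same task) is an assumed placeholder you'd measure on your own workload:

```python
# Per-1M-token (input, output) prices from the tier table above.
PRICES = {
    "gpt-5.4-standard": (2.50, 15.00),
    "gpt-5.5-standard": (5.00, 30.00),
    "gpt-5.5-batch":    (2.50, 15.00),   # 0.5x standard
    "gpt-5.5-priority": (12.50, 75.00),  # 2.5x standard
    "gpt-5.5-pro":      (30.00, 180.00),
}

def task_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task at a given pricing tier."""
    in_price, out_price = PRICES[tier]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a Codex-style task that takes 200K input / 20K output on 5.4.
cost_54 = task_cost("gpt-5.4-standard", 200_000, 20_000)

# If 5.5 finishes the same task with 30% fewer tokens (an assumed
# efficiency figure), the per-task bill comes out to 1.4x, not 2x:
efficiency = 0.70
cost_55 = task_cost("gpt-5.5-standard",
                    int(200_000 * efficiency), int(20_000 * efficiency))
print(f"5.4: ${cost_54:.3f}  5.5: ${cost_55:.3f}  ratio: {cost_55 / cost_54:.2f}x")
```

The break-even point is `efficiency = 0.5`: at exactly 2× the per-token price, 5.5 has to halve the token count for the bill to stay flat; anything between 0.5 and 1.0 lands the ratio between 1× and 2×.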
Latency, Speed, Throughput
OpenAI says GPT-5.5 generates tokens at the same per-token latency as GPT-5.4 in real-world serving, despite being a meaningfully larger and more capable model. That isn't the typical generational shape, where a smarter model trades off some throughput.
Latency at a glance:
- Per-token latency (real-world serving): 1.0× — same as GPT-5.4.
- Codex Fast mode: 1.5× tokens/s at 2.5× the per-token rate.
- Priority API: 2.5× the standard rate for priority processing.
The why, from OpenAI: serving GPT-5.5 at GPT-5.4 latency required treating inference as an integrated system rather than a stack of point optimizations. GPT-5.5 was co-designed for and served on NVIDIA GB200 and GB300 NVL72 systems, and the team used Codex itself, plus GPT-5.5, to find and implement key improvements in the serving stack. The model helped tune the infrastructure that serves it.
What this means in practice for the workloads I care about:
- Interactive coding loops feel the same. If your IDE integration was tuned around 5.4's per-token cadence, switching to 5.5 doesn't change perceived latency. The wins show up as fewer iterative turns to reach the right answer.
- Long-running agents finish sooner end-to-end. Same per-token speed, fewer tokens per task, fewer retries on ambiguous failures. Wall-clock to completion drops.
- High-throughput batch jobs see the same throughput per dollar shift. Batch pricing puts 5.5 at GPT-5.4's old standard rate, with the higher capability bundled in.
- Codex Fast mode is a real lever. 1.5× tokens-per-second for 2.5× cost. Not a deal in pure cost-per-token terms. A clean win for interactive flows where waiting is what bothers users.
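The bullets above reduce to simple arithmetic: at equal tokens/s, end-to-end time scales with tokens generated. A sketch with assumed numbers — the tokens/s rate and token counts below are illustrative placeholders, not published figures:

```python
def wall_clock_s(output_tokens: int, tokens_per_s: float) -> float:
    """Seconds to generate a response, ignoring time-to-first-token."""
    return output_tokens / tokens_per_s

RATE = 100.0  # assumed tokens/s -- same for 5.4 and 5.5 per OpenAI's claim

# If an agent run emits 50K output tokens on 5.4 but 35K on 5.5
# (assumed token-efficiency gain), wall clock drops proportionally:
t_54 = wall_clock_s(50_000, RATE)             # 500 s
t_55 = wall_clock_s(35_000, RATE)             # 350 s
t_55_fast = wall_clock_s(35_000, RATE * 1.5)  # Fast mode: ~233 s at 2.5x cost
```

This is why "same per-token latency" still means shorter runs: the speedup comes entirely from the token count, and Fast mode stacks on top of it.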
Context Window: 1M vs 400K vs Codex
Both models support a 1M-token context window on the API. Codex tops out at 400K. Two surfaces, two limits, easy to confuse.
| Surface | GPT-5.4 | GPT-5.5 |
|---|---|---|
| Responses + Chat Completions API | 1M input / 128K output | 1M input / 128K output |
| Codex (Plus/Pro/Business/Enterprise/Edu/Go) | — | 400K |
| ChatGPT | Per-tier limits | Per-tier limits (Pro variant available) |
Context behavior at the long end is where 5.5's self-reported numbers really separate from 5.4. On Graphwalks BFS at the >128K bucket, 5.4 dropped to 21.4%; 5.5 reports 73.7% at 256K and 45.4% at 1M. The MRCR v2 8-needle curve, plotted across every context bucket OpenAI publishes, tells the cleaner version of the same story.
Long-context recall (MRCR v2 8-needle, GPT-5.5, by context bucket):
| Context bucket | Recall | Notes |
|---|---|---|
| 4-8K (best) | 98.1% | Short-context bucket |
| 128-256K | 87.5% | Typical enterprise long-context |
| 512K-1M | 74.0% | Deepest published bucket |
If you're running real long-context work on the API (large codebase ingestion, multi-document research, long agent traces), this is where the upgrade earns its sticker. The curve is the difference between "1M context window" as a marketing line and "1M context window" as a working capability.
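One practical way to use the recall curve: treat the published buckets as a lookup table and decide per request whether to send the full context or fall back to retrieval/chunking. A sketch — the bucket boundaries are coarsened from the published figures above, and the 0.85 threshold is an assumption you'd tune to your own tolerance:

```python
# (upper context bound in tokens, reported MRCR v2 8-needle recall)
RECALL_BUCKETS = [
    (8_000, 0.981),
    (256_000, 0.875),
    (1_000_000, 0.740),
]

def expected_recall(context_tokens: int) -> float:
    """Look up the published recall for the bucket a request falls into."""
    for upper, recall in RECALL_BUCKETS:
        if context_tokens <= upper:
            return recall
    raise ValueError("beyond the 1M context window")

def should_chunk(context_tokens: int, min_recall: float = 0.85) -> bool:
    """Fall back to retrieval when expected recall drops below the threshold."""
    return expected_recall(context_tokens) < min_recall
```

With these numbers, a 100K-token request ships whole (`should_chunk(100_000)` is `False`) while a 600K-token request triggers the fallback — which is exactly the "marketing line vs working capability" distinction in code form.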
Benchmark Deltas
Every benchmark below is self-reported and linked to the live LLM Stats leaderboard for that test. Sources: the GPT-5.4 launch post, the GPT-5.5 launch post. Both are run at the "xhigh" reasoning effort tier where reported.
Agentic coding & computer use
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 75.1% | 82.7% | +7.6 |
| SWE-Bench Pro (Public) | 57.7% | 58.6% | +0.9 |
| OSWorld-Verified | 75.0% | 78.7% | +3.7 |
| BrowseComp | 82.7% | 84.4% | +1.7 |
| MCP Atlas | 67.2% | 75.3% | +8.1 |
| Toolathlon | 54.6% | 55.6% | +1.0 |
The Terminal-Bench 2.0 jump (+7.6pp) and MCP Atlas jump (+8.1pp) line up with what OpenAI claimed at launch about agentic coding and tool use. The SWE-Bench Pro delta is small (+0.9pp), but SWE-Bench Pro is the benchmark where most frontier models cluster within a percentage point or two of each other right now; the more useful comparison there is per-task tokens consumed, which OpenAI reports as lower for 5.5 across all three of its coding evals.
Reasoning, math, science
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| ARC-AGI-1 (Verified) | 93.7% | 95.0% | +1.3 |
| ARC-AGI-2 (Verified) | 73.3% | 85.0% | +11.7 |
| GPQA Diamond | 92.8% | 93.6% | +0.8 |
| FrontierMath (T1-3) | 47.6% | 51.7% | +4.1 |
| Humanity's Last Exam (no tools) | 39.8% | 41.4% | +1.6 |
| MMMU Pro (no tools) | 81.2% | 81.2% | ±0 |
ARC-AGI-2 moves the most: +11.7pp, on a benchmark designed to resist saturation. GPQA Diamond is approaching the ceiling, and MMMU Pro is flat (no-tools); both are workloads where 5.4 was already at the practical capability frontier. FrontierMath gains +4.1pp on tiers 1-3, and a new Tier 4 number (35.4% on 5.5) opens a category that didn't exist as a separate score on 5.4's scorecard.
Specialized agents & long-tail tasks
| Benchmark | GPT-5.4 | GPT-5.5 | Delta |
|---|---|---|---|
| FinanceAgent v1.1 | 56.0% | 60.0% | +4.0 |
| Tau2-bench Telecom | 98.9% | 98.0% | −0.9 |
Tau2-bench Telecom is the only regression. It's also already at 98.9% on 5.4, which means there's ~1.1pp of headroom and 5.5 lands inside measurement noise of it. A nominal regression, not a real one.
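The "inside measurement noise" claim can be sanity-checked with a binomial standard error on the pass rate. The benchmark's task count isn't given here, so `n` below is an assumed placeholder; plug in the real count:

```python
import math

def binomial_se_pp(p: float, n: int) -> float:
    """Standard error of a pass rate p over n tasks, in percentage points."""
    return 100 * math.sqrt(p * (1 - p) / n)

# At a 98.9% pass rate over an assumed n=100 tasks, one standard error
# is ~1.0pp, so a -0.9pp move sits within one SE of the 5.4 score.
se = binomial_se_pp(0.989, 100)
print(f"SE ≈ {se:.2f}pp")
```

The same check is worth running on any sub-1pp delta in the tables above before treating it as a real capability difference.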
Benchmarks new on 5.5 (no GPT-5.4 number to compare)
| Benchmark | GPT-5.5 | Notes |
|---|---|---|
| GDPval (wins or ties) | 84.9% | OpenAI's economically-weighted task evaluation. |
| OfficeQA Pro | 54.1% | Knowledge-work tasks across business domains. |
| BixBench | 80.5% | Real-world bioinformatics data analysis. |
| GeneBench | 25.0% | Multi-stage genetics analysis under noisy data. |
| CyberGym | 81.8% | Cybersecurity vulnerability reproduction. |
What Actually Changed
The benchmark deltas explain the scoreboard. The behavioral changes explain why tasks I run regularly feel different on 5.5 even where the benchmark numbers are close.
Token efficiency on long-horizon work
OpenAI's framing: GPT-5.5 is "more efficient in how it works through problems, often reaching higher-quality outputs with fewer tokens and fewer retries." On the three coding evals OpenAI reports, GPT-5.5 improves on GPT-5.4's scores while using fewer tokens. In Codex specifically, OpenAI tuned the experience so 5.5 delivers better results with fewer tokens than 5.4 for most users. This is the single change I notice most often: agentic loops converge sooner and dig themselves in less when they're wrong.
Latency parity with a much larger model
Same per-token latency at much higher capability is how OpenAI is selling the release, and the engineering claim behind it is concrete: rebuilt inference stack, co-design with NVIDIA GB200 and GB300 NVL72 systems, Codex and 5.5 used internally to find and implement key improvements. The model helped improve the infrastructure that serves it.
Codex defaults and Fast mode
GPT-5.5 ships in Codex with a 400K context window across Plus, Pro, Business, Enterprise, Edu, and Go plans. Fast mode is a new lever: 1.5× tokens-per-second for 2.5× cost. The cost arithmetic only works in interactive scenarios where waiting cost > serving cost.
API surface unchanged, modalities unchanged
GPT-5.5 is a drop-in API replacement for 5.4. Same Responses and Chat Completions endpoints, same text and image input modalities, text output, same 1M input / 128K output ceiling. No code change beyond the model ID.
Where GPT-5.5 Pro Fits
GPT-5.5 Pro is for the prompts that can't afford to be wrong on the first try. Same context window. Same modalities. 6× the per-token price of base 5.5 ($30 input, $180 output per 1M). The pitch is more comprehensive, better-structured, more accurate responses on hard single-question work, with especially strong performance in business, legal, education, and data science.
When I'd reach for it: legal review on a long contract, financial analysis where the answer is going into a slide deck, scientific research where the next paper experiment depends on the right interpretation of a dataset. When I wouldn't: anything in a tight latency budget, anything in a tight cost budget, anything where the workflow can iterate and recover.
| Variant | Input ($/1M) | Output ($/1M) | Best for |
|---|---|---|---|
| GPT-5.5 | $5 | $30 | Default frontier workload, agentic coding, computer use |
| GPT-5.5 Pro | $30 | $180 | Hardest-question accuracy, research-grade analysis |
When to Upgrade, When to Stay
| Your workload | Recommendation |
|---|---|
| Agentic coding (Codex, Cursor, Devin-style) | Upgrade to 5.5. Terminal-Bench +7.6pp, MCP Atlas +8.1pp, fewer tokens per finished task. The largest sustained improvement across any single workload. |
| Computer-use / browser agents | Upgrade to 5.5. OSWorld +3.7pp and BrowseComp +1.7pp, with the bigger win sitting in fewer recovery loops on ambiguous pages. |
| Long-context / 256K-1M token workloads | Upgrade to 5.5. Graphwalks BFS jumps from 21.4% (5.4, >128K bucket) to 73.7% at 256K; MRCR v2 8-needle holds 74.0% at 512K-1M. This is where the upgrade pays for itself most clearly. |
| Scientific research / quantitative analysis | Upgrade to 5.5; consider 5.5 Pro on hardest tasks. FrontierMath +4.1pp, BixBench at 80.5%, HLE +1.6pp without tools. |
| High-volume summarization, classification, extraction | Stay on 5.4. 5.4 is already at or near saturation on these tasks. The 2× sticker buys nothing on workloads that aren't using the extra reasoning. |
| Customer-support style turn-by-turn (saturated benchmarks) | Stay on 5.4. Tau2-bench Telecom at 98.9% on 5.4 regresses to 98.0% on 5.5. Inside noise, but no upgrade case. |
| Hardest-question single-shot accuracy | Use 5.5 Pro. 6× sticker, but the right tool when the workflow can't recover from a wrong first answer. |
| Latency-critical interactive coding (IDE) | Upgrade to 5.5; turn on Fast mode in Codex. 1.5× tokens/s for 2.5× cost is worth it for IDE flows where wait time is user-visible. |
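The table above generalizes, but as noted in the FAQ, the honest answer for any specific codebase is an A/B run on a held-out task set. A minimal harness sketch — `run_task` is a placeholder for your own agent plumbing and grading, and the model IDs are the ones from this post:

```python
from dataclasses import dataclass

@dataclass
class Result:
    passed: bool
    output_tokens: int

def run_task(model: str, task: str) -> Result:
    """Placeholder: call your agent stack on one task and grade the result."""
    raise NotImplementedError

def ab_compare(tasks, models=("gpt-5.4", "gpt-5.5"), runner=run_task):
    """Pass rate and mean output tokens per model over a held-out task set."""
    report = {}
    for model in models:
        results = [runner(model, t) for t in tasks]
        report[model] = {
            "pass_rate": sum(r.passed for r in results) / len(results),
            "mean_output_tokens": sum(r.output_tokens for r in results) / len(results),
        }
    return report
```

Tracking `mean_output_tokens` alongside pass rate is the point: multiply it by the per-token rates from the pricing table and you get the real cost ratio for your workload, not the sticker's 2×.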
Primary sources: OpenAI's GPT-5.5 launch post, the GPT-5.4 launch post, and the OpenAI pricing page. For full structured benchmark data, see the GPT-5.5 model page, the GPT-5.4 model page, and the live GPT-5.4 vs GPT-5.5 comparison on LLM Stats.
Frequently Asked Questions
- **Is GPT-5.5 better than GPT-5.4?** On the benchmarks both models report, yes. GPT-5.5 wins on 9 of 10 shared evaluations, with the largest jumps on ARC-AGI-2 (+11.7pp), MCP Atlas (+8.1pp), and Terminal-Bench 2.0 (+7.6pp). The single regression is Tau2-bench Telecom (-0.9pp) from a 98.9% saturation point on 5.4. The bigger story sits below the benchmarks: OpenAI says GPT-5.5 uses noticeably fewer tokens to finish the same Codex tasks, and matches GPT-5.4 per-token latency in real-world serving.
- **How much more does GPT-5.5 cost?** GPT-5.5 lists at $5 per 1M input tokens and $30 per 1M output tokens on the standard API. GPT-5.4 sits at $2.50 / $15 per 1M. That is exactly 2× the per-token price, on both sides of the meter. Batch and Flex run at half the standard rate, Priority at 2.5×. OpenAI also offers GPT-5.5 Pro at $30 / $180 per 1M, a 6× step up from base 5.5 for hardest-question accuracy.
- **Is GPT-5.5 slower than GPT-5.4?** No. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving, despite being a meaningfully larger and smarter model. The trick was co-design with NVIDIA GB200 / GB300 NVL72 systems and treating inference as one integrated system. In Codex, an opt-in Fast mode generates tokens 1.5× faster for 2.5× the cost.
- **What is GPT-5.5's context window?** On the API, GPT-5.5 supports a 1,000,000-token context window, identical to GPT-5.4, with up to 128K output tokens. In Codex, the surface limit is 400,000 tokens across Plus, Pro, Business, Enterprise, Edu, and Go plans.
- **Is the upgrade worth 2× the price?** For agentic coding, computer use, and long-running reasoning, yes — most workloads will see fewer retries and shorter end-to-end runs even at the higher sticker. For simple summarization, classification, or any high-volume latency-sensitive endpoint that already lands within GPT-5.4's capability envelope, GPT-5.4 stays the better default. The right move on a real codebase is an A/B run on a held-out set with the actual prompts and actual tool calls before flipping the flag.
- **What is GPT-5.5 Pro?** GPT-5.5 Pro is the higher-accuracy variant, priced at $30 per 1M input and $180 per 1M output. It uses more reasoning compute than base 5.5 and is targeted at the hardest single-shot questions in business, legal, education, and data science. Reach for it when accuracy matters more than throughput and the answer is going to a human or to a downstream pipeline that can't recover from a mistake.
- **Does GPT-5.5 accept image inputs?** Yes. GPT-5.5 accepts both text and image inputs through the same Responses and Chat Completions APIs that GPT-5.4 used, and outputs text. Audio and video remain out of scope for the base API surface.