On April 23, 2026, OpenAI released GPT-5.5 — its first retrained base model since GPT-4.5 — and immediately drew a line in the sand: this is not just an upgrade. It’s a redefinition of what it means for an AI model to do work. The model scored 82.7% on Terminal-Bench 2.0, a leap from GPT-5.4’s 75.1%, and hit 74.0% on MRCR v2 at one million tokens, more than double its predecessor’s 36.6%. But here’s the catch: it costs twice as much per million tokens.
Key Takeaways
- GPT-5.5 is OpenAI’s first retrained base model since GPT-4.5, co-designed with NVIDIA’s GB200 and GB300 NVL72 systems.
- It leads on Terminal-Bench 2.0 (82.7%) and SWE-Bench Pro (58.6%), with 73.1% on the internal Expert-SWE benchmark.
- API pricing doubled: $5 input / $30 output per million tokens, though OpenAI claims the effective cost increase is only about 20% thanks to efficiency gains.
- GPT-5.5 Pro costs $30/$180 per million tokens and leads BrowseComp at 90.1%, but has no reported score at all on MCP Atlas.
- Independent lab Artificial Analysis confirmed OpenAI’s token-efficiency claims — but only under controlled conditions.
The Agentic Benchmark Surge
OpenAI didn’t just tweak the weights on GPT-5.5. It rebuilt the model from the ground up to operate as an agent — one that plans, uses tools, verifies its own output, and persists across steps without constant human nudging. That shift shows up in the numbers.
The most telling benchmark is Terminal-Bench 2.0, which tests command-line workflows in a sandboxed environment. These aren’t simple queries. They involve chaining commands, parsing outputs, and adapting when permissions or dependencies fail. GPT-5.5 scored 82.7%, a significant jump from GPT-5.4’s 75.1% and far ahead of Claude Opus 4.7’s 69.4%. That 7.6-point gain isn’t noise — it’s the difference between a model that needs constant supervision and one that can run a deployment script start to finish.
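To make the shape of such a task concrete, here is a minimal sketch of the plan-execute-adapt loop a terminal agent has to run. The commands, file names, and retry heuristics are illustrative assumptions, not Terminal-Bench's actual harness or OpenAI's agent code.

```python
import subprocess

def run_step(cmd: str, timeout: int = 120) -> tuple[int, str]:
    """Run one shell command in the sandbox and capture everything it prints."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr

def run_workflow(steps: list[str], max_retries: int = 2) -> bool:
    """Chain commands, inspect each result, and adapt to common failure modes."""
    for cmd in steps:
        for attempt in range(max_retries + 1):
            code, output = run_step(cmd)
            if code == 0:
                break  # step succeeded; move on to the next command
            if "Permission denied" in output and not cmd.startswith("sudo "):
                cmd = "sudo " + cmd  # crude adaptation: retry with elevated rights
            elif "ModuleNotFoundError" in output:
                run_step("pip install -r requirements.txt")  # install missing dependencies
            if attempt == max_retries:
                return False  # out of retries; this is where a human gets paged
    return True

# Hypothetical three-step deployment of the kind the benchmark scores end to end.
run_workflow(["pip install -r requirements.txt", "pytest -q", "python deploy.py --env staging"])
```

The point of the benchmark is precisely the branches in that loop: a model that only emits the happy-path commands scores poorly.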
Then there’s SWE-Bench Pro, the gold standard for AI coding ability. GPT-5.5 solved 58.6% of GitHub issues in a single pass — up from 52.1% in the prior version. That might not sound dramatic, but in practice, it means fewer retries, less context bloat, and less engineering time spent babysitting the model.
Expert-SWE: A Benchmark Measured in Human Workdays
More interesting is Expert-SWE, an internal OpenAI benchmark where tasks have a median estimated human completion time of 20 hours. These aren’t bug fixes. They’re full-stack refactors, CI/CD pipeline redesigns, architectural migrations. GPT-5.5 scored 73.1% here, up from 68.5%. That’s not full autonomy — but it’s close enough to shift how engineering teams allocate work.
What’s striking isn’t just the score. It’s that OpenAI now treats 20-hour tasks as a baseline for evaluation. That signals a quiet pivot: the company no longer sees AI as a copilot. It sees it as a junior engineer with tenure.
The Million-Token Blind Spot
Then there’s MRCR v2 — a retrieval benchmark at one million tokens. The task? Find a single sentence buried in a document the length of four novels. GPT-5.5 scored 74.0%. GPT-5.4? 36.6%. That’s not incremental. That’s a breakthrough in long-context reasoning.
But here’s what OpenAI doesn’t say: MRCR v2 is synthetic. The documents are structured. The needle is planted cleanly. Real-world documents — messy PDFs, scanned contracts, nested JSON logs — don’t play nice. And no public benchmark captures that friction.
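The gap between MRCR-style tests and production retrieval is easier to see if you look at how such a benchmark is typically built. The snippet below is a generic needle-in-a-haystack generator written for illustration; it is an assumption about the test's structure, not OpenAI's MRCR v2 code.

```python
import random

def build_haystack(filler_sentences: list[str], needle: str, total_sentences: int) -> tuple[str, int]:
    """Assemble a long, clean synthetic document with one planted 'needle' sentence."""
    body = [random.choice(filler_sentences) for _ in range(total_sentences)]
    position = random.randrange(total_sentences)
    body[position] = needle  # the single fact the model must later retrieve
    return " ".join(body), position

filler = [
    "The quarterly report was filed on schedule.",
    "The committee reviewed the proposal without comment.",
    "Maintenance windows are announced one week in advance.",
]
needle = "The access code for the archive room is 4417."  # invented example
doc, pos = build_haystack(filler, needle, total_sentences=50_000)

# The document is uniform and well-formed, so retrieval is 'only' a context-length
# problem. Real inputs (scanned contracts, nested JSON logs) add noise, near-duplicate
# needles, and broken structure that a score like 74.0% does not account for.
print(len(doc), pos)
```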
Which makes the omission on MCP Atlas glaring. Scale AI’s tool-use benchmark tests how well models coordinate with external systems — APIs, databases, auth flows. Claude Opus 4.7 scores 79.1%. GPT-5.5? No score. Not even a placeholder.
OpenAI includes that blank cell in its own benchmark table. That’s either confidence or evasion. Maybe both. But for developers building agents that need to book flights, pull CRM data, or authenticate to legacy systems, that gap matters. An agent that can’t use tools reliably isn’t an agent. It’s a very expensive autocomplete.
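For context on what MCP Atlas is probing, here is a stripped-down, hypothetical tool-dispatch loop of the kind such agents run. It is a generic stand-in for MCP-style tool use, not the MCP wire protocol or Scale AI's harness, and the tool names and failure modes are invented for illustration.

```python
import json
from typing import Callable

# Hypothetical tool registry: the model emits a JSON tool call, the harness executes
# it, and the raw result goes back into the model's context for the next step.
TOOLS: dict[str, Callable[..., dict]] = {
    "crm_lookup": lambda customer_id: {"customer_id": customer_id, "tier": "enterprise"},
    "book_flight": lambda origin, dest, date: {"status": "error", "reason": "auth token expired"},
}

def dispatch(tool_call_json: str) -> dict:
    """Execute one model-issued tool call and return a result the model must then handle."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        return {"status": "error", "reason": f"unknown tool {call['name']}"}
    try:
        return fn(**call.get("arguments", {}))
    except TypeError as exc:  # malformed arguments are the common failure mode
        return {"status": "error", "reason": str(exc)}

# The hard part is not emitting well-formed JSON; it is recovering sensibly when the
# tool answers with something like this expired-auth error.
print(dispatch('{"name": "book_flight", "arguments": {"origin": "SFO", "dest": "JFK", "date": "2026-05-01"}}'))
```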
Twice the Price, 20% Better?
On April 24, API access went live. Input tokens: $5 per million. Output: $30. That's exactly double GPT-5.4's rates. GPT-5.5 Pro? $30 input, $180 output. At 10 million output tokens per month, the standard tier alone runs $300, well over double the roughly $125 the same output volume costs on Claude Opus 4.7.
OpenAI’s defense? Efficiency. The model completes the same Codex tasks with fewer tokens. Fewer retries. Fewer iterations. The company claims effective costs rise only 20%, not 100%. And Artificial Analysis, an independent lab, says that’s roughly accurate — under lab conditions.
But real workloads aren’t labs. They’re unpredictable. They involve legacy systems, flaky APIs, edge cases no benchmark captures. And that 20% effective cost increase assumes peak performance. What happens when the model hits a snag? When it loops? When it hallucinates a CLI command that wipes a staging server?
- Standard GPT-5.5: $5/$30 per million tokens (input/output)
- GPT-5.5 Pro: $30/$180 per million tokens
- Claude Opus 4.7: $2.50/$12.50 per million tokens
- At 10M output tokens/month: GPT-5.5 = $300, Claude Opus 4.7 = $125
- Efficiency claim: fewer tokens per completed task, enough to hold the effective cost increase near 20% (roughly confirmed by Artificial Analysis, under controlled conditions)
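To see how those list prices turn into a monthly bill, and what the efficiency claim has to mean in token terms, here is a back-of-the-envelope calculation. The 10M-output workload is the article's own reference point; the dictionary keys are labels for this sketch, not real API model names.

```python
# List prices quoted above, in USD per million tokens.
PRICES = {
    "gpt-5.5":         {"input": 5.00,  "output": 30.00},
    "gpt-5.5-pro":     {"input": 30.00, "output": 180.00},
    "claude-opus-4.7": {"input": 2.50,  "output": 12.50},
}

def monthly_bill(model: str, input_tokens: float, output_tokens: float) -> float:
    """Raw API bill for a month's token volumes (given in tokens, not millions)."""
    p = PRICES[model]
    return (input_tokens / 1e6) * p["input"] + (output_tokens / 1e6) * p["output"]

# Output-only comparison at the 10M-tokens-per-month mark used above.
for model in PRICES:
    print(model, round(monthly_bill(model, input_tokens=0, output_tokens=10e6), 2))

# OpenAI's efficiency argument, restated as arithmetic: doubling the price while
# holding the effective cost increase to ~20% requires roughly 40% fewer tokens per
# completed task, since 2.0 * (1 - 0.4) = 1.2.
token_reduction_needed = 1 - 1.2 / 2.0
print(f"token reduction implied by the claim: {token_reduction_needed:.0%}")
```

Swap in your own input/output volumes and the comparison changes quickly, which is exactly the problem with quoting a single headline figure.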
The math only works if your use case aligns perfectly with the benchmarks — terminal workflows, GitHub issues, clean retrieval. If you’re building customer support bots, content summarizers, or translation pipelines, the premium may not pay off.
The Pro Tier: Parallel Compute as a Crutch
GPT-5.5 Pro isn’t just a bigger model. It applies additional parallel test-time compute on harder problems. That means when the model hits a wall, it doesn’t give up. It spins up extra compute to brute-force a solution.
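Parallel test-time compute is less exotic than it sounds. The sketch below shows the generic best-of-n pattern: sample several candidate solutions concurrently and keep the one a verifier scores highest. It is an assumption about how the Pro tier behaves, not a description of OpenAI's implementation, and both the solver and the verifier are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_once(problem: str, seed: int) -> str:
    """Placeholder for one independent model attempt (one full chain of billed tokens)."""
    return f"candidate solution {seed} for: {problem}"

def verify(problem: str, candidate: str) -> float:
    """Placeholder scorer: in practice, unit tests, a checker model, or self-consistency."""
    return len(candidate) % 7  # dummy score so the example runs

def best_of_n(problem: str, n: int = 8) -> str:
    """Spend n parallel attempts and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: solve_once(problem, s), range(n)))
    return max(candidates, key=lambda c: verify(problem, c))

# Every extra attempt is billed, which is why accuracy bought this way scales in cost
# roughly linearly with n at $180 per million output tokens on the Pro tier.
print(best_of_n("migrate the auth service off the legacy session store"))
```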
That's why it leads OpenAI's BrowseComp benchmark at 90.1%, a test of agentic web browsing that requires navigating dynamic pages, filling forms, and extracting structured data. But that performance comes at a cost: $180 per million output tokens. At that rate, a single complex task can run into dollars of output spend, trivial for prototypes, catastrophic at scale.
This isn’t intelligence. It’s compute use. And it suggests OpenAI’s confidence in GPT-5.5’s native reasoning is… qualified. If the model could solve these problems efficiently on its own, it wouldn’t need to burn money to get there.
Real-World Workloads Will Decide
Benchmarks are guides, not guarantees. The real test is how GPT-5.5 performs in production — under latency constraints, cost ceilings, and real user demands.
One early adopter, a fintech startup using GPT-5.5 for internal tool automation, reported a 35% drop in task completion time but a 60% increase in API spend. Another, a devtools company, saw only marginal gains in bug-fix accuracy but a spike in token usage due to verbose planning steps.
That’s the risk: OpenAI optimized GPT-5.5 for benchmarks that reward long chains of thought, tool use, and self-correction. But those behaviors consume tokens. And when you’re paying per output, verbosity isn’t intelligence. It’s waste.
What This Means For You
If you’re a developer, founder, or tech lead, GPT-5.5 demands a new calculus. You can’t just switch models and hope for savings. You need to audit your workflows: which tasks actually benefit from agentic behavior? Which ones just need fast, cheap completion?
For GitHub issue resolution, terminal automation, or long-context retrieval, GPT-5.5 may justify its cost. But for general chat, summarization, or lightweight coding, sticking with GPT-5.4 or switching to Claude Opus 4.7 could save real money. And if your agents rely on external tools, MCP Atlas’s missing score should give you pause. Don’t assume compatibility. Test it.
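One practical way to run that audit is to replay a fixed set of your own tasks through each candidate model and log success, tokens, and latency. Below is a skeletal harness along those lines; the runners are stubs you would wire to your providers' SDKs, and the numbers they return here are placeholders, not measurements.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class RunResult:
    model: str
    task_id: str
    ok: bool
    output_tokens: int
    seconds: float

def audit(tasks: dict[str, str], runners: dict[str, Callable[[str], tuple[bool, int]]]) -> list[RunResult]:
    """Replay every task against every model runner and record what it actually cost."""
    results = []
    for model, run in runners.items():
        for task_id, prompt in tasks.items():
            start = time.time()
            ok, output_tokens = run(prompt)  # a runner wraps your provider's SDK call
            results.append(RunResult(model, task_id, ok, output_tokens, time.time() - start))
    return results

# Stub runners; in practice each would call the respective API and return
# (task_succeeded, completion_tokens) read from the response's usage metadata.
runners = {
    "gpt-5.5": lambda prompt: (True, 1200),
    "claude-opus-4.7": lambda prompt: (True, 900),
}
tasks = {"issue-482": "Fix the failing pagination test in repo X."}
for r in audit(tasks, runners):
    print(r)
```

Multiply the logged output tokens by each model's price sheet and you have the only benchmark that matters for your budget.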
OpenAI is betting that smarter, more autonomous models will eventually pay for themselves in reduced human labor. But right now, the savings are theoretical. The costs are real.
Here’s the real question: if GPT-5.5 needs twice the price and parallel compute to lead benchmarks, how much of that progress is architecture — and how much is just brute force?
Sources: AI News, Artificial Analysis


