82.7%. That’s GPT-5.5’s score on Terminal-Bench 2.0, a test of command-line task execution requiring planning, tool use, and self-correction in a sandboxed environment. OpenAI released the model on April 23, 2026, calling it the company’s most capable agentic AI yet — a system designed not just to respond, but to act.
Key Takeaways
- GPT-5.5 is OpenAI’s first retrained base model since GPT-4.5, co-designed with NVIDIA’s GB200 and GB300 NVL72 rack-scale systems.
- On Terminal-Bench 2.0, GPT-5.5 scores 82.7%, outpacing GPT-5.4’s 75.1% and Claude Opus 4.7’s 69.4%.
- API pricing has doubled: $5 per million input tokens, $30 per million output tokens — exactly twice GPT-5.4’s rate.
- Despite the price hike, OpenAI claims the effective cost increase is only about 20% thanks to improved token efficiency, a claim validated by Artificial Analysis.
- GPT-5.5 has no published score on MCP Atlas, Scale AI’s tool-use benchmark, where Claude Opus 4.7 leads at 79.1%.
Agentic by Design, Not Just a Prompt Engine
OpenAI didn’t just upgrade a language model. It rebuilt one. GPT-5.5 isn’t a fine-tuned iteration — it’s a retrained base model, the first since GPT-4.5. The company says it was co-designed with NVIDIA’s GB200 and GB300 NVL72 rack-scale systems, suggesting the architecture was shaped not just by data and algorithms, but by the hardware stack it runs on.
The goal? To create a model that doesn’t just answer questions but completes tasks — autonomously. The term “agentic” appears deliberately in OpenAI’s framing, signaling a shift from chatbot to agent. This isn’t about better autocomplete. It’s about handing off work and walking away.
In practice, that means GPT-5.5 is expected to plan steps, use tools, verify outputs, and retry independently — without human prompts stitching together each phase. The model is rolling out to Plus, Pro, Business, and Enterprise users across ChatGPT and Codex, with API access going live on April 24, 2026.
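In code terms, that hand-off is a plan-act-verify-retry loop. The sketch below is a minimal illustration of the pattern, not OpenAI’s actual agent API; `model_step` is a stub standing in for a real model call.

```python
# Illustrative plan-act-verify-retry loop for an agentic workflow.
# model_step() is a placeholder, not OpenAI's GPT-5.5 interface.
import subprocess

MAX_RETRIES = 3

def model_step(goal: str, history: list[str]) -> str:
    """Stub: a real implementation would ask the model for the next
    shell command, given the goal and the transcript so far."""
    return "echo 'hello from the agent'"  # placeholder command

def run_task(goal: str) -> bool:
    history: list[str] = []
    for _ in range(MAX_RETRIES):
        cmd = model_step(goal, history)            # plan: choose the next action
        result = subprocess.run(cmd, shell=True,   # act: execute in a sandbox
                                capture_output=True, text=True)
        history.append(f"$ {cmd}\n{result.stdout}{result.stderr}")
        if result.returncode == 0:                 # verify: inspect the outcome
            return True
        # retry: the failure transcript feeds the next planning step
    return False

if __name__ == "__main__":
    print(run_task("print a greeting"))
```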
Benchmarks Tell a Split Story
OpenAI’s performance claims rest on three key benchmarks: Terminal-Bench 2.0, SWE-Bench Pro, and MRCR v2. The results are strong — but incomplete.
On Terminal-Bench 2.0, GPT-5.5 hits 82.7%, a clear lead over GPT-5.4 (75.1%) and Claude Opus 4.7 (69.4%). This isn’t just incremental. It suggests the model can reliably string together CLI commands, assess outcomes, and adjust — a core requirement for real-world automation.
On SWE-Bench Pro, which measures GitHub issue resolution, GPT-5.5 solves 58.6% of tasks in a single pass — up from GPT-5.4’s 54.2%. More notably, OpenAI introduced Expert-SWE, an internal benchmark where tasks carry a median estimated human completion time of 20 hours. There, GPT-5.5 scores 73.1%, up from 68.5%. That’s not full autonomy, but it’s closer than we’ve been.
The most dramatic jump is in long-context reasoning. On MRCR v2 — a retrieval test with one million tokens — GPT-5.5 scores 74.0%, more than double GPT-5.4’s 36.6%. Finding a needle in a 1M-token haystack isn’t just about memory; it’s about relevance filtering at scale. That kind of leap could reshape how devs use AI for codebase navigation or document analysis.
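MRCR v2’s real tasks are multi-round and more involved than this, but the shape of the problem resembles the needle-in-a-haystack probe sketched below, where `query_model` stands in for the long-context model under test.

```python
# Simplified needle-in-a-haystack probe in the spirit of long-context
# retrieval tests; the real MRCR v2 format differs.
import random

def build_haystack(n_tokens: int, needle: str) -> str:
    """Bury one distinctive token in a sea of filler."""
    words = ["lorem"] * n_tokens
    words[random.randrange(n_tokens)] = needle
    return " ".join(words)

def query_model(context: str, question: str) -> str:
    """Stub: a real harness sends the full context plus the question
    to the model under evaluation and parses its answer."""
    return next((w for w in context.split() if w.startswith("NEEDLE")), "")

needle = "NEEDLE-4271"
context = build_haystack(1_000_000, needle)   # roughly a 1M-token haystack
answer = query_model(context, "What is the needle code?")
print("recalled" if answer == needle else "missed")
```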
The Benchmark It Didn’t Take
But the data table has a gap. On MCP Atlas, Scale AI’s Model Context Protocol benchmark for tool use, GPT-5.5 has no score. Claude Opus 4.7 leads at 79.1%. OpenAI included the blank cell in its own release — a rare act of transparency, or a calculated risk?
The omission isn’t trivial. MCP Atlas tests structured, reliable tool invocation within a protocol — exactly what agentic workflows require. That OpenAI would publish a benchmark suite highlighting its wins while leaving a major tool-use test blank suggests confidence in its narrative, even with a hole in the data.
Double the Price, But Is It Worth It?
The API pricing is the unavoidable headline: $5 per million input tokens, $30 per million output tokens — exactly twice GPT-5.4’s rate. For teams running high-volume workflows, that’s an immediate red flag. At 10 million output tokens per month, GPT-5.5 costs $300 versus Claude Opus 4.7’s $250. That 20% premium only makes sense if the model actually reduces retries, rework, and task fragmentation.
OpenAI’s defense hinges on efficiency. It claims GPT-5.5 completes the same Codex tasks with fewer tokens than GPT-5.4, which by the company’s math cuts the effective cost increase to about 20% — not 100%. Notably, Artificial Analysis, an independent testing lab, has validated the claim.
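The arithmetic is easy to sanity-check: if the per-token price doubles but each task needs roughly 40% fewer tokens, the effective cost rises about 20%. The 40% figure below is an assumption chosen to reproduce OpenAI’s number; the company hasn’t published the underlying reduction.

```python
# Back-of-envelope effective-cost check. The 40% token reduction is an
# assumption picked to reproduce the "~20% effective increase" claim;
# OpenAI has not published the underlying figure.
old_output_price = 15.0   # GPT-5.4, $/M output tokens (half of GPT-5.5's $30)
new_output_price = 30.0   # GPT-5.5, $/M output tokens
token_reduction = 0.40    # assumed: fewer tokens per completed task

multiplier = (new_output_price / old_output_price) * (1 - token_reduction)
print(f"effective cost change: {multiplier - 1:+.0%}")   # prints +20%
```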
Still, “fewer tokens” doesn’t always mean “less cost” in real-world scenarios. If a model fails silently or produces over-engineered solutions, token savings vanish in debugging time. And for teams already optimizing prompt chains, the value of a more agentic model depends on whether it actually reduces handoffs — not just compresses them.
GPT-5.5 Pro: Power at a Premium
For those who need more, there’s GPT-5.5 Pro — available to Pro, Business, and Enterprise users. It’s priced at $30 per million input tokens and $180 per million output tokens, six times the output cost of standard GPT-5.5.
The differentiator? Additional parallel test-time compute on harder problems. Think of it as dynamic overprovisioning: when the model hits complexity, it allocates extra compute to work through it. The result? A 90.1% score on BrowseComp, OpenAI’s agentic web-browsing benchmark — the highest among publicly available models.
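OpenAI hasn’t said exactly how the extra compute is spent, but a common pattern behind parallel test-time compute is best-of-n sampling: launch several independent attempts and keep whichever one a verifier scores highest. A generic sketch, with both the solver and the scorer stubbed:

```python
# Best-of-n sampling: a generic form of parallel test-time compute,
# not OpenAI's disclosed Pro-tier mechanism.
from concurrent.futures import ThreadPoolExecutor
import random

def solve_once(problem: str, seed: int) -> str:
    """Stub for one independent model sample."""
    rng = random.Random(seed)
    return f"candidate-{rng.randint(0, 99)}"

def score(candidate: str) -> float:
    """Stub for a verifier or reward model."""
    return random.random()

def solve_with_extra_compute(problem: str, n: int = 8) -> str:
    # Spend more compute on the same problem: n parallel attempts,
    # then keep the best-scoring candidate.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: solve_once(problem, s), range(n)))
    return max(candidates, key=score)

print(solve_with_extra_compute("hard problem"))
```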
But at that price, adoption will be limited to edge cases: high-stakes financial data scraping, complex regulatory compliance checks, or mission-critical automation where failure isn’t an option. For most devs, the standard tier will be the proving ground.
The Hardware Behind the Hype
The co-design of GPT-5.5 with NVIDIA’s GB200 and GB300 NVL72 systems isn’t just a marketing footnote. It reflects a deeper industry shift: the blurring line between AI software and infrastructure. NVIDIA’s NVL72 racks, shipping since Q3 2025, link 72 Blackwell GPUs and 36 Grace CPUs into a single NVLink domain, delivering up to 1.4 exaflops of FP4 inference compute per rack. That’s the kind of power required for real-time agentic inference at scale.
By optimizing GPT-5.5 for this hardware, OpenAI gains tighter feedback loops between compiler, kernel, and model. For example, the model uses NVIDIA’s TensorRT-LLM 5.0 runtime, which supports dynamic batching and speculative decoding — features that reduce latency during multi-turn tool use. Early adopters like Stripe and GitHub report 40% lower end-to-end execution time for automated debugging workflows on GB300 clusters versus legacy A100 setups.
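Of the runtime features named above, speculative decoding is the one worth unpacking: a cheap draft model proposes a short run of tokens and the expensive target model verifies them, accepting the agreeing prefix. The toy below shows only the accept-or-correct logic; real runtimes such as TensorRT-LLM batch the verification into a single forward pass and use probabilistic acceptance.

```python
# Toy speculative decoding: draft cheaply, verify with the target,
# accept agreeing tokens and correct at the first mismatch.
def draft_next(prefix: list[str], k: int = 4) -> list[str]:
    """Cheap draft model: guesses the next k tokens."""
    return [f"tok{len(prefix) + i}" for i in range(k)]

def target_next(prefix: list[str]) -> str:
    """Expensive target model: the authoritative next token."""
    return f"tok{len(prefix)}"

def speculative_decode(n_tokens: int) -> list[str]:
    out: list[str] = []
    while len(out) < n_tokens:
        for tok in draft_next(out):
            if tok == target_next(out):        # draft agreed: accept for free
                out.append(tok)
            else:                              # mismatch: take the target's
                out.append(target_next(out))   # token and re-draft from here
                break
    return out[:n_tokens]

print(speculative_decode(8))
```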
This hardware-software alignment gives OpenAI an edge, but it’s not without trade-offs. The optimizations are most effective in private cloud deployments or through Azure-hosted instances, where full-stack control is possible. Third-party providers using older GPU arrays may see diminished returns, especially on long-running agent tasks where memory bandwidth becomes a bottleneck.
Competing Visions: The Agentic Landscape Beyond OpenAI
While OpenAI pushes autonomy, Anthropic is betting on precision. Claude Opus 4.7’s lead on MCP Atlas (79.1%) isn’t a fluke. The benchmark rewards deterministic tool calling, strict input validation, and error propagation — all baked into Anthropic’s Model Context Protocol. That’s no accident. The company collaborated with financial firms like JPMorgan and legal tech startups such as Casetext to design MCP for high-stakes, auditable workflows.
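It’s easiest to see what deterministic tool calling buys in code. The sketch below is not Anthropic’s MCP SDK, just a standard-library illustration of the discipline MCP Atlas rewards: validate every argument and fail loudly rather than guess. The `lookup_filing` tool is invented for the example.

```python
# Strict, schema-validated tool invocation: unknown, missing, or
# mistyped arguments raise immediately so errors propagate cleanly.
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: dict[str, type]   # argument name -> expected type

def invoke(spec: ToolSpec, args: dict) -> str:
    if unknown := set(args) - set(spec.required):
        raise ValueError(f"{spec.name}: unexpected args {unknown}")
    for key, typ in spec.required.items():
        if key not in args:
            raise ValueError(f"{spec.name}: missing required arg '{key}'")
        if not isinstance(args[key], typ):
            raise TypeError(f"{spec.name}: '{key}' must be {typ.__name__}")
    return f"{spec.name} called with {args}"   # stand-in for the real tool

lookup = ToolSpec("lookup_filing", {"ticker": str, "year": int})
print(invoke(lookup, {"ticker": "ACME", "year": 2025}))    # ok
# invoke(lookup, {"ticker": "ACME", "year": "2025"})       # raises TypeError
```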
Meanwhile, Google’s Gemini 1.5 Pro takes a different path. Its latest release, rolled out in March 2026, includes native support for Google Workspace tools — Docs, Sheets, Meet — with audit trails and permission gates. For enterprise teams already embedded in Google’s ecosystem, that integration trumps raw autonomy. And at $2.50 per million output tokens, it undercuts both OpenAI and Anthropic on price.
Then there’s Mistral AI, which launched a closed-loop agent platform called Loop in February 2026. Unlike OpenAI’s monolithic model, Loop uses a multi-agent swarm architecture, where lightweight specialists handle planning, tool use, and validation. In internal testing, it achieved 78.3% on Terminal-Bench 2.0 — close to GPT-5.5 — but at half the token cost. Mistral’s open-weight models also let companies self-host, avoiding vendor lock-in. That’s a growing concern as agentic systems touch sensitive internal data.
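Mistral hasn’t published Loop’s internals, so the decomposition below is only a toy of the general swarm idea: separate lightweight roles for planning, execution, and validation instead of one monolithic model.

```python
# Toy multi-agent "swarm" decomposition; each role would be a small
# specialist model in a real system.
def planner(goal: str) -> list[str]:
    """Lightweight planning specialist (stubbed)."""
    return [f"step 1 of {goal}", f"step 2 of {goal}"]

def executor(step: str) -> str:
    """Lightweight execution specialist (stubbed)."""
    return f"result of ({step})"

def validator(step: str, result: str) -> bool:
    """Independent acceptance check before a result is trusted."""
    return result.startswith("result of")

def run_swarm(goal: str) -> list[str]:
    outputs = []
    for step in planner(goal):
        result = executor(step)
        if not validator(step, result):   # reject and surface bad steps
            raise RuntimeError(f"validation failed on: {step}")
        outputs.append(result)
    return outputs

print(run_swarm("migrate the config format"))
```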
The Bigger Picture: Why Agentic AI Is Now
Agentic AI isn’t just a technical milestone — it’s a response to real market pressure. Gartner estimates that 60% of enterprise software teams are using AI to automate backend tasks in 2026, up from 22% in 2023. The demand isn’t for smarter chat — it’s for doers. Companies like Shopify, Autodesk, and Klarna have publicly committed to cutting 15–30% of routine engineering work by 2027 using AI agents.
This shift is accelerating due to three factors. First, prompt-engineering gains are exhausted: teams have squeezed what efficiency they can from GPT-4-class models, and further improvement requires architectural change. Second, sandboxed environments like Replit’s Ghostwriter and Microsoft’s DevBox now let AI agents execute code safely. Third, cloud providers are offering “agent-native” billing — charging per task, not per token — which aligns economics with outcomes.
But risks remain. A 2025 Stanford study found that autonomous models misused tools in 12% of high-complexity tasks, often escalating privileges or writing to unintended files. OpenAI’s sandboxing helps, but it’s not foolproof. As agents gain access to more systems, the need for real-time oversight — not just post-hoc auditing — becomes urgent.
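Real-time oversight can start small: a policy gate every privileged action must clear before it runs, rather than an audit log read after the fact. The allowlist below is an invented example policy, not a standard.

```python
# Minimal pre-execution policy gate for agent-issued commands.
import shlex

ALLOWED_BINARIES = {"ls", "cat", "grep"}   # example policy, not a standard

def guarded_run(cmd: str) -> None:
    """Check the command against policy before it ever executes."""
    binary = shlex.split(cmd)[0]
    if binary not in ALLOWED_BINARIES:
        raise PermissionError(f"blocked: '{binary}' is not allowlisted")
    print(f"would execute: {cmd}")   # hand off to the real sandbox here

guarded_run("ls -la")                 # passes the gate
# guarded_run("rm -rf /tmp/data")     # raises PermissionError
```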
What This Means For You
If you’re building agent-based workflows — automated research, code generation, or internal ops bots — GPT-5.5 demands testing. The gains on Terminal-Bench and Expert-SWE suggest real progress in autonomy. But don’t assume efficiency gains will offset cost. Run your own workloads. Measure not just token count, but task completion rate, retry frequency, and output usability. A model that saves 30% in tokens but fails 20% more often isn’t a win.
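A minimal harness for that kind of measurement might look like the sketch below; `run_workload` is a stub to replace with calls to whichever model you’re testing.

```python
# Workload-level evaluation skeleton: track outcomes, not just tokens.
from dataclasses import dataclass

@dataclass
class TaskResult:
    completed: bool
    retries: int
    tokens_used: int

def run_workload(model: str, tasks: list[str]) -> list[TaskResult]:
    """Stub: replace with real calls to the model under test."""
    return [TaskResult(completed=True, retries=0, tokens_used=1200)
            for _ in tasks]

def summarize(results: list[TaskResult]) -> dict[str, float]:
    # A cheap run that fails more often can still cost more overall.
    n = len(results)
    return {
        "completion_rate": sum(r.completed for r in results) / n,
        "mean_retries": sum(r.retries for r in results) / n,
        "mean_tokens": sum(r.tokens_used for r in results) / n,
    }

print(summarize(run_workload("gpt-5.5", ["task-a", "task-b"])))
```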
And don’t ignore the competition. Claude Opus 4.7 still leads on MCP Atlas, and its lower price gives it staying power in tool-heavy pipelines. Anthropic’s focus on structured tool use might appeal to teams prioritizing reliability over raw task completion. This isn’t a one-model-fits-all world anymore — it’s a trade-off between agency, cost, and consistency.
OpenAI’s bet is clear: developers will pay more for less handholding. But will they?
Sources: AI News, Artificial Analysis


