GPT-5.5: OpenAI's First Full Retrain Since GPT-4.5 Bets on Agents

GPT-5.5 hits 82.7% on Terminal-Bench 2.0 with a 1M context window and will power OpenAI's upcoming Super App.

Sarah Chen
May 5, 2026

OpenAI's newest flagship is the company's first fully retrained base model since GPT-4.5 — and the pitch isn't a marginal benchmark win. It's a model built from the ground up for agents that operate computers, ship code, and run knowledge work end-to-end.

GPT-5.5 shipped on April 23, 2026 in the API, Codex, and ChatGPT simultaneously. President Greg Brockman framed it as another step toward what OpenAI is now openly calling its "Super App" — a single desktop application bundling ChatGPT, Codex, and a dedicated browser. The model is the engine; the Super App is the chassis.

A retraining, not a refresh

Most of the GPT-5.x line was built by post-training on top of an existing base. GPT-5.5 starts over. OpenAI describes it as the smartest and most intuitive model the lab has produced, and the architecture is explicitly tuned for long-horizon agent work rather than single-turn answers.

That shows up most clearly in coding. With GPT-5.5, Codex can now interact with web apps, test flows, click through pages, capture screenshots, and iterate on what it sees — meaning the agent loop extends well past the terminal. Early users describe it as the first model that holds its own across a 50-file refactor without losing the plot.
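
To make that loop concrete, here is a minimal sketch of an observe-act cycle of the kind described above, using the OpenAI Python SDK and Playwright. The model ID "gpt-5.5", the prompt format, and the single-selector action space are illustrative assumptions, not how Codex is actually wired.

```python
# Sketch of a web-app agent loop: screenshot the page, ask the model what
# to do next, act, and repeat. Assumes the OpenAI Python SDK and Playwright
# are installed; the model ID and prompting scheme are hypothetical.
import base64
from openai import OpenAI
from playwright.sync_api import sync_playwright

client = OpenAI()

def next_action(screenshot_png: bytes, goal: str) -> str:
    """Ask the model for the next step, given the current page screenshot."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-5.5",  # assumed model ID from the launch announcement
        messages=[
            {"role": "system", "content": "You are testing a web app. "
             "Reply with a single CSS selector to click, or DONE."},
            {"role": "user", "content": [
                {"type": "text", "text": f"Goal: {goal}"},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{image_b64}"}},
            ]},
        ],
    )
    return response.choices[0].message.content.strip()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:3000")  # the app under test (placeholder URL)
    for _ in range(20):  # cap the loop so a confused agent can't run forever
        action = next_action(page.screenshot(),
                             goal="Sign up and reach the dashboard")
        if action == "DONE":
            break
        page.click(action)  # act on what the model saw, then observe again
    browser.close()
```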

"GPT-5.5 is better at the behaviors real engineering work depends on — holding context across large systems, reasoning through ambiguous failures, and carrying changes through the surrounding codebase." — OpenAI

The benchmark numbers that matter

Headline scores, all reported by OpenAI on launch day:

Benchmark                 GPT-5.5   GPT-5.4   Claude Opus 4.7
Terminal-Bench 2.0        82.7%     75.1%     69.4%
GDPval (44 occupations)   84.9%     n/a       n/a
OSWorld-Verified          78.7%     n/a       n/a
τ2-bench Telecom          98.0%     n/a       n/a

Three of the four are agent simulations rather than multiple-choice quizzes. OSWorld-Verified drops the model into a real computer environment and asks it to complete tasks. τ2-bench Telecom runs full customer-service workflows with no prompt tuning. GDPval tests well-specified knowledge work across 44 real occupations.

The Terminal-Bench 2.0 jump is the headline. GPT-5.5's 82.7% beats Claude Opus 4.7 by more than 13 points on a benchmark explicitly designed to measure how well a model can finish multi-step terminal tasks autonomously.

Pricing: a real input-cost increase

The bill of materials gets more expensive.

  • Standard API: $5 per 1M input tokens / $30 per 1M output tokens, 2× the input price of GPT-5.4 Standard.
  • GPT-5.5 Pro: $30 / $180 per 1M tokens — same headline rate as GPT-5.4 Pro.
  • Fast mode: 1.5× faster generation at 2.5× the cost.
  • Context window: 1,050,000 tokens, with up to 128,000 output tokens. In Codex, GPT-5.5's context is capped at 400K tokens, but the model is available across Plus, Pro, Business, Enterprise, Edu, and Go plans.

The 2× input price increase is the part that will bite production deployments. For a 1M-token Standard API call, you're now paying $5 in input alone before the model writes a single token of output. OpenAI is betting the agent gains are worth it — and on Terminal-Bench, they probably are.
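
A quick sketch of that arithmetic at the listed Standard rates; the constants below simply mirror the pricing above, and nothing else is assumed.

```python
# Back-of-the-envelope cost check at the listed Standard rates:
# $5 per 1M input tokens, $30 per 1M output tokens.
INPUT_PER_M = 5.00      # USD per 1M input tokens
OUTPUT_PER_M = 30.00    # USD per 1M output tokens
CONTEXT_LIMIT = 1_050_000
MAX_OUTPUT = 128_000

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one Standard API call."""
    assert input_tokens <= CONTEXT_LIMIT and output_tokens <= MAX_OUTPUT
    return (input_tokens / 1_000_000) * INPUT_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PER_M

print(call_cost(1_000_000, 0))        # 5.00 -- $5 in input before any output
print(call_cost(1_000_000, 128_000))  # 8.84 -- a maxed-out call
```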

What "Super App" actually means

GPT-5.5 is the first model OpenAI is publicly tying to its desktop strategy. The Super App combines:

  • ChatGPT — the conversational front door
  • Codex — agentic coding with web app interaction
  • A dedicated browser — an OpenAI-controlled environment for the agent to operate in

This is a direct shot at Microsoft 365 Copilot, Anthropic's computer use, and the long tail of agent startups building on top of OpenAI's own API. If you can do agentic computer use natively in a first-party desktop app with GPT-5.5 as the brain, the value of building yet another wrapper drops fast.

What's missing

OpenAI didn't publish a paper on training compute, dataset composition, or the new pre-training mixture. No discussion of multimodal benchmarks beyond OSWorld. The Super App was teased but isn't shipping yet — and even when it does, it'll likely be Plus-and-up only.

GPT-5.5's reasoning gains also come at a latency cost in default mode. If you need fast responses, Fast mode exists, but you're paying 2.5× for the privilege.

The Bottom Line

GPT-5.5 is a clear step up where it matters most for the next 12 months: agents that finish work without supervision. The Terminal-Bench 2.0 lead over Claude Opus 4.7 isn't subtle, and the OSWorld-Verified score makes a credible case that the model can actually drive a computer.

The harder question is whether the Super App strategy works. OpenAI is now building both the model and the chassis it runs in — and starting to compete directly with the developers paying for its API. That dynamic, more than the benchmarks, is what will define GPT-5.5's first year.