OpenAI just shipped a voice model that thinks before it talks. On May 7, 2026, the company introduced GPT-Realtime-2 alongside two specialized siblings — GPT-Realtime-Translate and GPT-Realtime-Whisper — and quietly took the Realtime API out of beta in the same breath.
The headline is simple: voice agents now have GPT-5-class reasoning, a 128K context window, and the ability to call multiple tools in parallel without going silent on the user. The dead-air problem that has plagued every production voice deployment for the past two years is finally being addressed at the model level.
## What Changed Under the Hood
GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning, which means it can manage interruptions, hold multi-step requests in working memory, and pick up the conversation thread after a tangent. The context window jumped from 32K tokens in GPT-Realtime-1.5 to 128K tokens in this release — a 4x expansion that lets the model carry long support calls or research conversations without forgetting what was said five minutes ago.
The most practically important addition is adjustable reasoning effort. Developers can dial reasoning intensity across five levels:
| Reasoning Level | Use Case |
|---|---|
| minimal | Quick lookups, single-turn replies |
| low (default) | Standard customer support |
| medium | Moderate multi-step workflows |
| high | Complex troubleshooting |
| xhigh | Travel booking, financial reasoning |
The default is low to keep latency tight. The flexibility matters because a "what's my balance?" question doesn't need the same compute as "rebook all three legs of my flight if the second one delays past 7 PM."
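For a sense of what flipping the dial looks like in practice, here is a minimal sketch over the Realtime WebSocket. The `session.update` event is the API's standard config mechanism, but the `reasoning_effort` field name is an assumption borrowed from the Chat Completions parameter of the same name, and the model query string is illustrative.

```python
import asyncio
import json
import os

import websockets  # pip install websockets

# Illustrative endpoint; confirm the model string against the official docs.
URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2"

async def set_reasoning_effort(level: str) -> None:
    headers = {"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"}
    async with websockets.connect(URL, additional_headers=headers) as ws:
        # session.update is the Realtime API's standard config event;
        # `reasoning_effort` is an assumed field name, not documented API.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"reasoning_effort": level},  # minimal | low | medium | high | xhigh
        }))
        first_event = json.loads(await ws.recv())
        print(first_event["type"])  # the server's first event is typically session.created

asyncio.run(set_reasoning_effort("high"))
```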
The model can also call multiple tools in parallel and narrate its progress as it works, so instead of dead air during a multi-step task the user hears a running commentary. You can configure preamble phrases like "let me check that" or "one moment while I look into it" so the agent never sounds frozen during a long tool call.
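Concretely, both behaviors can be wired up in session config. The sketch below uses the Realtime API's existing function-tool schema; the two flight tools are invented for illustration, and the narration is steered through plain instructions since no dedicated preamble field has been documented.

```python
# Session config sketch: parallel-callable tools plus narration instructions.
# Tool names and schemas here are hypothetical examples.
session_update = {
    "type": "session.update",
    "session": {
        "instructions": (
            "While a tool call is running, say a short preamble such as "
            "'let me check that' or 'one moment while I look into it', "
            "then narrate each step as results come back."
        ),
        "tools": [
            {
                "type": "function",
                "name": "get_flight_status",
                "description": "Look up the live status of a flight by number.",
                "parameters": {
                    "type": "object",
                    "properties": {"flight_number": {"type": "string"}},
                    "required": ["flight_number"],
                },
            },
            {
                "type": "function",
                "name": "rebook_flight",
                "description": "Move a reservation onto a new flight.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "reservation_id": {"type": "string"},
                        "new_flight_number": {"type": "string"},
                    },
                    "required": ["reservation_id", "new_flight_number"],
                },
            },
        ],
    },
}
```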
## The Benchmark Numbers
OpenAI's own evals show meaningful gains over the previous generation:
| Benchmark | GPT-Realtime-1.5 | GPT-Realtime-2 |
|---|---|---|
| Big Bench Audio (high reasoning) | 81.4% | 96.6% |
| Audio MultiChallenge (xhigh) | 34.7% | 48.5% |
That 15.2 percentage point Big Bench Audio jump is the kind of generational delta you used to see between full model versions, not point releases. Audio MultiChallenge tests multi-turn instruction following, context integration, and how gracefully a model handles natural speech corrections — i.e., "wait, actually book the 7 PM flight instead." A 13.8-point bump there is the closest thing to a "voice agents now work" announcement OpenAI has ever shipped.
## GPT-Realtime-Translate and GPT-Realtime-Whisper
The two companion models are deliberately not conversational. They are pipes.
GPT-Realtime-Translate translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker. It is not a chatbot. It does not call tools. It converts one audio stream into another in real time. The use case is live interpretation — bilingual customer support, on-stage translation, multilingual sales calls.
GPT-Realtime-Whisper is streaming speech-to-text with controllable latency. The original Whisper was built for completed audio chunks, which made it great for post-call transcription and bad for live captions. The new model is the streaming counterpart — lower delay settings produce earlier partial text, higher settings improve transcript quality.
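A plausible shape for that knob, sketched against today's transcription-session events: `transcription_session.update` and `input_audio_transcription` exist in the current Realtime API, while the `latency` field is a guess at how the new setting might surface.

```python
import json

# Transcription-session config sketch; `latency` is an assumed field name.
event = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-realtime-whisper",
            "latency": "low",  # assumed: lower = earlier partials, higher = cleaner transcript
        },
    },
}

# Send this over an open Realtime WebSocket, then render live captions from
# the incremental conversation.item.input_audio_transcription.delta events.
payload = json.dumps(event)
```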
The split is intentional. If you need a reasoning agent, use GPT-Realtime-2. If you need a translation pipe, use Translate. If you need live captions, use Whisper. OpenAI is explicitly pushing developers away from one-size-fits-all and toward purpose-built session types.
## What It Costs
Pricing is where the strategy gets interesting:
- GPT-Realtime-2: $32 per 1M audio input tokens, $0.40 per 1M cached input tokens, $64 per 1M audio output tokens
- GPT-Realtime-Translate: $0.034 per minute
- GPT-Realtime-Whisper: $0.017 per minute
The cached input price — 80x cheaper than fresh input — is the real tell. OpenAI is signaling that long, contextful voice sessions are the intended deployment pattern, not one-shot calls. Build a system prompt once, cache it, and pay almost nothing for every subsequent turn that re-reads it.
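A back-of-envelope comparison makes the incentive concrete. The token counts below are invented, and every token is priced at the audio rates quoted above for simplicity.

```python
# Prices from the list above, converted to dollars per token.
FRESH_IN = 32.00 / 1_000_000
CACHED_IN = 0.40 / 1_000_000
OUT = 64.00 / 1_000_000

turns = 20             # hypothetical 20-turn support call
prompt_tokens = 4_000  # system prompt re-read on every turn
user_tokens = 300      # fresh audio from the caller per turn
reply_tokens = 500     # model speech per turn

uncached = turns * ((prompt_tokens + user_tokens) * FRESH_IN + reply_tokens * OUT)
cached = turns * (prompt_tokens * CACHED_IN + user_tokens * FRESH_IN + reply_tokens * OUT)

print(f"no caching:   ${uncached:.2f}")  # $3.39
print(f"with caching: ${cached:.2f}")    # $0.86
```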
The per-minute pricing on Translate and Whisper is a different play entirely: a flat predictable cost that competes head-on with ElevenLabs, Deepgram, and the rest of the voice-infrastructure market.
## Two New Voices: Cedar and Marin
On the output side, OpenAI added two new voices — Cedar and Marin — available exclusively through the new models. They join the existing roster but aren't backported to older versions, which is a soft push toward migrating.
With the same release, the Realtime API officially exits beta and is now generally available. This is the milestone developers who held off on building production voice agents have been waiting for: GA means SLAs, predictable deprecation policies, and the kind of stability you can build a business on.
## Why This Matters
Voice agents in 2025 had a credibility problem. They could answer questions, but they couldn't reason through anything. Ask one to "compare the cancellation policies on these three flights and pick the most flexible one," and you got either a polite refusal or 15 seconds of silence followed by a wrong answer.
GPT-Realtime-2 doesn't fix all of that, but it fixes the most visible failure modes: the dead air, the dropped context, the inability to call more than one tool per turn. Combined with the 128K window and the five-tier reasoning dial, this is the first voice model that feels production-credible for workflows beyond simple Q&A.
The competitive pressure on ElevenLabs, Vapi, and Retell just went up considerably. Those companies built their value props around stitching together third-party LLMs and TTS engines into something that felt cohesive. OpenAI just shipped the cohesive thing as a single API call.
## The Bottom Line
GPT-Realtime-2 is the first voice model from OpenAI that earns the word "agent" without scare quotes. The 128K context, GPT-5-class reasoning, parallel tool calls, and tone control are individually useful — together they make the dead-air, forgetful, single-tool voice agents of 2025 feel obsolete. Pair it with Translate and Whisper for the rest of the audio pipeline, and the whole stack now comes from one vendor at one set of prices. If you've been waiting to build a serious voice product, the API just stopped being the reason to wait.


