
Gemini 3.5 Flash: Google's Flash Tier Eats Pro on Agent Benchmarks

Gemini 3.5 Flash outperforms the Pro tier on agent benchmarks with superior speed and efficiency.

May 28, 2026

For two years, the rule was simple: Flash models are cheap. Pro models are smart. Pick your trade-off and pay the bill. Google just broke that rule.

Released at Google I/O on May 19, 2026, Gemini 3.5 Flash is the first Flash-tier model that beats its own Pro counterpart on the benchmarks that actually look like work: agentic coding, tool use, multi-step automation. It also costs three times as much per token as the Flash model it replaces. That contradiction tells you everything about where Google thinks the market is going.

The benchmarks that matter

Google did not bury the numbers. On Terminal-Bench 2.1, the agentic terminal coding suite, 3.5 Flash hits 76.2% against Gemini 3.1 Pro's 70.3%. On MCP Atlas, the multi-step tool-orchestration benchmark, it scores 83.6% — leading Claude Opus 4.7 by 4.5 points and GPT-5.5 by 8.3.

The pattern repeats across the workloads that translate to dollars: Finance Agent v2 jumps from 43.0% to 57.9%, and GDPval-AA climbs from 1314 Elo to 1656. On output speed, the model generates tokens roughly 4x faster than other frontier models.

The Flash brand used to mean "the model you use when latency matters more than answers." With 3.5, it means "the model you use when agents matter more than chat."

It is not a uniform win. On reasoning-only benchmarks where there is no tool loop to exploit, 3.1 Pro still leads. Humanity's Last Exam: 40.2% for 3.5 Flash vs 44.4% for 3.1 Pro. ARC-AGI-2: 72.1% vs 77.1%, in the same order. If your workload is pure deduction with no environment to act on, Pro is still the call.

The pricing reset nobody saw coming

Here is where the story gets uncomfortable. The new pricing is:

Model	Input ($/1M tokens)	Output ($/1M tokens)	Cached input ($/1M tokens)
Gemini 3.5 Flash	$1.50	$9.00	$0.15
Gemini 3.1 Pro	$2.00	$12.00	—
Gemini 3 Flash (prior)	$0.50	$3.00	—

Read it twice. 3.5 Flash is 25% cheaper than 3.1 Pro, but exactly 3x the price of the Flash model it replaces, on both input and output. Google is not subsidizing Flash anymore; it is repositioning it.
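To see what the table means for a single request, here is a minimal Python sketch. The prices are copied from the table; the dictionary keys are illustrative labels rather than official API model IDs, and the 20,000-in / 2,000-out request is a made-up example.

```python
# USD per 1M tokens, copied from the pricing table above.
# The keys are illustrative labels, not official API model IDs.
PRICES = {
    "3.5-flash": {"input": 1.50, "output": 9.00},
    "3.1-pro": {"input": 2.00, "output": 12.00},
    "3-flash": {"input": 0.50, "output": 3.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at list prices (no caching)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 20,000 input tokens, 2,000 output tokens.
    for model in PRICES:
        print(f"{model}: ${request_cost(model, 20_000, 2_000):.4f}")
```

On that hypothetical request, 3.5 Flash comes to about $0.048, against $0.016 on the prior Flash and $0.064 on 3.1 Pro.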

The cached-input rate is the genuine bargain: $0.15 per million tokens, a 90% cut from the standard input rate. For retrieval-augmented workloads where the same system prompt or document context hits the model thousands of times, this is the line item that changes the unit economics.
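A rough way to size that saving, as a Python sketch under simplifying assumptions: the first call pays the standard input rate for the shared context, every later call is billed at the cached rate, and no storage fee or minimum cache size applies (the pricing quoted here does not mention either). The function name and the workload numbers are hypothetical.

```python
STANDARD_INPUT_RATE = 1.50  # USD per 1M input tokens (from the table)
CACHED_INPUT_RATE = 0.15    # USD per 1M cached input tokens (from the table)


def input_cost(shared_tokens: int, unique_tokens: int, calls: int,
               cached: bool = True) -> float:
    """Input-side USD cost over `calls` requests.

    shared_tokens: context repeated on every call (system prompt, documents)
    unique_tokens: per-call tokens that are never cached
    Assumes the first call pays the standard rate for the shared context
    and every later call is billed at the cached rate.
    """
    if calls <= 0:
        return 0.0
    unique = unique_tokens * calls * STANDARD_INPUT_RATE
    if cached:
        shared = (shared_tokens * STANDARD_INPUT_RATE
                  + shared_tokens * (calls - 1) * CACHED_INPUT_RATE)
    else:
        shared = shared_tokens * calls * STANDARD_INPUT_RATE
    return (shared + unique) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 50,000-token shared context, 1,000 calls.
    print(f"cached:   ${input_cost(50_000, 0, 1_000, cached=True):.2f}")
    print(f"uncached: ${input_cost(50_000, 0, 1_000, cached=False):.2f}")
```

For that hypothetical workload, the shared context costs about $7.57 with caching versus $75.00 without.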

One more catch worth flagging: Google bills internal thinking tokens at the output rate. A reasoning-heavy job with thinking: high will run up a bill far larger than the sticker price suggests. The new minimal, low, medium, high knobs are not cosmetic — they are cost controls.
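A small Python sketch of that effect, assuming only what is stated above: thinking tokens are billed at the $9.00 output rate. The token counts are invented for illustration; real thinking usage per level is not given here and will vary by task.

```python
INPUT_RATE = 1.50   # USD per 1M input tokens (from the table)
OUTPUT_RATE = 9.00  # USD per 1M output tokens; thinking is billed at this rate


def cost_with_thinking(input_tokens: int, visible_output_tokens: int,
                       thinking_tokens: int) -> float:
    """USD cost of one request when thinking tokens are billed as output."""
    billed_output = visible_output_tokens + thinking_tokens
    return (input_tokens * INPUT_RATE + billed_output * OUTPUT_RATE) / 1_000_000


if __name__ == "__main__":
    # Hypothetical request: 10,000 input tokens, 1,000 visible output tokens.
    # The thinking-token counts below are made up to show the shape of the bill.
    for thinking in (0, 2_000, 8_000):
        cost = cost_with_thinking(10_000, 1_000, thinking)
        print(f"{thinking:>5} thinking tokens: ${cost:.4f}")
```

In this made-up case, 8,000 thinking tokens take the request from about $0.024 to about $0.096, four times the cost for the same visible answer.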

The context window is the headline

Specs that matter for builders:

1,048,576 input tokens — over one million in a single call
65,536 maximum output tokens
Knowledge cutoff: January 2025
Full multimodal input: text, image, video, audio
Available in the Gemini app, AI Mode in Google Search, the Gemini API, Google Antigravity, and Vertex AI

A 1M-token context plus a 65K-token output ceiling is what an agent needs to read a real codebase, plan a multi-step refactor, and write back the resulting changes in one round trip. That is what the benchmark numbers reflect — and it is the reason the price climbed.
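A quick pre-flight check along those lines, as a Python sketch. The two limits come from the spec list above; the 4-characters-per-token estimate is a rough rule of thumb and an assumption, as is treating the input and output limits as independent. Use the provider's token counter for real budgeting.

```python
MAX_INPUT_TOKENS = 1_048_576   # from the spec list above
MAX_OUTPUT_TOKENS = 65_536     # from the spec list above
CHARS_PER_TOKEN = 4            # rough heuristic, an assumption; tokenizers vary


def estimate_tokens(text: str) -> int:
    """Crude token estimate: characters divided by 4, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def fits_in_one_call(prompt_tokens: int, requested_output_tokens: int) -> bool:
    """True if both the prompt and the requested output fit the stated limits."""
    return (prompt_tokens <= MAX_INPUT_TOKENS
            and requested_output_tokens <= MAX_OUTPUT_TOKENS)


if __name__ == "__main__":
    # Hypothetical codebase of 3,000,000 characters, asking for 40,000 tokens back.
    prompt_tokens = estimate_tokens("x" * 3_000_000)
    print(prompt_tokens, fits_in_one_call(prompt_tokens, 40_000))
```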

Who this is actually for

If you are running a chat product on Flash-tier economics, this release stings. The cheap workhorse you priced your unit margins around just tripled in cost. Google's bet is that you will eat it, because the agent capability is now worth more than the cost delta.

If you are building agents — terminal automation, browser control, multi-tool workflows — this is the moment Flash graduates. MCP Atlas at 83.6% means the failure rate on tool sequences just dropped enough to ship products that previously fell apart in production.

If you are doing pure reasoning research, stay on 3.1 Pro until Gemini 3.5 Pro ships — Google has confirmed it is internal-only today and rolling out in June 2026.

The Bottom Line

Gemini 3.5 Flash is not a faster, cheaper Flash. It is a new product category: a Pro-tier agentic model wearing a Flash badge, priced between the two tiers it disrupts. The naming is misleading on purpose. Google wants developers to default to 3.5 Flash for everything that involves tools, and reserve Pro for the long tail of pure reasoning work.

The real question is what comes next. If 3.5 Flash already beats 3.1 Pro on agent benchmarks, what does 3.5 Pro look like when it lands next month? Either it leaps the frontier again, or it quietly confirms that Flash is now the line that matters — and Pro is the prestige tier nobody actually deploys.

Either answer reshapes the price-per-token wars that defined 2024 and 2025. Watch June.
