AI News 5 min read intermediate

DeepSeek V4-Pro: 75% Price Cut Becomes Permanent

On May 22, 2026, DeepSeek made its 75% promotional discount on V4-Pro permanent rather than letting it expire May 31. New permanent rates: $0.435/M input, $0.87/M output, $0.003625/M cache hit. That puts V4-Pro output roughly 34x cheaper than GPT-5.5 and 17x cheaper than Claude Opus 4.7, while landing within 3-7 points on coding and reasoning benchmarks. The underrated detail is the cache-hit price, which can cut input cost ~88% for agents with stable prefixes. Teams should re-run their build math and route the easy majority of traffic to V4-Pro.

Sarah Chen

Jun 1, 2026

On May 22, 2026, DeepSeek did something the rest of the industry only does with footnotes and asterisks: it made a steep discount permanent. The 75% promotional price cut on DeepSeek-V4-Pro, originally set to expire on May 31 at 15:59 UTC, will not roll back. The promo rate is now the list rate.

That turns a temporary headline into a structural one. And it resets the math for anyone shipping a product that calls a frontier model in a hot path.

What actually changed

DeepSeek launched V4-Pro on April 24, 2026 at $1.74 per million input tokens and $3.48 per million output tokens. It then ran an aggressive 75% promo through May. Instead of letting the rate snap back on June 1, the team locked the discount in.

Here is the new permanent price sheet, per one million tokens:

Token type	Old list	New permanent	Cut
Input (cache miss)	$1.74	$0.435	75%
Input (cache hit)	$0.0145	$0.003625	75%
Output	$3.48	$0.87	75%

The output line is the one that lands on your invoice hardest, because output tokens dominate any agent loop where the model reasons or writes code. Going from $3.48 to $0.87 is not a coupon. It is a new floor.

DeepSeek didn't drop the price. It redrew the curve. Sub-dollar output pricing at the frontier tier is now the baseline, not the outlier.

How it compares to the rest of the shelf

The interesting comparison is not V4-Pro against its old self. It is V4-Pro against everything else on the frontier shelf in mid-2026.

Model	Input ($/MTok)	Output ($/MTok)	SWE-bench Pro
DeepSeek-V4-Pro (new)	$0.435	$0.87	55.4%
GPT-5.5	$5.00	$30.00	58.6%
Claude Opus 4.7	$3.00	$15.00	~62%
DeepSeek-V4-Flash	$0.14	$0.28	~42%

On output tokens, V4-Pro is roughly 34x cheaper than GPT-5.5 and 17x cheaper than Claude Opus 4.7. On capability, it lands within three to seven percentage points of GPT-5.5 on most public coding and reasoning evals. For a large share of production traffic, that gap is invisible to end users and very visible on the bill.

The detail most coverage misses

Everyone quotes the $0.87 output number. Almost nobody explains the cache-hit price of $0.003625 per million tokens — a 90% cut layered on top of the headline cut, and a separate change that took effect April 26.

DeepSeek's prompt cache fires when the prefix of your request is byte-identical to a recent one. For chat agents and retrieval pipelines, that prefix is your system prompt, tool definitions, and instruction scaffolding — typically thousands of tokens that never change between turns. The ratio of input-miss to input-hit is roughly 120:1.

In practice, an assistant with a 6,000-token system prompt handling 100,000 turns a day pays about $270 a day on input without caching. With 90% of those prefix tokens hitting cache, the same workload pays closer to $32 a day. That is an 88% reduction on input cost alone — the difference between the model being a sustainable line item and a luxury one.

Three patterns capture the savings:

Pin the prefix. Keep the system prompt, tool schemas, and few-shot examples in one block at the start of every request.
Stabilize dynamic context. Sort or hash retrieved chunks so small reordering doesn't break the cache.
Warm up on startup. Send one full-prefix request before user traffic arrives.

Why permanent matters more than cheap

A temporary discount is a marketing event. A permanent one is a statement about unit economics. DeepSeek's own framing is that V4-Pro was engineered to cut long-context inference cost — reportedly around a quarter of the per-token compute and a fraction of the memory footprint of its predecessor at long context. That is why the cut sticks: it reflects an efficiency gain being passed through, not margin being burned indefinitely.

The competitive signal is louder than the price itself. 2026 has been a year of margin compression across the board, but most cuts targeted budget tiers. V4-Pro's cut targets the frontier capability band, which is exactly why it reset expectations and the others didn't.

What to do this week

You don't need to migrate everything. You need to route intelligently.

Measure your output-to-input ratio. If output dominates, the savings are large.
Run a 100-sample eval on real traffic, not public benchmarks. Most teams find V4-Pro is good enough for 70–85% of requests.
Route by difficulty. Send the easy majority to V4-Pro, keep a premium model on the hard tail. That captures most of the savings with near-zero quality regression.
Lock your cache prefixes regardless of which model wins.

The Bottom Line

DeepSeek made frontier-tier capability cost less than a dollar per million output tokens — and promised it isn't going away. If you priced an AI feature against GPT-5.5 or Claude Opus 4.7 last quarter and shelved it on cost grounds, the budget you penciled in probably overstates your needs by roughly 4x. The promo flag came off. The discount didn't. The reasonable move is to re-run the build math this week, not next quarter.

deepseek llm ai-models developer-tools benchmarks

More in AI News

AI News

Etched: The $5B Sohu Chip Betting the Transformer Never Dies

Etched, a startup building the transformer-only Sohu inference ASIC, has booked over $1 billion in contracts and reached a $5 billion valuation, with reports of new rounds valuing it up to $20 billion. Sohu hard-wires the transformer graph into silicon on TSMC N4P with 144GB HBM3E, and Etched claims an 8-chip server exceeds 500,000 Llama 70B tokens/sec. No independent benchmarks exist yet.

By Sarah Chen · 5 min · Jul 25, 2026

AI News

Project Perception: Microsoft's Cheaper Rival to Claude Mythos

Microsoft is reportedly developing Project Perception, a multi-model AI security platform that routes vulnerability-scanning tasks across models from Microsoft, OpenAI, and Anthropic to reserve expensive frontier calls for high-value steps. Its pitch is matching Anthropic's Claude Mythos on capability while costing far less. Microsoft has not officially confirmed details, so the news should be treated as a credible report pending benchmarks.

By Sarah Chen · 5 min · Jul 21, 2026

AI News

Inkling: Mira Murati's Thinking Machines Ships Its First Open Model

Thinking Machines Lab, founded by ex-OpenAI CTO Mira Murati, released Inkling on July 15, 2026 — an open-weight mixture-of-experts model with 975B total parameters (41B active), trained on 45 trillion multimodal tokens. The company openly says it isn't the strongest model available; instead it's a customizable foundation enterprises fine-tune via the Tinker platform. The release doubles as an argument that owned, adaptable models beat rented one-size-fits-all APIs.

By Sarah Chen · 5 min · Jul 18, 2026