Claude Opus 4.7: Anthropic's New Flagship Clears SWE-Bench Pro
Anthropic shipped Claude Opus 4.7 on April 16, 2026, and the headline number is hard to argue with: 64.3% on SWE-bench Pro, up from 53.4% on Opus 4.6. That's not a tuning-round improvement. It's the kind of jump that makes you re-read the eval card to make sure you haven't misread the x-axis.
This is Anthropic's direct answer to GPT-5.4 and Gemini 3.1 Pro — both of which briefly held the "most capable publicly available LLM" crown over the past two months. With Opus 4.7, Anthropic narrowly takes it back. The catch: the company also publicly conceded that its unreleased Claude Mythos Preview still beats 4.7 across the board. Anthropic is holding Mythos back over cybersecurity concerns, a story we covered in Claude Mythos.
What's actually new in Opus 4.7
Let's skip the marketing and look at what changed at the model level.
Coding just got measurably better. On SWE-bench Verified, Opus 4.7 hits 87.6%, up from 80.8% on Opus 4.6. Cursor CEO Michael Truell said the jump on their internal CursorBench was 70% vs. 58% — an entire class of tasks moved from "partially reliable" to "you can just ship it." Rakuten reported Opus 4.7 resolves 3x more production tasks than 4.6 on Rakuten-SWE-Bench, with double-digit gains in code and test quality.
Vision went from "fine" to "actually useful." Opus 4.7 can now accept images up to 2,576 pixels on the long edge (~3.75 megapixels), more than three times the prior Claude ceiling. XBOW, which uses Claude for autonomous pentesting, reported Opus 4.7 scoring 98.5% on their visual-acuity benchmark versus 54.5% for Opus 4.6. That's not an incremental bump — that's a regime change for any workflow that depends on reading dense screenshots or diagrams.
There's a new effort level: xhigh. It sits between high and max, giving developers a finer knob for trading latency against reasoning depth. In Claude Code, xhigh is now the default on all plans. Anthropic explicitly recommends starting there for coding work.
Task budgets are in public beta on the API. Developers can now cap token spend at the task level so Claude prioritizes effort across longer runs instead of burning budget on the first sub-goal.
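To make the shape of this concrete, here is a minimal sketch of what a budget-capped request might look like. The field names (`effort`, `task_budget_tokens`) and the model string are assumptions for illustration, not confirmed API parameters; check the official docs before relying on them.

```python
def build_request(prompt: str, budget_tokens: int, effort: str = "xhigh") -> dict:
    """Assemble a chat request capped by a task-level token budget.

    Field names here are hypothetical -- they mirror the announcement's
    description, not a documented schema.
    """
    return {
        "model": "claude-opus-4-7",
        "effort": effort,                     # new level between "high" and "max"
        "task_budget_tokens": budget_tokens,  # cap spend across the whole run
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Refactor the auth module and update its tests",
                    budget_tokens=200_000)
```

The point of the task-level cap is that the model can plan spend across sub-goals instead of exhausting the budget on the first one, so you set one number for the run rather than per call.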
The tokenizer change nobody should ignore
Pricing is unchanged — $5 per million input tokens, $25 per million output tokens, the same as Opus 4.6. But there's a subtle catch buried in the migration guide: Opus 4.7 uses an updated tokenizer that maps the same input to 1.0–1.35x more tokens depending on content type.
That means your real bill can rise even though the per-token rate didn't. Anthropic argues the net effect is favorable because 4.7 thinks more efficiently at each effort level, but they explicitly tell teams to "measure the difference on real traffic." That's good advice. A Finout cost analysis flagged the same issue — the "unchanged price tag" hides a real cost shift for high-volume use.
If you're migrating a production workload, run a shadow evaluation first. Don't flip the model string blind.
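A shadow evaluation can be as simple as replaying logged traffic through both models and diffing pass rates and token usage before cutting over. A sketch of the comparison step, with an illustrative record shape (the field names are mine, not from any harness):

```python
import statistics

def shadow_compare(paired_runs: list[dict]) -> dict:
    """Summarize old-vs-new model results on the same replayed requests.

    Each record: {"old_ok": bool, "new_ok": bool,
                  "old_tokens": int, "new_tokens": int}.
    """
    n = len(paired_runs)
    return {
        "old_pass_rate": sum(r["old_ok"] for r in paired_runs) / n,
        "new_pass_rate": sum(r["new_ok"] for r in paired_runs) / n,
        # Mean per-request token ratio: catches tokenizer inflation directly.
        "token_ratio": statistics.mean(
            r["new_tokens"] / r["old_tokens"] for r in paired_runs
        ),
    }

summary = shadow_compare([
    {"old_ok": True,  "new_ok": True, "old_tokens": 100, "new_tokens": 120},
    {"old_ok": False, "new_ok": True, "old_tokens": 200, "new_tokens": 260},
])
```

If `new_pass_rate` drops on any task class, suspect the stricter instruction following discussed next before blaming the model, and re-check your prompts.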
Instruction following: a double-edged upgrade
Here's a quiet footnote from Anthropic's announcement that will bite anyone migrating in a hurry:
> Opus 4.7 is substantially better at following instructions. Interestingly, this means that prompts written for earlier models can sometimes now produce unexpected results: where previous models interpreted instructions loosely or skipped parts entirely, Opus 4.7 takes the instructions literally.
Translation: every prompt library tuned to squeeze performance out of Opus 4.6 probably has a few lines where the old model was politely ignoring you. 4.7 won't. Re-tune your prompts and harnesses before you call anything a regression.
What the early partners are saying
The testimonial wall Anthropic shipped with 4.7 is unusually long — 28 partners ranging from Cursor and Replit to Vercel, Databricks, Notion, and Harvey. A few quotes stand out:
| Partner | Claim |
|---|---|
| Notion | +14% on multi-step workflows, fewer tokens, a third of the tool errors of Opus 4.6 |
| Databricks | 21% fewer errors on OfficeQA Pro vs. Opus 4.6 |
| Harvey | 90.9% substantive accuracy on BigLaw Bench at high effort |
| Hex | Low-effort Opus 4.7 ≈ medium-effort Opus 4.6 |
| Factory | +10–15% task success on Factory Droids, fewer tool errors |
| CodeRabbit | Recall improved >10% on PR reviews while precision held |
The consistent theme: long-running, multi-step agent work. Opus 4.7 doesn't give up halfway, loops less, and recovers from tool failures that used to stop Opus cold. That's the actual unlock, more than any single benchmark number.
The Mythos shadow
You can't read the 4.7 announcement without noticing the absence. Anthropic keeps pointing at Mythos Preview — a model they say is stronger on capability and safer on alignment — and then saying you can't use it. Every chart in the blog post shows Mythos outperforming 4.7. Every safety chart shows Mythos with lower misaligned-behavior rates.
This is the second time in six months Anthropic has released a model while publicly admitting it has a better one in the drawer. The company's position — that the frontier needs real-world cyber safeguards before a Mythos-class release — is reasonable. But it's a sharp reminder that the public-facing Claude lineup is now intentionally a step behind the frontier. Competitors will notice.
The Bottom Line
Claude Opus 4.7 is a real upgrade, not a version-number shuffle. For coding agents and multi-step automation, it is the strongest generally available model on the market today — narrowly, but measurably. If you're running Claude Code, Cursor, Devin, Notion Agent, or anything that calls tools in a loop, you should test xhigh this week.
Two things to do before you migrate: first, measure token usage on real traffic because the new tokenizer shifts the bill in non-obvious ways; second, re-tune any prompt that previously relied on 4.6 "loosely interpreting" your instructions — 4.7 will do exactly what you asked, which isn't always what you wanted.
And keep an eye on Mythos. The next Anthropic release cycle is going to be the interesting one.

